RHEL Server Hang


This is to acknowledge your support inquiry regarding the sudden hang of the 00011NameServ server.

We will proceed with checking the logs.

To analyze the issue further, kindly provide the following information:

    1. What was the current status of the server?

    2. What was the last activity done to the server before the incident?

    3. Was there a reconfiguration done before the hang-up?

    4. What was the application running on the server?

    5. Was there a generated vmcore after the hang-up?

    6. Were there any storage, network, or other external issues on your infrastructure before the hang-up?

    7. Any additional information that can help us identify the root cause.

Kindly paste these questions together with your answers in your next email.

This is to provide you with the root cause analysis of the sudden reboot of the ROCAPPS1 server.

Findings:

1. The server was restarted at 3:08PM this afternoon.

15:08:25 LINUX RESTART

2. The CPU utilization before the server hang was normal.

00:00:01  CPU   %usr  %nice  %sys  %iowait  %steal  %irq  %soft  %guest  %idle
13:40:01  all  36.38   0.00  0.47     0.01    0.00  0.00   0.07    0.00  63.08
13:50:01  all  34.27   0.00  0.43     0.00    0.00  0.00   0.05    0.00  65.25
14:00:01  all  35.36   0.00  0.43     0.00    0.00  0.00   0.06    0.00  64.14
14:10:01  all  29.99   0.00  0.43     0.01    0.00  0.00   0.06    0.00  69.51
14:20:01  all  33.24   0.00  0.45     0.01    0.00  0.00   0.07    0.00  66.24
14:30:01  all  33.39   0.00  0.43     0.00    0.00  0.00   0.06    0.00  66.12
Average:  all  15.08   0.00  0.41     0.01    0.00  0.00   0.05    0.00  84.44
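For reference, a reading like the one above can be checked mechanically. The following is a minimal sketch, not the procedure we actually used: `lowest_idle` is a hypothetical helper name, and it assumes sar text output with 24-hour timestamps and %idle as the last column, as in the report above.

```shell
#!/bin/sh
# Hypothetical helper: print the lowest %idle among the per-interval "all"
# rows of sar CPU output read from stdin.
# Assumptions: 24-hour timestamps in column 1, %idle in the last column.
# The header row (column 2 is "CPU") and the "Average:" row are skipped.
lowest_idle() {
    awk '$1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $2 == "all" {
             idle = $NF + 0
             if (!seen || idle < min) { min = idle; seen = 1 }
         }
         END { if (seen) printf "%.2f\n", min }'
}
```

It could be fed from a saved daily file, for example `sar -u ALL -f /var/log/sa/sa15 | lowest_idle`, where the file name sa15 is an assumption about where that day's data is stored.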

3. The memory utilization was nominal.

00:00:01  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit
13:40:01  119507936   12603068      9.54    1555220   5013512   2343328     0.88
13:50:01  119404356   12706648      9.62    1556336   5172804   2365456     0.89
14:00:01  119376736   12734268      9.64    1557312   5146532   2353936     0.88
14:10:01  120013272   12097732      9.16    1558468   4793824   2347864     0.88
14:20:01  119611052   12499952      9.46    1559356   5053784   2419584     0.91
14:30:01  119536312   12574692      9.52    1560260   5247776   2345612     0.88
Average:  109370516   22740488     17.21    1511155  15537049   2214713     0.83

4. The swap space was not being utilized.

00:00:01  kbswpfree  kbswpused  %swpused  kbswpcad  %swpcad
13:40:01  134209408          0      0.00         0     0.00
13:50:01  134209408          0      0.00         0     0.00
14:00:01  134209408          0      0.00         0     0.00
14:10:01  134209408          0      0.00         0     0.00
14:20:01  134209408          0      0.00         0     0.00
14:30:01  134209408          0      0.00         0     0.00
Average:  134209408          0      0.00         0     0.00

5. There was no indication of a software issue before the hang.

Jan 15 14:00:46 ROCAPPS1 ds_agent[5333]: BioWaitWithTimeout() - Error on select: operation timed out.
Jan 15 14:00:46 ROCAPPS1 ds_agent[5333]: CHTTPServer::AcceptSSL(10.131.5.103:4120) - timed out while awaiting data from peer.
Jan 15 14:01:16 ROCAPPS1 ds_agent[5333]: 4012|
Jan 15 14:06:46 ROCAPPS1 ds_agent[5333]: BioWaitWithTimeout() - Error on select: operation timed out.
Jan 15 14:06:46 ROCAPPS1 ds_agent[5333]: CHTTPServer::AcceptSSL(10.131.5.103:4120) - timed out while awaiting data from peer.
Jan 15 14:07:16 ROCAPPS1 ds_agent[5333]: 4012|
Jan 15 14:12:47 ROCAPPS1 ds_agent[5333]: BioWaitWithTimeout() - Error on select: operation timed out.
Jan 15 14:12:47 ROCAPPS1 ds_agent[5333]: CHTTPServer::AcceptSSL(10.131.5.103:4120) - timed out while awaiting data from peer.
Jan 15 14:13:17 ROCAPPS1 ds_agent[5333]: 4012|
Jan 15 14:16:47 ROCAPPS1 ds_agent[5333]: BioWaitWithTimeout() - Error on select: operation timed out.
Jan 15 14:16:47 ROCAPPS1 ds_agent[5333]: CHTTPServer::AcceptSSL(10.131.5.103:4120) - timed out while awaiting data from peer.
Jan 15 14:17:17 ROCAPPS1 ds_agent[5333]: 4012|
Jan 15 14:24:47 ROCAPPS1 ds_agent[5333]: BioWaitWithTimeout() - Error on select: operation timed out.
Jan 15 14:24:47 ROCAPPS1 ds_agent[5333]: CHTTPServer::AcceptSSL(10.131.5.103:4120) - timed out while awaiting data from peer.
Jan 15 14:25:17 ROCAPPS1 ds_agent[5333]: 4012|
Jan 15 14:30:47 ROCAPPS1 ds_agent[5333]: BioWaitWithTimeout() - Error on select: operation timed out.
Jan 15 14:30:47 ROCAPPS1 ds_agent[5333]: CHTTPServer::AcceptSSL(10.131.5.103:4120) - timed out while awaiting data from peer.
Jan 15 14:31:17 ROCAPPS1 ds_agent[5333]: 4012|
Jan 15 15:08:34 ROCAPPS1 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Jan 15 15:08:34 ROCAPPS1 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="4937" x-info="http://www.rsyslog.com"] start
Jan 15 15:08:34 ROCAPPS1 kernel: Initializing cgroup subsys cpuset
Jan 15 15:08:34 ROCAPPS1 kernel: Initializing cgroup subsys cpu
Jan 15 15:08:34 ROCAPPS1 kernel: Linux version 2.6.32-279.el6.x86_64 (mockbuild@x86-008.build.bos.redhat.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Wed Jun 13 18:24:36 EDT 2012
Jan 15 15:08:34 ROCAPPS1 kernel: Command line: ro root=UUID=8f056559-6307-4bda-8c20-e485bb9e9c97 intel_iommu=on rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=128M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet

6. We verified that kdump was enabled.

Jan 15 15:08:34 ROCAPPS1 kernel: Command line: ro root=UUID=8f056559-6307-4bda-8c20-e485bb9e9c97 intel_iommu=on rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=128M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet

kdump 0:off 1:off 2:on 3:on 4:on 5:on 6:off
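The first half of this check (memory reserved for the crash kernel) can be repeated on any boot. The following is a minimal sketch; `has_crashkernel` is a hypothetical helper name, and it only looks for a `crashkernel=` parameter in a command line string passed to it.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel command line contains a
# crashkernel= parameter, i.e. memory is reserved for the kdump kernel.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage on the running system (not executed here):
#   has_crashkernel "$(cat /proc/cmdline)" && echo "crashkernel memory reserved"
# The service side can be checked on RHEL 6 with:
#   chkconfig --list kdump
#   service kdump status
```

Note that this only confirms that kdump is configured. It does not confirm that a dump would actually be written for every type of hang.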

7. We confirmed that kernel.panic was not set.

kernel.panic = 0

Cause:

    • Probably not caused by the operating system. Hang-ups on the operating system side usually leave trace logs in /var/log/messages.

    • Probably not caused by an application. Most applications log their errors to /var/log/messages as well.

    • The root cause could not be fully determined because no vmcore dump file was generated.

Recommendation:

1. Configure the kernel.panic option in the kernel settings. Set its value to 60 seconds, so that the server reboots automatically 60 seconds after a kernel panic. A reboot is not necessary to apply this configuration, and running processes will not be affected.

    1. Edit /etc/sysctl.conf.
    2. Append this line to the configuration file:
       kernel.panic = 60
    3. Re-read the sysctl configuration file by running this command:
       sysctl -p
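If you prefer to script the steps above, the following is a minimal sketch under stated assumptions. `set_sysctl_conf` is a hypothetical helper name, not part of any RHEL tool. It replaces an existing uncommented entry for the key or appends one, so running it twice does not leave duplicate lines. The target file must already exist.

```shell
#!/bin/sh
# Hypothetical helper: set "KEY = VALUE" in a sysctl-style configuration file.
# Usage: set_sysctl_conf KEY VALUE FILE
set_sysctl_conf() {
    key=$1 value=$2 file=$3
    tmp=$(mktemp) || return 1
    # Copy every line except an existing uncommented entry for this exact key.
    awk -v k="$key" '{ line = $0
                       sub(/^[ \t]+/, "", line)
                       split(line, parts, /[ \t]*=/)
                       if (parts[1] != k) print $0 }' "$file" > "$tmp" || return 1
    # Append the new setting, then write the result back in place.
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

As root, the change could then be applied with `set_sysctl_conf kernel.panic 60 /etc/sysctl.conf && sysctl -p`. We suggest keeping a backup copy of /etc/sysctl.conf before running it.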

If you have any concerns regarding the provided report, please let us know.

Good day,

Hi,

I just want to ask why a crash dump was not created even though kdump was enabled.

Thanks.

The crash dump was most likely not created because of how the hang was triggered. kdump captures a vmcore only when the kernel panics.

We also checked your kernel settings and noticed that the only panic trigger enabled was panic on oops.

kernel.panic = 0

kernel.panic_on_oops = 1

kernel.softlockup_panic = 0

kernel.unknown_nmi_panic = 0

kernel.panic_on_unrecovered_nmi = 0

kernel.panic_on_io_nmi = 0

kernel.hung_task_panic = 0

vm.panic_on_oom = 0
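To repeat this check later, the listing can be filtered for the triggers that are switched on. The following is a minimal sketch; `enabled_panic_triggers` is a hypothetical helper name, and it assumes input in the "key = value" form shown above.

```shell
#!/bin/sh
# Hypothetical helper: read "key = value" lines on stdin and print the
# panic-related keys whose value is non-zero.
# kernel.panic is skipped on purpose: it is the reboot delay in seconds
# after a panic, not a condition that triggers one.
enabled_panic_triggers() {
    awk -F'[ \t]*=[ \t]*' \
        '$1 ~ /panic/ && $1 != "kernel.panic" && $2 + 0 != 0 { print $1 }'
}
```

On the server it could be fed from the live settings, for example `sysctl -a 2>/dev/null | enabled_panic_triggers`. With the values listed above it would print only kernel.panic_on_oops.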

With these settings, a kernel panic is invoked only when the kernel itself behaves incorrectly (an oops). This is controlled by the kernel.panic_on_oops setting.

It is possible that the system hang was triggered by a hardware issue. That would not have caused a kernel panic, because the hardware-related kernel panic settings were not enabled. These settings are "kernel.unknown_nmi_panic", "kernel.panic_on_unrecovered_nmi", and "kernel.panic_on_io_nmi".

It is also possible that the hang-up was triggered by a CPU soft lockup, meaning that a CPU was held for a long time without being released, which results in a system hang. Whether a soft lockup triggers a kernel panic is controlled by the "kernel.softlockup_panic" setting.

We do not normally recommend enabling these settings, since enabling them may cause frequent reboots of the server.

If you suspect that the reboot was caused by a hardware issue, we recommend enabling all the hardware-related kernel panic settings until a vmcore has been generated.

If there are any findings on the hardware side, we can assist you in configuring these kernel panic settings so that the cause can be verified when the issue recurs.
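As an illustration of what that configuration could look like, the following is a minimal sketch. `print_nmi_panic_commands` is a hypothetical helper name. It only prints the commands for the three hardware-related settings named above and does not change anything itself, so the output can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the sysctl commands that would set
# the three NMI-related panic settings to the given value.
# Usage: print_nmi_panic_commands 1   (1 = enable, 0 = disable)
print_nmi_panic_commands() {
    for key in kernel.unknown_nmi_panic \
               kernel.panic_on_unrecovered_nmi \
               kernel.panic_on_io_nmi; do
        printf 'sysctl -w %s=%s\n' "$key" "$1"
    done
}
```

Settings changed with `sysctl -w` apply immediately but do not persist across a reboot. To switch them off again after a vmcore has been captured, the same commands can be generated with 0.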
