VMCORE | KDUMP


source: https://access.redhat.com/labsinfo/kdumphelper

read this: http://olivermellinuxacademy.com/blog/how-to-configure-kdump-and-generate-a-vmcore-file/

Kdump Helper


Kdump is a reliable kernel crash-dumping mechanism that captures crash dumps for troubleshooting issues like kernel crashes, hangs, and reboots. Setting up Kdump usually requires a series of steps and configurations. The Kdump Helper app is designed to simplify the process and reduce the effort required to set up Kdump on your machines.

Input a minimal amount of information and this app will generate an all-in-one script that sets up Kdump to dump the contents of memory into a dump file called a vmcore. It has two modes: guided mode and manual mode. If you just want to provide minimal information and use default or recommended parameters wherever possible, use guided mode (the default), which walks you through a few steps with clear instructions to help you choose or enter each parameter. If you have a very good understanding of Kdump and want to control as many options as possible, manual mode is the better choice: it lets you adjust all parameters.

Once you click the "Generate" button, the app will generate a script for you to set up Kdump based on the information you provided. Just run it on your server, and after it finishes, your Kdump service will be ready to capture crash dumps.
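For reference, the generated script on a modern system boils down to a handful of steps. The following is a minimal sketch only, assuming a RHEL 7+ system with the kexec-tools package available; the crashkernel size and dump target shown are illustrative choices, not the helper's actual output:

```shell
# Minimal kdump setup sketch (assumes RHEL 7+, run as root).
yum install -y kexec-tools                      # provides the kdump service and tools

# Reserve memory for the dump-capture kernel on every installed kernel.
grubby --update-kernel=ALL --args="crashkernel=auto"

# Dump target and compression are set in /etc/kdump.conf, e.g.:
#   path /var/crash
#   core_collector makedumpfile -l --message-level 1 -d 31

systemctl enable --now kdump                    # capture crash dumps from now on
# A reboot is required for the crashkernel= reservation to take effect.
```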

source: https://access.redhat.com/solutions/284623

Under what situations will the system generate a vmcore file?

Solution Verified - Updated February 14 2014 at 10:57 PM -


Environment

    • Red Hat Enterprise Linux [ALL]

Issue

    • Why is no vmcore file generated when the system automatically reboots?

Resolution

To generate a vmcore, the system has to reboot into the dump-capture kernel, and it will do so only if a system crash is triggered. Trigger points are located in functions such as panic(), die(), die_nmi() and in the sysrq handler (ALT-SysRq-c). There is no trigger point in the reboot path, which is why a vmcore is not generated for an ordinary reboot.

The following conditions will execute a crash trigger point:

    • If a hard lockup is detected and the NMI watchdog is configured, the system will boot into the dump-capture kernel (die_nmi()).

    • If die() is called from a thread with pid 0 or 1, or die() is called inside interrupt context, or die() is called and panic_on_oops is set, the system will boot into the dump-capture kernel.

    • For testing purposes, you can trigger a crash by using ALT-SysRq-c, by running echo c > /proc/sysrq-trigger, or by writing a module that forces a panic. Refer to the following article for details: How can I use the SysRq facility to collect information from a server which has hung?
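Before deliberately crashing a box, it is worth confirming that the dump-capture kernel is actually loaded; otherwise the trigger will just reboot the system with no vmcore. A quick check, assuming a kdump-enabled RHEL system:

```shell
# Prints 1 when a crash kernel is loaded, i.e. a trigger will produce a vmcore.
cat /sys/kernel/kexec_crash_loaded

# On systemd-based releases the kdump service should also be active.
systemctl is-active kdump

# Destructive test: panics the machine immediately. Only run on a test box.
# echo c > /proc/sysrq-trigger
```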

The snippet below shows how the panic is captured and how control is then transferred to kexec/kdump to generate the vmcore.

Raw

NORET_TYPE void panic(const char * fmt, ...)
{
	static DEFINE_SPINLOCK(panic_lock);
	static char buf[1024];
	va_list args;
	long i;

	/*
	 * It's possible to come here directly from a panic-assertion and
	 * not have preempt disabled. Some functions called from here want
	 * preempt to be disabled. No point enabling it later though...
	 *
	 * Only one CPU is allowed to execute the panic code from here. For
	 * multiple parallel invocations of panic, all other CPUs either
	 * stop themself or will wait until they are stopped by the 1st CPU
	 * with smp_send_stop().
	 */
	if (!spin_trylock(&panic_lock))
		panic_smp_self_stop();

	console_verbose();
	bust_spinlocks(1);
	va_start(args, fmt);
	vsnprintf(buf, sizeof(buf), fmt, args);
	va_end(args);
	printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
#ifdef CONFIG_DEBUG_BUGVERBOSE
	dump_stack();
#endif
	/*
	 * If we have crashed and we have a crash kernel loaded let it handle
	 * everything else.
	 * Do we want to call this before we try to display a message?
	 */
	crash_kexec(NULL);    <<<<-------- Here the control is transferred to kexec/kdump.

source: https://access.redhat.com/articles/1406253

Vmcore analysis techniques

Updated April 10 2015 at 2:11 PM -


Initial crash information

Any vmcore analysis will start with the basic header information produced by crash when it starts up.

Raw

      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/907889667/crash/vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Wed Apr  1 10:01:58 2015
      UPTIME: 3 days, 23:54:48
LOAD AVERAGE: 82.30, 23.92, 8.31
       TASKS: 705
    NODENAME: <nodename>
     RELEASE: 2.6.32-279.el6.x86_64
     VERSION: #1 SMP Wed Jun 13 18:24:36 EDT 2012
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 16 GB
       PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
         PID: 5829
     COMMAND: "udisks-daemon"
        TASK: ffff88024f8b1500  [THREAD_INFO: ffff880239c9e000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

The useful information from above includes:

    • The location of the vmlinux and vmcore files. If either of these is not in the standard locations used by retrace-server then make sure that the correct vmcore and corresponding vmlinux file are being loaded. When a vmcore for a custom-built kernel is submitted to retrace-server it will fail to find the correct kernel-debuginfo package, so the correct package will need to be installed manually and the path to the vmcore specified explicitly. If the vmlinux file for the wrong kernel version is used, crash won't necessarily refuse to load it.

    • The '[PARTIAL DUMP]' tag means that some data is missing from the vmcore. In this case it is usually because the retrace-server system strips all zero pages from the vmcore in order to save space. This is usually a good thing since zero pages don't contain any data that cannot be reconstructed when needed. But when following a kernel address into unmapped space there's no way to know if the target address was in a stripped zero page or a non-zeroed page that was not included in the dump.

    • The CPU count. If this is very high (e.g. 64+) then this system may be susceptible to scalability issues (e.g. soft lockups due to severe spinlock contention). If the CPU count is 1 then it may not be a supported configuration and there could be deadlocks with an SMP kernel as a result of serialisation constraints.

    • The date. Check that it is recent and matches the time period of the issue the case was raised for. Sometimes customers provide the wrong vmcore - either an old vmcore or a duplicate of one already supplied and analysed. Note that the date may be converted based on the local timezone information.

    • The uptime. Check how long the system had been running before it panicked. If it was only a matter of minutes then it may be something that happened during the post-boot process and could be reproducible. If the system had been running for a long time before panicking then there's a chance that some other customer has seen the same problem and it may even be fixed. If the uptime is around the magic ~209 days then it could be a known clock underflow issue.

    • The load averages. The load averages represent the count of runnable processes on the system that are waiting for a CPU plus any processes in uninterruptible state. They comprise 3 numbers for the load over the previous 1 minute, 5 minutes and 15 minutes. If the numbers increase to the left then the load is rising, which may indicate the trigger for the issue is recent. If the load is steady then it may indicate the system is stuck. If the load decreases to the left then the system has become idle. If the recent load is high then it is important to distinguish whether the load is due to runnable processes that cannot get a CPU or uninterruptible processes that may be blocked waiting for I/O or some other event.
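The "increasing to the left" check can be automated when triaging many cores. A small sketch using the load averages from the header above; the classification logic here is our own convention, not part of crash:

```shell
# Classify the load trend from the 1/5/15 minute averages in the crash header.
one=82.30 five=23.92 fifteen=8.31

awk -v a="$one" -v b="$five" -v c="$fifteen" 'BEGIN {
    if (a > b && b > c)      print "load trend: rising"    # trigger is recent
    else if (a < b && b < c) print "load trend: falling"   # system went idle
    else                     print "load trend: steady"    # possibly stuck
}'
```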

    • The task count. If this is high (> 2000) there may be issues of resource starvation. There's often little reason to have a high process count. Even on systems with large CPU counts there is only a limited amount of work it can do with a finite number of CPUs. Radically increasing the active process count only causes congestion in the runqueues.

    • The memory size. This is physical RAM and does not include swap. Systems with large memory footprints can expose scalability issues that may not be obvious with less memory.

    • The panic string. This is the first indication that will let us know what the problem is. Typical messages often seen are:

Raw

"Oops: 0000 [#1] SMP" (check log for details)

This indicates the system panicked due to dereferencing a bad address

Raw

"SysRq : Trigger a crashdump"

This indicates the core dump was user initiated with sysrq-c or by echoing c into /proc/sysrq-trigger.

Raw

"kernel BUG at <pathname/filename>:<line number>!"

This is the standard format for a failed BUG check (which is just like an ASSERT but the logic is inverted). The filename and line number will indicate which BUG check failed.

Raw

"Kernel panic - not syncing: softlockup: hung tasks"

The soft lockup detector has found a CPU that has not scheduled the watchdog task within the soft lockup threshold.

Raw

"Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"

The hard lockup detector has found a CPU that has not received any hrtimer interrupts within the hard lockup threshold.

Raw

"Kernel panic - not syncing: hung_task: blocked tasks"

The hung task watchdog has detected at least one task that has been in uninterruptible state for more than the blocked task timeout value (default is 120 seconds).

Raw

"Kernel panic - not syncing: out of memory. panic_on_oom is selected"

The system has run out of memory and swap and has been forced to start killing processes to free up memory (not default behaviour).

Raw

"Kernel panic - not syncing: Out of memory and no killable processes..."

The system has run out of memory and swap and has been killing processes to free up memory but has run out of processes to kill off.

Raw

"Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details."

The HP watchdog is installed and has intercepted an NMI (non-maskable interrupt).

Raw

"Kernel panic - not syncing: NMI IOCK error: Not continuing"

The system received an IO check NMI from the hardware (not a memory parity error) and kernel.panic_on_io_nmi was set (not the default).

Raw

"Kernel panic - not syncing: NMI: Not continuing"

The system received an NMI (either hardware or memory parity error) and kernel.panic_on_unrecovered_nmi was set (not the default).

Raw

"Kernel panic - not syncing: nmi watchdog"

The system received an NMI and either kernel.panic_on_timeout or kernel.panic_on_oops was set (not the default values).

Raw

"Kernel panic - not syncing: Fatal Machine check"

A machine check exception event has been raised for a fatal condition. Check the mce log.

Raw

"Kernel panic - not syncing: Attempted to kill init!"

The init process is the first process to be started and should never exit.

Raw

"Kernel panic - not syncing: GAB: Port h halting system due to client process failure"

This indicates the Veritas Cluster Server GAB heartbeat mechanism has failed to get the necessary response in time.

    • The PID of the process currently executing on the CPU that initiated the panic. If this is 0 then the CPU was idle at the time and has probably panicked within an interrupt handler.

    • The command name of the process currently executing on the CPU that initiated the panic. If this is "swapper" then the CPU was idle at the time and has probably panicked within an interrupt handler.

    • The task address. This is the address of the task_struct for the process that was executing on the CPU that initiated the panic. The thread_info address is the location of the thread_info structure followed by the stack space for the process.

    • The CPU that initiated the panic.

    • The process state of the process that was executing on the CPU that initiated the panic. This should always be TASK_RUNNING since no task should be executing in any other state (but it does happen).

Function argument passing method

On the x86_64 architecture function arguments are passed via six specific registers. The registers are set up by the caller before calling the function and the callee may use them directly. If the callee needs to use additional registers then it must save the current values of the callee-saved registers to the stack before overwriting them, and it must restore the original values from the stack before returning to the caller. Registers that must be backed up before use are: %rbp, %r15, %r14, %r13, %r12 and %rbx. Registers %r10, %r11 and %rax can be used immediately by the callee without needing to back up their values to the stack.

Up to 6 function arguments are passed by these registers:

Raw

1st argument - %rdi
2nd argument - %rsi
3rd argument - %rdx
4th argument - %rcx
5th argument - %r8
6th argument - %r9

If a function has more than 6 arguments then additional arguments beyond the first 6 will be passed via the stack at %rsp, %rsp+8, %rsp+16, etc. The caller will have allocated the necessary stack space for these function arguments before setting them up.

Example setting up a function call with 10 arguments:

Raw

0xffffffffa04dd094 <xfs_da_do_buf+1060>:        mov    -0x78(%rbp),%rdx    // 3rd argument
0xffffffffa04dd098 <xfs_da_do_buf+1064>:        mov    $0x18,%eax
0xffffffffa04dd09d <xfs_da_do_buf+1069>:        mov    -0xb8(%rbp),%rcx    // 4th argument
0xffffffffa04dd0a4 <xfs_da_do_buf+1076>:        mov    -0x88(%rbp),%rdi    // 1st argument
0xffffffffa04dd0ab <xfs_da_do_buf+1083>:        mov    %r15,%rsi           // 2nd argument
0xffffffffa04dd0ae <xfs_da_do_buf+1086>:        mov    %r12d,-0x34(%rbp)
0xffffffffa04dd0b2 <xfs_da_do_buf+1090>:        movq   $0x0,0x18(%rsp)     // 10th argument
0xffffffffa04dd0bb <xfs_da_do_buf+1099>:        cmove  %eax,%r8d           // 5th argument
0xffffffffa04dd0bf <xfs_da_do_buf+1103>:        mov    %r10d,%eax
0xffffffffa04dd0c2 <xfs_da_do_buf+1106>:        mov    %rdx,0x8(%rsp)      // 8th argument
0xffffffffa04dd0c7 <xfs_da_do_buf+1111>:        mov    %rax,-0xa8(%rbp)
0xffffffffa04dd0ce <xfs_da_do_buf+1118>:        mov    -0xa8(%rbp),%rdx    // 3rd argument
0xffffffffa04dd0d5 <xfs_da_do_buf+1125>:        lea    -0x34(%rbp),%rax
0xffffffffa04dd0d9 <xfs_da_do_buf+1129>:        xor    %r9d,%r9d           // 6th argument
0xffffffffa04dd0dc <xfs_da_do_buf+1132>:        movl   $0x0,(%rsp)         // 7th argument
0xffffffffa04dd0e3 <xfs_da_do_buf+1139>:        mov    %rax,0x10(%rsp)     // 9th argument
0xffffffffa04dd0e8 <xfs_da_do_buf+1144>:        callq  0xffffffffa04d2dc0 <xfs_bmapi>

The callee saves off all registers:

Raw

crash> dis xfs_bmapi
0xffffffffa04d2dc0 <xfs_bmapi>:         push   %rbp
0xffffffffa04d2dc1 <xfs_bmapi+1>:       mov    %rsp,%rbp
0xffffffffa04d2dc4 <xfs_bmapi+4>:       push   %r15
0xffffffffa04d2dc6 <xfs_bmapi+6>:       push   %r14
0xffffffffa04d2dc8 <xfs_bmapi+8>:       push   %r13
0xffffffffa04d2dca <xfs_bmapi+10>:      push   %r12
0xffffffffa04d2dcc <xfs_bmapi+12>:      push   %rbx
0xffffffffa04d2dcd <xfs_bmapi+13>:      sub    $0x188,%rsp
0xffffffffa04d2dd4 <xfs_bmapi+20>:      nopl   0x0(%rax,%rax,1)
0xffffffffa04d2dd9 <xfs_bmapi+25>:      mov    0x10(%rbp),%r12d    // 7th argument
0xffffffffa04d2ddd <xfs_bmapi+29>:      mov    0x18(%rbp),%rbx     // 8th argument
...

The function call will have saved the return address to the stack and then it pushes %rbp to the stack before saving the stack pointer %rsp into %rbp. The stack pointer is then adjusted for this function. So there are now two extra values on the stack after the function arguments 7-10. This means that to access them the code uses %rbp but adds an extra 16 bytes to find the correct location in the stack.
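The 16-byte adjustment can be checked with a little arithmetic. Assuming the layout described above (the caller stores argument 7 at (%rsp), then callq pushes the return address and the prologue pushes %rbp), the stack-passed arguments land at these offsets from %rbp:

```shell
# Arguments 7..10 sit 16 bytes (return address + saved %rbp) above the
# caller's %rsp slots, so inside the callee they start at %rbp+0x10.
for n in 7 8 9 10; do
    printf 'argument %d at 0x%x(%%rbp)\n' "$n" $(( 0x10 + 8 * (n - 7) ))
done
```

The first two lines of output match the xfs_bmapi disassembly above: argument 7 at 0x10(%rbp) and argument 8 at 0x18(%rbp).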

Working with register states

When analysing a process in a vmcore we are often interested in the values used to pass to functions in the stack trace. When a system panics an NMI is sent to all CPUs to tell them to save their process state before entering a busy loop. When all CPUs have entered the busy loop the kernel can dump the core to disk knowing that all currently executing processes will have their process states accessible when using crash. A core that is generated externally by saving the state of a virtual machine will not have the register state saved for each of the currently executing processes and this can make it difficult to analyse.

Example showing the register state at the time an exception occurred:

Raw

crash> bt
PID: 641    TASK: ffff8817d15d3500  CPU: 9   COMMAND: "qla2xxx_3_dpc"
 #0 [ffff8817d15d5870] machine_kexec at ffffffff8103111b
 #1 [ffff8817d15d58d0] crash_kexec at ffffffff810b61c2
 #2 [ffff8817d15d59a0] oops_end at ffffffff814de9f0
 #3 [ffff8817d15d59d0] no_context at ffffffff81040cdb
 #4 [ffff8817d15d5a20] __bad_area_nosemaphore at ffffffff81040f65
 #5 [ffff8817d15d5a70] bad_area_nosemaphore at ffffffff81041033
 #6 [ffff8817d15d5a80] __do_page_fault at ffffffff810416ed
 #7 [ffff8817d15d5ba0] do_page_fault at ffffffff814e09fe
 #8 [ffff8817d15d5bd0] page_fault at ffffffff814ddd85
    [exception RIP: scsi_is_host_device+11]
    RIP: ffffffff8134fa1b  RSP: ffff8817d15d5c80  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff880bcf094000  RCX: 0000000000005ee0
    RDX: ffff880bd5b37850  RSI: 0000000000000297  RDI: 0000000000000000
    RBP: ffff8817d15d5c80   R8: 0000000000000006   R9: ffff880bd5b39210
    R10: ffff8817d15d5d18  R11: 0000000000000000  R12: 0000000000000000
    R13: ffff8817d15d5d60  R14: ffff880bd5b39000  R15: ffff8817d15d5e10
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff8817d15d5c88] fc_remote_port_delete at ffffffffa002d701 [scsi_transport_fc]
#10 [ffff8817d15d5cb8] qla2x00_rport_del at ffffffffa0044e1d [qla2xxx]
#11 [ffff8817d15d5cd8] qla2x00_update_fcport at ffffffffa0046f6a [qla2xxx]
#12 [ffff8817d15d5db8] qla2x00_async_login_done at ffffffffa004979b [qla2xxx]
#13 [ffff8817d15d5de8] qla2x00_do_work at ffffffffa003f990 [qla2xxx]
#14 [ffff8817d15d5e88] qla2x00_do_dpc at ffffffffa0040378 [qla2xxx]
#15 [ffff8817d15d5ee8] kthread at ffffffff8108dc46
#16 [ffff8817d15d5f48] kernel_thread at ffffffff8100c1ca

We can see the sequence of instructions that led to this fault by disassembling the code up to the instruction pointer. The instruction pointer (RIP) is listed at the start of the register state dump.

Raw

crash> dis -r scsi_is_host_device+11
0xffffffff8134fa10 <scsi_is_host_device>:       push   %rbp
0xffffffff8134fa11 <scsi_is_host_device+1>:     mov    %rsp,%rbp
0xffffffff8134fa14 <scsi_is_host_device+4>:     nopl   0x0(%rax,%rax,1)
0xffffffff8134fa19 <scsi_is_host_device+9>:     xor    %eax,%eax
0xffffffff8134fa1b <scsi_is_host_device+11>:    cmpq   $0xffffffff81b00e00,0x58(%rdi)

The values of all the registers in use at the time of the exception are shown above. The faulting instruction was dereferencing register %rdi after adding an offset of 0x58 to it. We can see above that register %rdi holds a NULL value, so the code tried to dereference 0x58 as an address and that triggered the exception. Register %rdi carries the first argument to the function and it isn't overwritten before the faulting instruction.

Obtaining the values of function arguments

Often we need to know the values of the arguments passed to functions. There are various methods that can be used depending on which registers are used by the caller and whether the registers get saved by the callee to another location. The easiest method is to see what registers are used to initialise the argument registers before the call and then see if the original registers are saved to the stack with a push operation.

Using the following example, let's say we need to know the file pointer object for the file being written to:

Raw

PID: 19384  TASK: ffff880601cd2080  CPU: 0   COMMAND: "python"
 #0 [ffff8806139c1a40] panic at ffffffff8152939c
 #1 [ffff8806139c1ac0] oops_end at ffffffff8152e0b4
 #2 [ffff8806139c1af0] no_context at ffffffff8104c80b
 #3 [ffff8806139c1b40] __bad_area_nosemaphore at ffffffff8104ca95
 #4 [ffff8806139c1b90] bad_area at ffffffff8104cbbe
 #5 [ffff8806139c1bc0] __do_page_fault at ffffffff8104d36f
 #6 [ffff8806139c1ce0] do_page_fault at ffffffff8152ffde
 #7 [ffff8806139c1d10] page_fault at ffffffff8152d395
    [exception RIP: up_write+17]
    RIP: ffffffff810a4151  RSP: ffff8806139c1dc8  RFLAGS: 00010202
    RAX: 00000000000003d0  RBX: ffff880610c91a00  RCX: 0000000000000000
    RDX: 00000000ffffffff  RSI: 0000000000000200  RDI: 00000000000003d0
    RBP: ffff8806139c1dc8   R8: ffff880614bc9648   R9: 00000000ffffffff
    R10: 0000000000000002  R11: 0000000000000000  R12: ffff880601cccaa0
    R13: 00000000000041c6  R14: 0000000000000001  R15: ffff8804a6d02300
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff8806139c1dd0] attach_task_by_pid at ffffffff810d2742
 #9 [ffff8806139c1e20] cgroup_procs_write at ffffffff810d2838
#10 [ffff8806139c1e40] cgroup_file_write at ffffffff810ce66a
#11 [ffff8806139c1ef0] vfs_write at ffffffff8118e068
#12 [ffff8806139c1f30] sys_write at ffffffff8118ea31
#13 [ffff8806139c1f80] system_call_fastpath at ffffffff8100b072
    RIP: 00007f8331dbf53d  RSP: 00007f832da67e00  RFLAGS: 00010202
    RAX: 0000000000000001  RBX: ffffffff8100b072  RCX: 00007f8332ee3005
    RDX: 0000000000000005  RSI: 00007f8332ee3000  RDI: 0000000000000006
    RBP: 00007f8332ee3000   R8: 00007f832da6b700   R9: 00007f832801fcb9
    R10: 00007f8332cc5580  R11: 0000000000000293  R12: 0000000000000005
    R13: 00007f832805d830  R14: 0000000000000005  R15: 00007f832801fcb4
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

We know that the file pointer is passed to the function vfs_write():

Raw

crash> whatis vfs_write
ssize_t vfs_write(struct file *, const char *, size_t, loff_t *);

It's the first argument so it will be in register %rdi when sys_write() calls vfs_write(). Using the return address for sys_write() we can disassemble the function up to where it calls vfs_write():

Raw

crash> dis -r ffffffff8118ea31
...
0xffffffff8118ea1b <sys_write+59>:      lea    -0x30(%rbp),%rcx    // 4th argument
0xffffffff8118ea1f <sys_write+63>:      mov    %r13,%rdx           // 3rd argument
0xffffffff8118ea22 <sys_write+66>:      mov    %r12,%rsi           // 2nd argument
0xffffffff8118ea25 <sys_write+69>:      mov    %rbx,%rdi           // 1st argument
0xffffffff8118ea28 <sys_write+72>:      mov    %rax,-0x30(%rbp)
0xffffffff8118ea2c <sys_write+76>:      callq  0xffffffff8118dfb0 <vfs_write>

We can see from this that the first argument which is our file pointer has been copied from the original register %rbx into the argument register %rdi. Now we should see what the called function vfs_write() does with register %rbx.

Raw

crash> dis -r ffffffff8118e068
0xffffffff8118dfb0 <vfs_write>:         push   %rbp                // push base pointer to stack
0xffffffff8118dfb1 <vfs_write+1>:       mov    %rsp,%rbp           // save current stack pointer to %rbp
0xffffffff8118dfb4 <vfs_write+4>:       sub    $0x30,%rsp          // adjust stack pointer for this function
0xffffffff8118dfb8 <vfs_write+8>:       mov    %rbx,-0x18(%rbp)    // save %rbx to stack, contains a copy of %rdi
0xffffffff8118dfbc <vfs_write+12>:      mov    %r12,-0x10(%rbp)    // save %r12 to stack, contains a copy of %rsi
0xffffffff8118dfc0 <vfs_write+16>:      mov    %r13,-0x8(%rbp)     // save %r13 to stack, contains a copy of %rdx
...

Register %rbx is saved to the stack at the address in register %rbp - 0x18 bytes. Register %rbp has a copy of the stack pointer after pushing %rbp to the stack.

On entry to this function the stack pointer, %rsp, will be the location where the caller's return address is saved to the stack. This address is in the backtrace output in square brackets before the function name so is 0xffff8806139c1f30. The first instruction pushes %rbp to the stack and this will decrement the stack pointer by 8 bytes so %rsp becomes 0xffff8806139c1f28. The stack pointer is then saved to register %rbp. The register we are interested in is %rbx and this is saved at %rbp - 0x18 bytes which is 0xffff8806139c1f10. Dumping the value at this stack location provides:
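The same walk can be done as plain arithmetic. A sketch reproducing the addresses above with shell arithmetic:

```shell
ret_slot=0xffff8806139c1f30    # stack slot holding sys_write's return address (from bt)
rbp=$(( ret_slot - 8 ))        # 'push %rbp' moves %rsp down 8; 'mov %rsp,%rbp' copies it
rbx_slot=$(( rbp - 0x18 ))     # vfs_write saves %rbx at -0x18(%rbp)
printf 'saved %%rbx at 0x%x\n' "$rbx_slot"
# -> saved %rbx at 0xffff8806139c1f10
```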

Raw

crash> rd 0xffff8806139c1f10
ffff8806139c1f10:  ffff880613a3c5c0                    ........

So our file pointer is at 0xffff880613a3c5c0.

Sometimes it's not this easy. Often the original register is not one of the designated registers that will be saved by the callee or the callee doesn't need to save the register.

For example, here is a case where the 3rd argument is copied from register %rax. Register %rax is a special-purpose register that holds the return value of function calls. So in this case the return value of the call to simple_strtoull() is in %rax and is copied to %rdx.

Raw

crash> dis -r ffffffff810ce66a
...
0xffffffff810ce646 <cgroup_file_write+662>:     callq  0xffffffff81296a00 <simple_strtoull>
0xffffffff810ce64b <cgroup_file_write+667>:     mov    -0x88(%rbp),%rdx
0xffffffff810ce652 <cgroup_file_write+674>:     cmpb   $0x0,(%rdx)
0xffffffff810ce655 <cgroup_file_write+677>:     jne    0xffffffff810ce46b <cgroup_file_write+187>
0xffffffff810ce65b <cgroup_file_write+683>:     mov    %rax,%rdx    // 3rd argument
0xffffffff810ce65e <cgroup_file_write+686>:     mov    %rbx,%rsi    // 2nd argument
0xffffffff810ce661 <cgroup_file_write+689>:     mov    %r12,%rdi    // 1st argument
0xffffffff810ce664 <cgroup_file_write+692>:     callq  *0x88(%rbx)

Since register %rax is not saved to the stack in the call to cgroup_procs_write() we'll need to find some other way to locate this value.

Raw

crash> dis -r ffffffff810d2838
0xffffffff810d2810 <cgroup_procs_write>:        push   %rbp
0xffffffff810d2811 <cgroup_procs_write+1>:      mov    %rsp,%rbp
0xffffffff810d2814 <cgroup_procs_write+4>:      push   %r12
0xffffffff810d2816 <cgroup_procs_write+6>:      push   %rbx
0xffffffff810d2817 <cgroup_procs_write+7>:      nopl   0x0(%rax,%rax,1)
0xffffffff810d281c <cgroup_procs_write+12>:     mov    %rdi,%r12
0xffffffff810d281f <cgroup_procs_write+15>:     mov    %rdx,%rbx    // %rdx saved to %rbx
0xffffffff810d2822 <cgroup_procs_write+18>:     nopw   0x0(%rax,%rax,1)
0xffffffff810d2828 <cgroup_procs_write+24>:     mov    $0x1,%edx    // %rdx overwritten, 3rd argument
0xffffffff810d282d <cgroup_procs_write+29>:     mov    %rbx,%rsi    // %rbx saved to %rsi, 2nd argument
0xffffffff810d2830 <cgroup_procs_write+32>:     mov    %r12,%rdi    // 1st argument
0xffffffff810d2833 <cgroup_procs_write+35>:     callq  0xffffffff810d26a0 <attach_task_by_pid>

We can see above that register %rdx is copied to register %rbx then register %edx (the 32-bit version of %rdx) is overwritten so we cannot track that any further. Register %rbx is further copied to register %rsi before the call to attach_task_by_pid(). Register %rbx is one of the registers that may be saved by the callee so let's check that:

Raw

crash> dis attach_task_by_pid
0xffffffff810d26a0 <attach_task_by_pid>:        push   %rbp
0xffffffff810d26a1 <attach_task_by_pid+1>:      mov    %rsp,%rbp
0xffffffff810d26a4 <attach_task_by_pid+4>:      sub    $0x40,%rsp
0xffffffff810d26a8 <attach_task_by_pid+8>:      mov    %rbx,-0x28(%rbp)
0xffffffff810d26ac <attach_task_by_pid+12>:     mov    %r12,-0x20(%rbp)
0xffffffff810d26b0 <attach_task_by_pid+16>:     mov    %r13,-0x18(%rbp)
0xffffffff810d26b4 <attach_task_by_pid+20>:     mov    %r14,-0x10(%rbp)
0xffffffff810d26b8 <attach_task_by_pid+24>:     mov    %r15,-0x8(%rbp)
...

And register %rbx is saved to the stack here at %rbp - 0x28. The location of the return address of the caller is at 0xffff8806139c1e20, the code pushes %rbp to the stack so that decrements the stack pointer by 8 bytes to become 0xffff8806139c1e18. Then register %rbx is saved at %rbp - 0x28 which is 0xffff8806139c1df0.

Raw

crash> rd 0xffff8806139c1df0
ffff8806139c1df0:  00000000000041c6                    .A......

The value originally returned by the simple_strtoull() function is then 0x41c6.

So sometimes we need to follow a trail to get the value we need.

OTHER RESOURCES:

https://www.veritas.com/support/en_US/article.TECH69923


Problem

What is required to review a vmcore on Red Hat Enterprise Linux and SLES.

Solution

Step 1. Locate the vmcore in /var/crash:

ls -al /var/crash/127.0.0.1-2009-03-11-09\:27/

total 886280

drwx------ 2 root root 4096 Mar 11 09:32 .

drwxr-xr-x 5 netdump netdump 4096 Mar 11 09:31 ..

-rw------- 1 root root 2147127296 Mar 11 09:32 vmcore

Note: In RHEL3 the /boot/System.map is required. In RHEL4 and above the symbols are included in the debug kernel.

In Linux, the System.map file is a symbol table used by the kernel. A symbol table is a look-up between symbol names and their addresses in memory. A symbol name may be the name of a variable or the name of a function. The System.map is required when the address of a symbol name, or the symbol name of an address, is needed. It is especially useful for debugging kernel panics and kernel oops. The kernel does the address-to-name translation itself when CONFIG_KALLSYMS is enabled so that tools like ksymoops are not required.
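A symbol lookup against System.map is just a text search on the name or address column. A sketch using a two-line sample map (addresses taken from the backtrace examples above) rather than a real /boot/System.map:

```shell
# Build a tiny sample System.map and resolve a symbol name to its address.
cat > /tmp/System.map.sample <<'EOF'
ffffffff8118dfb0 T vfs_write
ffffffff8152939c T panic
EOF

awk '$3 == "panic" { print $1 }' /tmp/System.map.sample
# On a real system: awk '$3 == "panic" { print $1 }' /boot/System.map-$(uname -r)
```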

Step 2. Find out which kernel the customer is running. From the VRTSexplorer output you can run `cat uname_a | awk '{print $3}'`:

$ cat uname_a |awk '{print $3}'

2.6.9-67.ELsmp

Note: You can also look at /etc/redhat-release but note that if there is a patched kernel this file may not get updated.

$ cat etc/redhat-release

Red Hat Enterprise Linux AS release 4 (Nahant Update 6)

OTHER REDHAT KNOWLEDGE BASE:

https://access.redhat.com/solutions/23069

https://access.redhat.com/solutions/6038

EOF