Linux System Crashes and vmcore: crash, kdump's Close Companion

King Wang

January 4, 2022
Reposted from https://blog.csdn.net/yuanfang_way/article/details/77987399

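crash is a widely used tool for analyzing Linux kernel crash dump files, and knowing how to use it is essential for locating the cause of a kernel crash. This article first introduces the basic concepts and installation of crash, then explains in detail how to analyze a kernel crash dump with it, including the most commonly used debugging commands, and finally works through a real case from the author's own work to show what crash can do.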

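What is crash

As mentioned previously, when the Linux kernel crashes, mechanisms such as kdump can capture the contents of memory at the time of the crash and write them to a dump file, vmcore. By analyzing this vmcore file, kernel developers can diagnose the cause of the crash and improve the operating system code. crash is a widely used tool for analyzing such kernel crash dump files, and knowing how to use it matters a great deal when tracking down a problem.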

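Prerequisites for using crash

Because crash debugs kernel crash dump files, it depends on the following:

  1. The kernel image file vmlinux must have been compiled with the -g option, that is, with debugging information.
  2. You need a memory crash dump file (for example, vmcore), or live system memory accessible through /dev/mem or /dev/crash. If no dump file is given on the crash command line, crash uses live system memory by default, which requires root privileges.
  3. The processor platforms supported by crash include x86, x86_64, ia64, ppc64, arm, s390, and s390x (some crash versions also support Alpha and 32-bit PowerPC, but long-term maintenance for these two platforms is not guaranteed).
  4. crash supports Linux kernel versions from 2.2.5-15 onward (inclusive). As the Linux kernel evolves, crash is continually updated to keep up with new kernels.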

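Installing crash

To debug a kernel dump file with crash, you need to install the crash tool and the kernel debuginfo packages. Package names differ slightly between distributions; only the names for RHEL and SLES are listed here: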

Table 1. crash tool and kernel debuginfo packages

(The table was an image in the original post and is not preserved here.)

Taking RHEL as an example, install crash and the kernel debuginfo packages as follows:

 rpm -ivh crash-5.1.8-1.el6.ppc64.rpm
 rpm -ivh kernel-debuginfo-common-ppc64-2.6.32-220.el6.ppc64.rpm
 rpm -ivh kernel-debuginfo-2.6.32-220.el6.ppc64.rpm
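The two debuginfo package file names above follow a visible pattern. As a rough sketch, the helper below rebuilds them from a kernel version-release string and an architecture. The function name is made up for illustration, and the naming pattern is only inferred from the single RHEL 6 example shown here; other releases may name their packages differently.

```python
def rhel_debuginfo_packages(version_release, arch):
    """Return the debuginfo RPM file names, in install order.

    Assumption: the pattern is generalized from the one RHEL 6 example
    in this article and is not an official naming rule.
    """
    return [
        f"kernel-debuginfo-common-{arch}-{version_release}.{arch}.rpm",
        f"kernel-debuginfo-{version_release}.{arch}.rpm",
    ]


for name in rhel_debuginfo_packages("2.6.32-220.el6", "ppc64"):
    print("rpm -ivh", name)
```

The common package is listed first because that is the order used in the install commands above.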

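Starting crash

Startup arguments

To debug a dump file, pass crash two arguments on the command line: the debug kernel and the dump file. The dump file is the name of the kernel dump file; the debug kernel is installed by the kernel debuginfo package, and its location differs slightly between distributions. For RHEL and SLES, for example:

 RHEL6.2:   /usr/lib/debug/lib/modules/2.6.32-220.el6.ppc64/vmlinux
 SLES11SP2: /usr/lib/debug/boot/vmlinux-3.0.13-0.27-ppc64.debug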
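To avoid typing these paths by hand, the two locations can be captured as templates. This is a minimal sketch: the function and dictionary names are hypothetical, and the templates are generalized from the two example paths above, so they may not hold for every release of either distribution.

```python
# Path templates generalized from the two examples in this article.
DEBUG_KERNEL_TEMPLATES = {
    "rhel": "/usr/lib/debug/lib/modules/{release}/vmlinux",
    "sles": "/usr/lib/debug/boot/vmlinux-{release}.debug",
}


def debug_kernel_path(distro, release):
    """Return the expected debug kernel path for a distro and kernel release."""
    template = DEBUG_KERNEL_TEMPLATES.get(distro.lower())
    if template is None:
        raise ValueError(f"no known debug kernel location for {distro!r}")
    return template.format(release=release)


print(debug_kernel_path("rhel", "2.6.32-220.el6.ppc64"))
print(debug_kernel_path("sles", "3.0.13-0.27-ppc64"))
```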

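Run crash -h or man crash to see the full set of options that crash supports. The commonly used ones are:

-h: print help information

-d: set the debug level

-S: use /boot/System.map as the default map file

-s: skip the version banner and initial debugging information and go straight to the command prompt

-i file: after startup, automatically run the commands in file, then accept user input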
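When crash is launched from a script, it helps to assemble the argument list in one place. The sketch below uses only the -s and -i options described above; the function name and the command file name cmds.txt are hypothetical, and the code only builds and prints the command line without running it.

```python
import shlex


def build_crash_argv(debug_kernel, dump_file=None, command_file=None, silent=False):
    """Assemble a crash command line as a list of arguments."""
    argv = ["crash"]
    if silent:
        argv.append("-s")                 # skip banner and initial messages
    if command_file is not None:
        argv += ["-i", command_file]      # run these commands after startup
    argv.append(debug_kernel)
    if dump_file is not None:             # omitted: crash uses live memory
        argv.append(dump_file)
    return argv


argv = build_crash_argv(
    "/usr/lib/debug/lib/modules/2.6.32-220.el6.ppc64/vmlinux",
    "vmcore",
    command_file="cmds.txt",              # hypothetical file of crash commands
    silent=True,
)
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run if the script should actually start crash.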

Reading the crash report

When crash starts, it prints a summary report for the dump file, as shown below.

[root@curlylp1 ~]# crash

crash 5.1.8-1.el6
Copyright (C) 2002-2011 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.el6.ppc64/vmlinux
    DUMPFILE: /dev/mem
        CPUS: 2
        DATE: Thu Feb 2 00:31:34 2012
      UPTIME: 58 days, 22:52:43
LOAD AVERAGE: 76.11, 77.40, 77.83
       TASKS: 481
    NODENAME: curlylp1.upt.austin.ibm.com
     RELEASE: 2.6.32-220.el6.ppc64
     VERSION: #1 SMP Wed Nov 9 08:02:37 EST 2011
     MACHINE: ppc64 (5009 Mhz)
      MEMORY: 4 GB
         PID: 30510
     COMMAND: "crash"
        TASK: c00000006ddbe460 [THREAD_INFO: c000000073268000]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)

crash>

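KERNEL: the kernel file that was running when the system crashed

DUMPFILE: the kernel dump file

CPUS: the number of CPUs on the machine

DATE: the time of the system crash

TASKS: the number of tasks in memory when the system crashed

NODENAME: the host name of the crashed system

RELEASE and VERSION: the kernel release and version

MACHINE: the CPU architecture

MEMORY: the physical memory of the crashed host

PANIC: the type of crash. Common types include:

SysRq (System Request): a crash triggered through the magic SysRq key combination, normally used for testing. Running echo c > /proc/sysrq-trigger triggers a system crash.

oops: can be thought of as a kernel-level segmentation fault. When an application performs an illegal memory access, it receives a SIGSEGV signal (an illegal instruction raises SIGILL in the same way); the default behavior is a core dump, although the application can also catch the signal and handle it itself. When the kernel itself makes this kind of mistake, it prints an oops message.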
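Because the summary is a list of "LABEL: value" lines, it is easy to pull into a dictionary when many dumps have to be triaged. The following is a rough sketch rather than a robust parser: the function name is made up, and it assumes that field labels are all upper case and followed by a colon and a space, as in the report above.

```python
def parse_crash_summary(text):
    """Collect the upper-case 'LABEL: value' lines of a crash summary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        # Banner lines such as "License GPLv3+: ..." are skipped because
        # their label is not entirely upper case.
        if sep and key.isupper():
            fields[key] = value.strip()
    return fields


SAMPLE = """\
This is free software: you are free to change and redistribute it.
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.el6.ppc64/vmlinux
    DUMPFILE: /dev/mem
        CPUS: 2
        DATE: Thu Feb 2 00:31:34 2012
LOAD AVERAGE: 76.11, 77.40, 77.83
     RELEASE: 2.6.32-220.el6.ppc64
       STATE: TASK_RUNNING (ACTIVE)
"""

summary = parse_crash_summary(SAMPLE)
print(summary["RELEASE"], summary["CPUS"])
```

Splitting on the first colon followed by a space keeps values such as the DATE timestamp intact, since the colons inside the time are not followed by a space.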

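Overview of crash built-in commands

Once the crash command line is up, built-in commands can print information about the system from just before the crash.

bt – backtrace

The bt command shows the stack and related information from just before the crash. It is one of the most frequently used and most useful commands in system debugging.

Listing 2. bt command output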

crash> bt
PID: 2860 TASK: c0000000677e9550 CPU: 0 COMMAND: "bash"
 R0:  0000000000000001   R1:  c0000000018978b0   R2:  c00000000061c460
 R3:  c000000001897920   R4:  0000000000000000   R5:  0000000000000000
 R6:  0000000000019e07   R7:  0000000000000000   R8:  000000000a000000
 R9:  c000000072938d80   R10: c0000000006b5d58   R11: c000000000740178
 R12: 0000000000000000   R13: c00000000054ea80   R14: 00000000100d0000
 R15: 0000000000000000   R16: 00000000100e2ab8   R17: 00000000100b0000
 R18: 00000000100d0000   R19: 00000000100d0000   R20: 0000000000000000
 R21: 0000000000000000   R22: 00000000100e8a28   R23: 0000000000000000
 R24: 8000000000009032   R25: 0000000000000000   R26: 0000000000000000
 R27: 0000000000000063   R28: 0000000000000006   R29: 0000000000000000
 R30: c00000000058bfe8   R31: c0000000005a5ed0
 NIP: c00000000009d9b0   MSR: 8000000000001032   OR3: c000000001897ab0
 CTR: c00000000028b6ec   LR:  c00000000028b708   XER: 0000000000000005
 CCR: 0000000000000006   MQ:  0000000000000000   DAR: c0000000005a5ed0
 DSISR: c000000001897b10   Syscall Result: 0000000000000000
 NIP [c00000000009d9b0] .crash_kexec
 LR [c00000000028b708] .sysrq_handle_crashdump

 #0 [c0000000018978b0] .crash_kexec at c00000000009d9e0
 #1 [c000000001897a90] .sysrq_handle_crashdump at c00000000028b708
 #2 [c000000001897b10] .__handle_sysrq at c00000000028b1fc
 #3 [c000000001897bc0] .write_sysrq_trigger at c00000000015eadc
 #4 [c000000001897c50] .proc_reg_write at c000000000156670
 #5 [c000000001897cf0] .vfs_write at c0000000000fd490
 #6 [c000000001897d90] .sys_write at c0000000000fdc00
 #7 [c000000001897e30] syscall_exit at c0000000000086a4
 syscall [c00] exception frame:
 R0:  0000000000000004   R1:  00000000ffb6f820   R2:  00000000f7fe95c0
 R3:  0000000000000001   R4:  00000000f7d70000   R5:  0000000000000002
 R6:  0000000000000001   R7:  ffffffffffffffff   R8:  0000000000000000
 R9:  0000000000000000   R10: 0000000000000000   R11: 0000000000000000
 R12: 0000000000000000   R13: 00000000100dc8e8   R14: 00000000100d0000
 R15: 0000000000000000   R16: 00000000100e2ab8   R17: 00000000100b0000
 R18: 00000000100d0000   R19: 00000000100d0000   R20: 0000000000000000
 R21: 0000000000000000   R22: 00000000100e8a28   R23: 0000000000000000
 R24: 0000000000000001   R25: 00000000100e9718   R26: 0000000000000000
 R27: 0000000000000002   R28: 000000000ff703f8   R29: 00000000f7d70000
 R30: 000000000ff6fff4   R31: 0000000000000002
 NIP: 000000000fec0988   MSR: 000000000000d032   OR3: 0000000000000001
 CTR: 000000000fe59270   LR:  000000000fe592dc   XER: 0000000000000000
 CCR: 0000000040242442   MQ:  00000000010b6c30   DAR: 00000000f7d70000
 DSISR: 0000000042000000   Syscall Result: 0000000000000000
crash>

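In the output above, the lines that begin with "#" followed by a number are the call stack, that is, the chain of functions the kernel had called when the system crashed, with the most recent call (#0) at the top. From this you can quickly work out where the kernel crashed.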
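The numbered frame lines have a regular shape, so a saved bt transcript can be reduced to just the call chain. This is a small sketch with a made-up function name; the regular expression is an assumption based on the frame lines shown in this article and may need adjusting for other architectures or crash versions.

```python
import re

# "#<n> [<stack address>] <symbol> at <text address>"
FRAME_RE = re.compile(r"#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)")


def parse_frames(bt_text):
    """Return (frame number, stack address, symbol, text address) tuples."""
    return [
        (int(num), stack, symbol, text)
        for num, stack, symbol, text in FRAME_RE.findall(bt_text)
    ]


BT_SAMPLE = """\
 #0 [c0000000018978b0] .crash_kexec at c00000000009d9e0
 #1 [c000000001897a90] .sysrq_handle_crashdump at c00000000028b708
 #2 [c000000001897b10] .__handle_sysrq at c00000000028b1fc
"""

for num, _stack, symbol, text in parse_frames(BT_SAMPLE):
    print(num, symbol, text)
```

Register dump lines are ignored automatically because they never match the leading "#" followed by a bracketed address.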

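log – dump system message buffer

The log command prints the system message buffer, which may contain clues to the crash. Its output is shown below (some lines have been omitted to save space):

Listing 3. log command output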

crash> log
Crash kernel location must be 0x2000000
Using pSeries machine description
Page orders: linear mapping = 24, virtual = 16, io = 12
Found initrd at 0xc000000001500000:0xc000000001c90400
Partition configured for 2 cpus.
Starting Linux PPC64 #1 SMP Tue Jan 24 20:12:50 EST 2012
-----------------------------------------------------
ppc64_pft_size = 0x19
physicalMemorySize = 0x80000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
htab_hash_mask = 0x3ffff
-----------------------------------------------------
Linux version 2.6.18-307.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Tue Jan 24 20:12:50 EST 2012
[boot]0012 Setup Arch
Node 0 Memory: 0x0-0x80000000

ps – display process status information

The ps command displays process status; a process marked with > is active. Its output is shown below (some lines omitted):

Listing 4. ps command output

crash> ps
    PID   PPID  CPU        TASK        ST  %MEM    VSZ    RSS  COMM
      0      0    0  c00000000054e190  RU   0.0      0      0  [swapper]
      0      1    1  c00000007ff15150  RU   0.0      0      0  [swapper]
      1      0    1  c00000007ff15960  IN   0.1   4672   2688  init
      2      1    0  c00000007ff14940  IN   0.0      0      0  [migration/0]
      3      1    0  c00000007ff14130  IN   0.0      0      0  [ksoftirqd/0]
      4      1    0  c00000007ff13920  IN   0.0      0      0  [watchdog/0]
      5      1    1  c00000007ff13110  IN   0.0      0      0  [migration/1]
      6      1    1  c00000007ff12900  IN   0.0      0      0  [ksoftirqd/1]
      7      1    1  c00000007ff120f0  IN   0.0      0      0  [watchdog/1]
      8      1    0  c00000007ff118e0  IN   0.0      0      0  [events/0]
      9      1    1  c00000007ff1ba20  IN   0.0      0      0  [events/1]
     10      1    1  c00000007ff110d0  IN   0.0      0      0  [khelper]
    139      1    0  c0000000015822f0  IN   0.0      0      0  [kthread]
    143    139    0  c000000001c6eb00  IN   0.0      0      0  [kblockd/0]
    144    139    1  c000000001580ac0  IN   0.0      0      0  [kblockd/1]
    145    139    0  c000000001c6f310  IN   0.0      0      0  [cqueue/0]
    146    139    1  c0000000015802b0  IN   0.0      0      0  [cqueue/1]
    150    139    0  c00000007ff1e270  IN   0.0      0      0  [khubd]
    152    139    0  c00000007ff1ea80  IN   0.0      0      0  [kseriod]
    169      1    1  c000000001c62170  IN   0.0      0      0  [rtasd]
    209    139    0  c00000007f4ca370  IN   0.0      0      0  [khungtaskd]
>  1771      1    1  c000000001c36a80  RU   0.1   4096   2240  syslogd
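With hundreds of tasks in a dump, it is often quicker to filter a saved ps transcript than to read it. The sketch below picks out the rows marked as active. The function name is hypothetical, and the nine-column layout is assumed from the listing above.

```python
def parse_ps(ps_text):
    """Parse rows of crash 'ps' output into dictionaries."""
    tasks = []
    for line in ps_text.splitlines():
        row = line.strip()
        active = row.startswith(">")      # ">" marks an active task
        if active:
            row = row[1:]
        parts = row.split(None, 8)        # COMM is the ninth and last column
        if len(parts) != 9 or not parts[0].isdigit():
            continue                      # header, prompt, or unrelated line
        pid, ppid, cpu, task, state, _mem, _vsz, _rss, comm = parts
        tasks.append({
            "pid": int(pid), "ppid": int(ppid), "cpu": int(cpu),
            "task": task, "state": state, "comm": comm, "active": active,
        })
    return tasks


PS_SAMPLE = """\
    PID   PPID  CPU        TASK        ST  %MEM    VSZ    RSS  COMM
      0      0    0  c00000000054e190  RU   0.0      0      0  [swapper]
      1      0    1  c00000007ff15960  IN   0.1   4672   2688  init
>  1771      1    1  c000000001c36a80  RU   0.1   4096   2240  syslogd
"""

for t in parse_ps(PS_SAMPLE):
    if t["active"]:
        print(t["pid"], t["comm"])
```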

dis – disassemble instructions

The dis command disassembles the code at a given address. Its output is shown below:

Listing 5. dis command output

crash> dis -l c000000000255900
/usr/src/debug/kernel-ppc64-3.0.8/linux-3.0/fs/proc/mmu.c: 47
0xc000000000255900 <.get_vmalloc_info+112>: ld r10,8(r11)

struct – view data structures

The struct command shows the definition of a data structure. With the -o option, each member is prefixed with its byte offset within the structure. Its output is shown below:

crash> struct -o vm_struct
struct vm_struct {
   [0] struct vm_struct *next;
   [8] void *addr;
  [16] long unsigned int size;
  [24] long unsigned int flags;
  [32] struct page **pages;
  [40] unsigned int nr_pages;
  [48] phys_addr_t phys_addr;
  [56] void *caller;
}
SIZE: 64

Case study


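This section uses a system crash that the author encountered on SLES during actual testing as a worked example. kdump was already configured and enabled on that system, so after the crash a vmcore file was generated under the /var/crash/<date of the crash>/ directory. Let us analyze that file.

  1. First, start crash

Listing 6. Starting crash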

# crash vmlinux-3.0.8-0.11-ppc64 vmcore

crash 5.1.9
Copyright (C) 2002-2011 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64-unknown-linux-gnu"...

      KERNEL: vmlinux-3.0.8-0.11-ppc64
    DUMPFILE: vmcore
        CPUS: 40
        DATE: Wed Nov 16 20:17:11 2011
      UPTIME: 10:37:23
LOAD AVERAGE: 60.00, 60.00, 60.00
       TASKS: 811
    NODENAME: eellp1
     RELEASE: 3.0.8-0.11-ppc64
     VERSION: #1 SMP Thu Nov 10 16:28:46 UTC 2011 (3cea58b)
     MACHINE: ppc64 (3550 Mhz)
      MEMORY: 4 GB
       PANIC: "Oops: Kernel access of bad area, sig: 11 [#1]" (check log for details)
         PID: 5563
     COMMAND: "sh"
        TASK: c0000000faac3700 [THREAD_INFO: c0000000f8ce0000]
         CPU: 36
       STATE: TASK_RUNNING (PANIC)

crash>

The kernel version is 3.0.8-0.11-ppc64, a development build of SLES 11 SP2.

  2. Next, use the bt command to look at the stack

Listing 7. bt command

crash> bt
PID: 5563 TASK: c0000000faac3700 CPU: 36 COMMAND: "sh"
 #0 [c0000000f8ce31b0] .crash_kexec at c0000000001039f8
 #1 [c0000000f8ce33b0] .die at c000000000020158
 #2 [c0000000f8ce3450] .bad_page_fault at c000000000045004
 #3 [c0000000f8ce34d0] handle_page_fault at c000000000005ec8
 Data Access error [300] exception frame:
 R0:  0000000000130000   R1:  c0000000f8ce37c0   R2:  c000000000f876d8
 R3:  c000000001224dc8   R4:  0000000000000001   R5:  0000000000000000
 R6:  cfffffffffffffff   R7:  0000000002220000   R8:  2ffffffff1f10000
 R9:  d00000000e0f0000   R10: 0000000000000000   R11: 0000000100000000
 R12: 0000000082002424   R13: c000000001f06c00   R14: 000000001003e270
 R15: 0000000000000001   R16: 0000000000000001   R17: 0000000000000000
 R18: 0000000000000000   R19: c0000000f820b4b8   R20: c0000000f8ce3df8
 R21: c000000000fe2400   R22: 00000fffb53d0000   R23: fffffffffffff000
 R24: 0000000000000400   R25: 000000000000ed99   R26: 0000000000002000
 R27: 0000000000002e58   R28: c000000001224dc8   R29: c000000001224dc0
 R30: c000000000ef2658   R31: c0000000f8ce39a0
 NIP: c000000000255900   MSR: 8000000000009032   OR3: c000000000005278
 CTR: c000000000263a08   LR:  c0000000002558dc   XER: 0000000000000001
 CCR: 0000000022002444   MQ:  0000000000000001   DAR: 0000000100000008
 DSISR: 0000000040000000   Syscall Result: 0000000000000000
 .....
 #4 [c0000000f8ce37c0] .get_vmalloc_info at c000000000255900
 [Link Register ] [c0000000f8ce37c0] .get_vmalloc_info at c0000000002558dc (unreliable)
 #5 [c0000000f8ce3850] .meminfo_proc_show at c000000000263ad8
 #6 [c0000000f8ce3b40] .seq_read at c00000000020aa44
 #7 [c0000000f8ce3c30] .proc_reg_read at c000000000258ccc
 #8 [c0000000f8ce3ce0] .vfs_read at c0000000001dee60
 #9 [c0000000f8ce3d80] .sys_read at c0000000001df06c
#10 [c0000000f8ce3e30] syscall_exit at c0000000000097ec
 syscall [c01] exception frame:
 R0:  0000000000000003   R1:  00000ffff3cceb60   R2:  00000fffb5305c40
 R3:  0000000000000008   R4:  00000fffb53d0000   R5:  0000000000000400
 R6:  0000000000000001   R7:  00000fffb5249f88   R8:  800000000200f032
 R9:  0000000000000000   R10: 0000000000000000   R11: 0000000000000000
 R12: 0000000000000000   R13: 00000fffb50b8110
 NIP: 00000fffb523d0c4   MSR: 800000000200f032   OR3: 0000000000000008
 CTR: 00000fffb51dae70   LR:  00000fffb51daeac   XER: 0000000000000001
 CCR: 0000000044002422   MQ:  0000000000000001   DAR: 00000fffb51dcd60
 DSISR: 0000000040000000   Syscall Result: 00000fffb53d0000
crash>
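  3. Frames #0 to #3 are the kernel's own fault-handling path, so the last call before the crash is "#4 [c0000000f8ce37c0] .get_vmalloc_info at c000000000255900". Now use the dis command to disassemble the code at that address

Listing 8. dis command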

crash> dis -l c000000000255900
/usr/src/debug/kernel-ppc64-3.0.8/linux-3.0/fs/proc/mmu.c: 47
0xc000000000255900 <.get_vmalloc_info+112>: ld r10,8(r11)
  4. The disassembly shows that the problem is at line 47 of mmu.c. Open the corresponding location in the Linux source

Listing 9. Linux source

 21 void get_vmalloc_info(struct vmalloc_info *vmi)
 22 {
 23         struct vm_struct *vma;
 ……
 46         for (vma = vmlist; vma; vma = vma->next) {
 47                 unsigned long addr = (unsigned long) vma->addr;

Use the struct command to look at the data structure

Listing 10. struct command

crash> struct -o vm_struct
struct vm_struct {
   [0] struct vm_struct *next;
   [8] void *addr;
  [16] long unsigned int size;
  [24] long unsigned int flags;
  [32] struct page **pages;
  [40] unsigned int nr_pages;
  [48] phys_addr_t phys_addr;
  [56] void *caller;
}
SIZE: 64
crash>

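Comparing the source with the disassembly, line 47 of the source corresponds to this instruction:

ld r10,8(r11) # load the doubleword at address r11 + 8 into register r10

  5. So r11 should hold the address of a vm_struct, whose addr member sits at offset 8. The exception frame in the bt output shows R11 = 0000000100000000, so use struct again to examine that address

Listing 11. struct command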

crash> struct vm_struct 0000000100000000
struct: invalid kernel virtual address: 0000000100000000
crash>

This shows that the contents of r11 have been corrupted: it no longer points to a vm_struct.
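The register values can be cross-checked with a little arithmetic. The sketch below mirrors the vm_struct layout printed by struct -o using ctypes, then adds the offset of addr to the R11 value from the exception frame. The class name is made up, every pointer and long is modeled as a 64-bit integer on the assumption of a 64-bit ppc64 kernel, and the R11 and DAR constants are copied from the bt output above.

```python
import ctypes


class VmStruct(ctypes.Structure):
    """Model of the vm_struct layout shown by 'struct -o vm_struct'.

    Assumption: pointers and longs are 8 bytes, as on a 64-bit kernel.
    """
    _fields_ = [
        ("next", ctypes.c_uint64),
        ("addr", ctypes.c_uint64),
        ("size", ctypes.c_uint64),
        ("flags", ctypes.c_uint64),
        ("pages", ctypes.c_uint64),
        ("nr_pages", ctypes.c_uint32),
        ("phys_addr", ctypes.c_uint64),
        ("caller", ctypes.c_uint64),
    ]


R11 = 0x0000000100000000   # from the Data Access error exception frame
DAR = 0x0000000100000008   # faulting data address from the same frame

# "ld r10,8(r11)" reads from r11 + 8, which is where vma->addr lives.
effective_address = R11 + VmStruct.addr.offset

print(hex(effective_address), effective_address == DAR)
```

The computed address equals the DAR value, which supports the reading that the fault happened while loading vma->addr through the bad pointer in r11.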

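From this step-by-step analysis, we infer that the problem arose as follows: at line 46 of mmu.c, vma = vma->next fetched a bad address, so that at line 47, addr = (unsigned long) vma->addr triggered a kernel error. Finding the deeper cause still requires analyzing the code logic to track down what corrupted the pointer.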
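The failure mode can be mimicked in a few lines of Python. This is only a toy model, not kernel code: the node addresses and the dictionary standing in for kernel memory are made up for illustration, and only the bad pointer value 0x0000000100000000 is taken from the dump.

```python
class BadPointer(Exception):
    """Raised when the walk dereferences an address that is not mapped."""


def walk_vmlist(memory, head):
    """Return the addr field of every node reachable from head."""
    addrs = []
    vma = head
    while vma:                        # for (vma = vmlist; vma; vma = vma->next)
        if vma not in memory:         # the real kernel takes a page fault here
            raise BadPointer(f"invalid kernel virtual address: {vma:016x}")
        next_ptr, addr = memory[vma]
        addrs.append(addr)            # line 47: vma->addr
        vma = next_ptr                # line 46: vma = vma->next
    return addrs


# Hypothetical node addresses; each entry maps a node to (next, addr).
NODE_A, NODE_B = 0xC000000001000000, 0xC000000001000040
healthy = {
    NODE_A: (NODE_B, 0xD000000000000000),
    NODE_B: (0, 0xD000000000010000),
}
corrupted = dict(healthy)
corrupted[NODE_B] = (0x0000000100000000, 0xD000000000010000)  # bad next pointer

print(len(walk_vmlist(healthy, NODE_A)))
try:
    walk_vmlist(corrupted, NODE_A)
except BadPointer as exc:
    print(exc)
```

In the corrupted list the walk succeeds for both real nodes and fails only on the following iteration, which matches the observation that the bad value was picked up by vma = vma->next and the fault was raised by the next vma->addr access.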

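Common problems

This section lists problems you may run into when using crash, along with how to resolve them.

  1. Missing debuginfo package

Listing 12. Missing debuginfo package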

[root@bondlp1 2012-02-02-01:37]# crash

crash 5.1.8-1.el5
Copyright (C) 2002-2011 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

crash: /boot/vmlinuz-2.6.18-307.el5: no debugging data available
crash: vmlinuz-2.6.18-307.el5.debug: debuginfo file not found
crash: either install the appropriate kernel debuginfo package, or
       copy vmlinuz-2.6.18-307.el5.debug to this machine

When this happens, install the kernel debuginfo package and run crash again.

  2. vmlinux and vmcore versions do not match

Listing 13. vmlinux and vmcore versions do not match

[root@bondlp1 2012-02-02-01:37]# crash /usr/lib/debug/lib/modules/2.6.18-305.el5/vmlinux vmcore

crash 5.1.8-1.el5
Copyright (C) 2002-2011 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64-unknown-linux-gnu"...

WARNING: kernel version inconsistency between vmlinux and dumpfile

please wait... (gathering module symbol data)
WARNING: cannot access vmalloc'd module memory

crash: invalid kernel virtual address: 8000000000b663c8 type: "runqueues entry (per_cpu)"

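This means that the vmlinux you are using does not match the version of the kernel that produced the vmcore. Use the vmlinux from the same kernel version to debug the vmcore file.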
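A mismatch like this can be caught before starting crash by comparing two strings. The sketch below assumes that you already have the "Linux version ..." banner of the crashed kernel, for example from a saved console log, and that the debug kernel lives under a lib/modules/<release>/vmlinux path as in the RHEL examples in this article. Both helper names are hypothetical.

```python
import re


def release_from_banner(banner):
    """Extract the kernel release from a 'Linux version ...' banner line."""
    match = re.search(r"Linux version (\S+)", banner)
    return match.group(1) if match else None


def release_from_debug_path(path):
    """Extract the kernel release from a .../lib/modules/<release>/vmlinux path."""
    match = re.search(r"/lib/modules/([^/]+)/vmlinux$", path)
    return match.group(1) if match else None


banner = "Linux version 2.6.18-307.el5 (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50))"
vmlinux = "/usr/lib/debug/lib/modules/2.6.18-305.el5/vmlinux"

crashed = release_from_banner(banner)
debug = release_from_debug_path(vmlinux)
if crashed != debug:
    print(f"mismatch: dump is from {crashed}, vmlinux is for {debug}")
```

With the values shown, the check reports a mismatch between 2.6.18-307.el5 and 2.6.18-305.el5.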

  3. Incomplete core file

Listing 14. Incomplete core file

[root@bondlp1 2012-02-02-01:37]# crash /usr/lib/debug/lib/modules/2.6.18-305.el5/vmlinux vmcore

crash 5.1.8-1.el5
Copyright (C) 2002-2011 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

WARNING: vmcore: may be truncated or incomplete
         PT_LOAD p_offset: 1610681124
                 p_filesz: 234881024
           bytes required: 1845562148
            dumpfile size: 1638400000

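This message means that the vmcore file is incomplete. There are several possible causes, such as insufficient disk space or a network interruption during a network dump. In this case, a new, complete vmcore has to be captured before analysis can proceed.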
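The numbers in the warning are related by simple arithmetic: the segment needs the file to extend to p_offset + p_filesz, and the dump file is shorter than that. The sketch below reproduces the calculation with the values from the warning; the function name is made up for illustration.

```python
def missing_bytes(p_offset, p_filesz, dumpfile_size):
    """Return how many bytes the dump file falls short of the segment end."""
    required = p_offset + p_filesz
    return max(0, required - dumpfile_size)


# Values copied from the warning above.
P_OFFSET = 1610681124
P_FILESZ = 234881024
DUMPFILE_SIZE = 1638400000

print(P_OFFSET + P_FILESZ)                                # bytes required: 1845562148
print(missing_bytes(P_OFFSET, P_FILESZ, DUMPFILE_SIZE))   # 207162148
```

Here the file is 207162148 bytes short, roughly 198 MiB, which is consistent with a dump that was cut off partway through.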

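Summary

For kernel developers, crash has become an indispensable tool. The kernel is undeniably complex, but with kdump and crash working closely together, many problems become tractable. This article has covered only the basics of crash; further techniques are left for readers to discover and refine in practice.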
