If your kernel decides that something just happened which should not
have, it may 'Oops', which indicates that this is a Bad Thing. With this
Oops, it prints out a so called stack trace. This stack trace can tell
you, if you manage to decode it, where the kernel was when things went
wrong.
A raw stack trace is in the form of addresses via which the kernel
arrived at the offending location. This looks something like this:
Oct 20 19:21:32 hubert kernel: Stack: c8831060 c7643280 c88330dd c7643280 c8838800 c78c6000 c77281e0 c1327e00
Oct 20 19:21:32 hubert kernel: c88330f5 c8862c84 c8831acb c8838800 c8859fe0 c8862c40 00000042 00000001
Oct 20 19:21:32 hubert kernel: c885137c c8861280 c1327e00 000493e0 c78c6000 c8862520 c8862520 c12826c0
This is not much use as these numbers refer to addresses that are specific
to your kernel.
So if you only see numbers in your trace, you first need to decode
these with 'ksymoops'. This is not very hard but does require the System.map
of your running kernel and, if you were running with modules, that you load
the same modules, and in the same order as when the Oops happened. This may
be easy if the Oops did not kill your machine but otherwise it can be a bit
of a burden. With recent 2.5 kernels, this decoding has already been done for
you:
kksymoops
As of Linux 2.5.35 or so, ksymoops functionality is built into the
kernel, which is called 'kksymoops'. Be sure to turn the relevant
kernel option (' Load all symbols for debugging/kksymoops', in the
'Kernel Hacking' menu) on when compiling. Your distribution will surely
have done that for you. This means a slighly bigger kernel image but
does not degrade performance in any way.
After decoding, you get a 'call trace' that looks like this:
Oct 20 19:21:32 hubert kernel: Call Trace:
Oct 20 19:21:32 hubert kernel: [<c8831060>] auth_domain_drop+0x50/0x60 [sunrpc]
Oct 20 19:21:32 hubert kernel: [<c88330dd>] cache_clean+0x16d/0x170 [sunrpc]
Oct 20 19:21:32 hubert kernel: [<c8838800>] auth_domain_cache+0x0/0x60 [sunrpc]
Oct 20 19:21:32 hubert kernel: [<c88330f5>] cache_flush+0x15/0x50 [sunrpc]
Oct 20 19:21:32 hubert kernel: [<c8862c84>] hash_sem+0x0/0x1c [nfsd]
Oct 20 19:21:32 hubert kernel: [<c8831acb>] svcauth_unix_purge+0x1b/0x20 [sunrpc]
Which is pretty nice already. So far, we see that
auth_domain_drop() caused the problem, and that it was called by
cache_clean() which was called by auth_domain_cache() etc. Often this
suffices to find out what went wrong, but sometimes we want more
details yet.
Now to get the line number, the long way
The following is a different example, one that does not involve modules.
This is why we suddenly trace 0xc01cf0d1, and not 'auth_domain_drop()'.
The canonical way to get from this dump to the line number is to
recompile the entire kernel with debugging, which is done by adding
'-g' to the toplevel Makefile CFLAGS, and then executing the following
on the 'vmlinux' uncompressed kernel binary produced:
$ addr2line -e ./vmlinux 0xc01cf0d1
/mnt/linux-2.5.26/include/asm/bitops.h:244
or
$ gdb ./vmlinux
...
(gdb) l *0xc01cf0d1
0xc01cf0d1 is in read_chan (include/asm/bitops.h:244).
(...)
244 return ((1UL << (nr & 31)) & (((const volatile unsigned int *) addr)[nr >> 5])) != 0;
(...)
Note that this even works with raw numbers, without parsing through
(k)symoops!
However, for several reasons this may not be what you want:
- It requires recompiling the entire kernel
- It will not work with modules, as these are not in the vmlinux
file
- Compiling a kernel with -g takes heaps of time
Because of these reasons, the output above is not from the stacktrace
mentioned earlier. That oops occurred in a module and neither addr2line
nor gdb can help us with that.
So, what can we do? Well, besides the raw address where the oops
occurred, 0xc8831060 in this case, the decoded version is kind enough to
provide us with a
relative
location of the problem. This is the second part of the top line above,
'auth_domain_drop+0x50/0x60'. By this, we are told that the offending
address is 0x50 bytes from the start of the function auth_domain_drop()
which is 0x60 bytes long in total.
The quick way
Now that we have a relative location, gdb can tell us where the exact
problem is, provided that we help it a bit. As we are not going to
recompile our entire kernel with debugging some work is required.
Furthermore, we are working on a call trace from a module, even
recompiling the entire kernel would not help as the error is in a
module.
We start out with the closest sources we can get, perferably those from
the same kernel version which caused the oops. The best way of
course is to use the exact tree from which your image was built.
If you try to recreate a copy of a kernel that is already running from
different sources, you
must
read the caveats section below. Not advised.
In our case, we know the problem is in the function auth_domain_drop(),
which we now need to locate. There are several ways of doing this, a
very blunt but effective one is:
$ find . -name "*.c" | xargs -n 500 grep auth_domain_drop
./net/sunrpc/svcauth.c:void auth_domain_drop(struct cache_head *item, struct cache_detail *cd)
./net/sunrpc/svcauth.c: .cache_put = auth_domain_drop,
./net/sunrpc/svcauth.c: auth_domain_drop(&dom->h, &auth_domain_cache);
It will not always be this clear, but in our case the file to watch is
obvious. Now we only need to compile svcauth.c with debugging, and let
gdb loose on that. The easy way to do that is to rm svcauth.o and
rerun 'make' like this:
$ EXTRA_CFLAGS=-g make ./net/sunrpc/svcauth.o
Now we are ready to run gdb:
$ gdb ./net/sunrpc/svcauth.o
(gdb) list *(auth_domain_drop+0x50)
0x1e0 is in auth_domain_drop (net/sunrpc/svcauth.c:147).
(...)
147 authtab[dom->flavour]->domain_release(dom);
And we have the offending line! Note how we asked a
list of a relative address,
auth_domain_drop+0x50, which is the top of our calltrace. I used the
technique described above to report this exact oops and was rewarded with a patch
within 6 hours, so it pays :-)
Important caveats
If you try to do this on a tree that is different from the one that is
currently running, there is a great risk of producing bad results. If
you feel that you really must do it, you need to make sure that you
have:
- The exact .config of the running kernel
- especially the CPU type should be identical
- The exact sources of the running kernel
- Identical gcc
- you can see this with 'dmesg | grep gcc'
When reporting line number derived this way, ALWAYS explain how you
managed to get a line number. Kernel hackers will hate you and get
really upset if you send them on a wild goose chase.
If things are wrong, it may look like this:
(gdb) list *(auth_domain_drop+0x50)
0x250 is in auth_domain_put (net/sunrpc/svcauth.c:160).
Which is decidedly bogus - suddenly our error has moved to another
function, auth_domain_put(). This bad result was arrived at by
recompiling the kernel with gcc 3.0.4 instead of gcc 3.2 that was used
to compile the running kernel.
If you follow the method above on the source tree that was also used to
build the kernel that produced the oops, you should be safe, but even
then, check if you are recompiling using the same compiler that was
used to build the kernel in the first place!
Furthermore, it IS theoretically possible for gcc to output different code
when the -g flag is given, in which case too your results may be unreliable.
In that case, reload the module
with debugging symbols, and see if
the Oops address changes.