Understanding Kernel Oops and Kernel Panic in Linux

Recently, I have seen many articles analyzing Oops, and on a whim, I would like to discuss Oops in conjunction with Panic.

In Linux, an Oops (often called a Kernel Oops) is the error report the kernel produces when it hits a fault it cannot handle normally, such as a null pointer dereference or an out-of-bounds memory access. An Oops does not usually bring the whole system down; it escalates to a Kernel Panic only when the error occurs on a critical path or the system is configured to panic on every Oops. Oops and Panic are both kernel error handling mechanisms, but they differ in triggering conditions, severity, and handling. Below, I analyze them in terms of implementation principles and their differences.

1. Implementation Principles of Oops

1.1 Triggering Conditions

An Oops occurs when the kernel detects an error that is not necessarily fatal to the whole system, such as:

Dereferencing a null pointer (NULL pointer dereference)
Memory access out of bounds (page fault)
Illegal instruction (undefined instruction)
Bugs in kernel modules (drivers)

1.2 Handling Process

When kernel code triggers a fault it cannot resolve (such as a page fault on an invalid address or an illegal instruction), the CPU raises a hardware exception and jumps to the kernel’s exception handling code (such as ARM’s do_DataAbort or x86’s do_page_fault, named exc_page_fault in newer kernels). The kernel’s exception handling process is as follows:

1.2.1 Save Context: Record register state and call stack (Call Trace).

1.2.2 Print Oops Information:

Error type (such as Unable to handle kernel NULL pointer dereference)
Faulting instruction address (PC register)
Call stack (Call Trace)
Memory mapping information (page tables)

1.2.3 Attempt Recovery:

If the error occurs on a recoverable path (such as a system call made on behalf of a user process), the kernel may simply kill the current process, forcing it to exit as if it had received a fatal signal such as SIGKILL.
If the error occurs on a critical path in the kernel (such as interrupt handling or the scheduler), it escalates to a Kernel Panic.
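The recovery decision above can be sketched as a small userspace model. This is not kernel code: struct oops_context and decide_oops_outcome() are hypothetical names made up for illustration. The strings in the comments are the messages the kernel prints in each case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the situation an Oops occurs in. */
struct oops_context {
    bool in_interrupt;   /* the fault happened in interrupt context */
    bool panic_on_oops;  /* value of the kernel.panic_on_oops sysctl */
    int  pid;            /* PID of the faulting task (1 is init) */
};

enum oops_outcome {
    OUTCOME_KILL_TASK,   /* only the current task is terminated */
    OUTCOME_PANIC        /* the Oops escalates to a Kernel Panic */
};

/* Decide what happens once the Oops report has been printed. */
enum oops_outcome decide_oops_outcome(const struct oops_context *ctx)
{
    assert(ctx != NULL);

    if (ctx->in_interrupt)
        return OUTCOME_PANIC;    /* "Fatal exception in interrupt" */
    if (ctx->panic_on_oops)
        return OUTCOME_PANIC;    /* "Fatal exception" */
    if (ctx->pid == 1)
        return OUTCOME_PANIC;    /* "Attempted to kill init!" */

    return OUTCOME_KILL_TASK;    /* the rest of the system keeps running */
}
```

An ordinary process that faults inside a system call falls through all three checks and only that process is killed. The same fault in an interrupt handler has no process that can safely be sacrificed, so it panics.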

1.3 Code Implementation

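The core function of Oops is die(). It is architecture-specific (for example, arch/x86/kernel/dumpstack.c on x86 and arch/arm/kernel/traps.c on ARM), while the helpers oops_enter() and oops_exit() that it relies on are defined in kernel/panic.c. A simplified version looks like this: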

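    /* Simplified sketch; the real die() differs per architecture. */
    void die(const char *str, struct pt_regs *regs, int err)
    {
        oops_enter();
        /* Print registers, call trace, etc. */
        __die(str, err, regs);
        oops_exit();
        /* An Oops in interrupt context cannot be recovered from. */
        if (in_interrupt())
            panic("Fatal exception in interrupt");
        /* The panic_on_oops setting turns every Oops into a panic. */
        if (panic_on_oops)
            panic("Fatal exception");
        /* Otherwise only the current task is terminated
           (make_task_dead() in newer kernels). */
        do_exit(SIGSEGV);
    }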

2. Implementation Principles of Panic

2.1 Triggering Conditions

Panic (Kernel Panic) is a fatal error that causes the system to stop immediately. Common triggering reasons include:

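Corruption of critical kernel data structures (such as a crash in the schedule() function)
Double free of kernel memory
Hardware failure (such as CPU exceptions, memory corruption)
Explicitly calling panic() (for example, when init exits or the root filesystem cannot be mounted). Note that the BUG_ON() macro does not call panic() itself: it raises an Oops, which becomes a Panic only through the escalation rules described in 1.2.3.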

2.2 Handling Process

2.2.1 Print Panic Information:

Error description (such as Kernel panic - not syncing: Attempted to kill init!)

Call stack (Call Trace)

List of loaded modules

2.2.2 Stop All CPUs:

Notify other CPUs to stop running via smp_send_stop().

2.2.3 May Attempt to Dump Memory (kdump):

If a kdump capture kernel has been loaded, the kernel boots into it via kexec, and the capture kernel saves the memory image as a vmcore for later analysis.

2.2.4 System Halt or Reboot:

By default the system halts, but it can be set to reboot automatically after a timeout, for example after 10 seconds with the sysctl kernel.panic=10 (or the boot parameter panic=10).
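As a minimal sketch of how the timeout value is interpreted: zero means halt forever, a positive value means reboot after that many seconds, and a negative value means reboot immediately. The names panic_action_for() and read_int_file() below are hypothetical helpers written for this article. /proc/sys/kernel/panic is the procfs file behind the kernel.panic sysctl.

```c
#include <assert.h>
#include <stdio.h>

enum panic_action {
    PANIC_HALT_FOREVER,    /* timeout == 0: stay halted */
    PANIC_REBOOT_DELAYED,  /* timeout  > 0: reboot after that many seconds */
    PANIC_REBOOT_NOW       /* timeout  < 0: reboot immediately */
};

/* Map a kernel.panic value to the behavior it selects. */
enum panic_action panic_action_for(int timeout)
{
    if (timeout == 0)
        return PANIC_HALT_FOREVER;
    return timeout > 0 ? PANIC_REBOOT_DELAYED : PANIC_REBOOT_NOW;
}

/*
 * Read a single integer from a procfs-style file such as
 * /proc/sys/kernel/panic. Returns 0 on success and -1 on failure;
 * *value is left untouched on failure.
 */
int read_int_file(const char *path, int *value)
{
    FILE *f;
    int parsed;
    int ok;

    assert(path != NULL && value != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%d", &parsed) == 1);
    fclose(f);
    if (!ok)
        return -1;
    *value = parsed;
    return 0;
}
```

On a Linux machine, calling read_int_file("/proc/sys/kernel/panic", &t) and passing t to panic_action_for() shows which of the three behaviors is currently configured.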

2.3 Code Implementation

The core function of Panic is panic() (defined in kernel/panic.c):

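    /* Simplified sketch of panic(); the real function does more. */
    void panic(const char *fmt, ...)
    {
        static char buf[1024];
        va_list args;
        /* Disable interrupts to prevent concurrency issues. */
        local_irq_disable();
        /* Format and print the panic message. */
        va_start(args, fmt);
        vsnprintf(buf, sizeof(buf), fmt, args);
        va_end(args);
        pr_emerg("Kernel panic - not syncing: %s\n", buf);
        /* Stop the other CPUs. */
        smp_send_stop();
        /* Jump into the kdump capture kernel if one is loaded. In the
           real code the order of this step relative to smp_send_stop()
           depends on crash_kexec_post_notifiers. */
        __crash_kexec(NULL);
        /* Reboot if a timeout is configured, otherwise halt forever. */
        if (panic_timeout != 0)
            emergency_restart();
        for (;;)
            ;
    }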

3. Comparison of Oops and Panic

Feature	Oops	Panic
Severity	Non-fatal, may continue running	Fatal, system stops immediately
Triggering Conditions	Non-critical path errors (such as driver bugs)	Critical errors (such as scheduler crashes)
Recoverability	May only kill the current process	Non-recoverable
Log Output	`dmesg` or `/var/log/kern.log`	Screen print + logs
Handling Actions	Print call stack, may continue running	Stop all CPUs, may trigger kdump
Code Call	`die()` → may trigger `panic()`	Directly calls `panic()`
Common Causes	Null pointer, memory out of bounds	Double free, corruption of kernel data structures

4. Key Conclusions

4.1 Oops is a warning, Panic is a death sentence

Oops may allow the system to continue running, while Panic will definitely terminate the system.

4.2 Panic is an escalation of Oops

If panic_on_oops=1, any Oops will trigger Panic.
Oops on critical paths (such as interrupts, schedulers) will directly escalate to Panic.

4.3 Different Debugging Methods

Oops usually requires analyzing dmesg and the call stack.

Panic may require kdump and crash tools to analyze memory dumps.
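When scanning a saved dmesg log, the first step is usually to tell the two kinds of report apart. The sketch below does this by looking for marker strings. classify_log_line() is a hypothetical helper, and the markers are only the ones used in this article's examples; the exact wording of an Oops header varies by architecture and kernel version.

```c
#include <assert.h>
#include <string.h>

enum log_kind { LOG_OTHER, LOG_OOPS, LOG_PANIC };

/* Classify one line of kernel log output by its marker string. */
enum log_kind classify_log_line(const char *line)
{
    assert(line != NULL);

    /* Check for a panic first: it is the more severe event. */
    if (strstr(line, "Kernel panic - not syncing") != NULL)
        return LOG_PANIC;

    /* Typical Oops headers; this list is an assumption, not exhaustive. */
    if (strstr(line, "Oops") != NULL ||
        strstr(line, "Unable to handle kernel") != NULL)
        return LOG_OOPS;

    return LOG_OTHER;
}
```

A line classified as LOG_OOPS points to the call-stack analysis described above, while LOG_PANIC suggests checking whether a vmcore was captured.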

5. Practical Examples

5.1 Oops Example (Driver Dereferencing Null Pointer)

    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    Call Trace:
    [<ffffffff81234567>] my_buggy_driver_write+0x17/0x30 [faulty_module]
    [<ffffffff81187654>] vfs_write+0xa4/0x190

Analysis:

The function my_buggy_driver_write in the driver module faulty_module dereferenced a null pointer.
The system may continue running, but the current process (such as the application writing to that driver) will be killed.
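Each call trace entry has the form address, symbol+offset/size, and optionally the module name in brackets. As a rough illustration, the sketch below splits such a line into its fields with sscanf(). struct trace_entry and parse_trace_entry() are hypothetical names, and the parser assumes the older "[<address>]" layout shown in this example; newer kernels print entries without the address.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct trace_entry {
    unsigned long long address;  /* address inside the function */
    char symbol[64];             /* function name */
    unsigned int offset;         /* offset from the function start */
    unsigned int size;           /* total size of the function */
    char module[64];             /* empty for core kernel symbols */
};

/* Parse one call trace line. Returns 0 on success, -1 otherwise. */
int parse_trace_entry(const char *line, struct trace_entry *out)
{
    int fields;

    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    /* The trailing "[module]" part is optional, so 4 fields suffice. */
    fields = sscanf(line, " [<%llx>] %63[^+]+0x%x/0x%x [%63[^]]]",
                    &out->address, out->symbol,
                    &out->offset, &out->size, out->module);
    return fields >= 4 ? 0 : -1;
}
```

For the first entry above this yields the symbol my_buggy_driver_write, offset 0x17, size 0x30, and module faulty_module. In other words, the faulting instruction is 0x17 bytes into a 0x30-byte function, which is the location to look up in the module's disassembly.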

5.2 Panic Example (Kernel Scheduler Crash)

    Kernel panic - not syncing: Fatal exception in interrupt
    Call Trace:
    [<ffffffff81012345>] schedule+0x25/0x70
    [<ffffffff81023456>] schedule_timeout+0x16/0x20