Linux System 'Emergency Room': A Comprehensive Review of the 'Boot Storm' Triggered by NVIDIA Driver Installation

Introduction

This is a deep troubleshooting guide for Linux users, especially those using distributions like Pop!_OS and Ubuntu. It is based on a real experience of a system ‘emergency’ that lasted several days, triggered by an interruption during the installation of NVIDIA drivers. We will start from the initial ‘unable to boot’ situation and peel back the layers to explore every detail of UEFI boot, LUKS full disk encryption, LVM logical volume management, the initramfs boot mechanism, and the systemd-boot bootloader.

The goal of this article is not only to provide solutions but also to help you establish a systematic way of thinking to handle complex Linux boot issues by reviewing the error messages, diagnostics, and thought processes at each step.

Act One: The Beginning of the Storm – System Crash and Initial Diagnosis

The story begins with a routine CUDA installation. While adding the apt repository and installing CUDA according to NVIDIA’s official tutorial, the installation was unexpectedly interrupted. Upon reboot, the familiar graphical interface was gone, and we were thrown into the cold ‘emergency mode’.

Symptom 1: Endless Emergency Mode Loop

The system displays the message "You are in emergency mode" and suggests running log commands. However, every repair attempt, such as apt upgrade, failed and dropped the system back into this mode.

Symptom 2: Clear Boot Error

The core error in the logs pointed to the boot partition:

kernelstub: ERROR: Could not find a block device for the partition
NoBlockDevError: Couldn't find the block device for /boot/efi

Interpretation: kernelstub (the boot management tool for Pop!_OS) could not find the EFI System Partition (ESP). This is the first ‘break’ in the boot process.

How to Identify My Partitions?

Before attempting any repairs, the first step is to ‘know yourself and your enemy’ by understanding your hard drive’s partition structure. In emergency mode or the terminal of a Live USB, you can use lsblk -f or sudo parted -ls commands.

EFI Partition (/boot/efi): Look for a partition size between 500MB and 1GB, with a filesystem type of vfat (FAT32). In the output of parted, it is usually marked with boot, esp. In our case, it is /dev/nvme0n1p1.
Encrypted Root Partition: This is usually the largest partition on the hard drive. In the output of lsblk -f, its filesystem type will show as crypto_LUKS. In our case, it is /dev/nvme0n1p3.
Recovery Partition: A partition specific to Pop!_OS, usually around 4GB in size, with a filesystem also as vfat, and the label in parted output is recovery. In our case, it is /dev/nvme0n1p2.
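
The identification rules above can be sketched as a small shell helper. This is only an illustration: the function name classify_partitions is hypothetical, and it assumes lsblk is asked for the columns NAME, SIZE, FSTYPE in that order, so that an empty filesystem type falls at the end of the line.

```shell
#!/bin/bash
# Sketch: label each partition by its filesystem type.
# classify_partitions is a hypothetical helper name, not part of any tool.
# Expected input lines: "<name> <size> <fstype>" (fstype may be empty).
classify_partitions() {
    local name size fstype role
    while read -r name size fstype; do
        case "$fstype" in
            vfat)        role="FAT volume: EFI or recovery partition candidate" ;;
            crypto_LUKS) role="LUKS container: unlock with cryptsetup first" ;;
            LVM2_member) role="LVM physical volume: activate with vgchange -ay" ;;
            "")          continue ;;   # whole disks and unformatted partitions
            *)           role="plain filesystem ($fstype)" ;;
        esac
        printf '%s\t%s\t%s\n' "$name" "$size" "$role"
    done
}

# Usage on a real machine:
#   lsblk -rno NAME,SIZE,FSTYPE | classify_partitions
```

Two vfat partitions are expected on Pop!_OS (EFI and recovery), so size and the parted flags are still needed to tell them apart.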

Act Two: Emergency Scene – The ‘Paralysis’ of `initramfs`

After identifying the partitions, we attempted to manually mount the EFI partition in emergency mode but encountered deeper failures.

FAT-fs (nvme0n1p1): IO charset iso8859-1 not found

This error means the kernel could not load the NLS character-set module for iso8859-1, which the vfat driver needs to mount the partition. In other words, the emergency mode micro-system itself is damaged and lacks kernel modules required to read and write the EFI partition, so repairs cannot be completed from within emergency mode.

Sometimes, the system drops directly into an even more limited (initramfs) command line and throws a fatal error:

ALERT! UUID=... does not exist. Dropping to a shell!

This also confirms that the initramfs image is damaged, as the boot script inside cannot find the correct root partition address, leading to a complete interruption of the boot process.
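
One way to check what the boot script is complaining about is to compare the UUID from the ALERT message with what blkid reports. The helper below is a minimal sketch: the name uuid_present is hypothetical, and it assumes the usual blkid line format in which each field looks like UUID="...".

```shell
#!/bin/bash
# Sketch: succeed if a filesystem UUID appears in blkid-style output on stdin.
# uuid_present is a hypothetical helper name.
uuid_present() {
    local wanted="$1"
    # The leading space keeps PARTUUID="..." fields from matching.
    grep -qF " UUID=\"$wanted\""
}

# Usage (replace the placeholder with the UUID shown in the ALERT line):
#   blkid | uuid_present "<uuid-from-alert>" && echo "device is visible"
```

If the UUID is absent, either the device has not been unlocked or activated yet, or the environment lacks what it needs to reach it, which is consistent with a damaged initramfs.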

Core Cause: All these symptoms point to the same culprit – an incomplete NVIDIA driver/CUDA installation that generated a corrupted initramfs boot image.

Act Three: Detective Work – Debugging Complex Encrypted Partitions

Before entering the final repair process, a key step is to successfully mount the main system partition in the Live USB environment. This process is itself a piece of ‘detective work’, in which we decipher the error messages and peel back the ‘encrypted-LVM’ composite structure of the hard drive.

First Attempt: Direct Mounting

We first tried the most straightforward mount command:

sudo mount /dev/nvme0n1p3 /mnt

Immediately, we encountered the first clue:

mount: /mnt: unknown filesystem type 'crypto_LUKS'.

Clue Interpretation: The system clearly tells us that /dev/nvme0n1p3 is not a directly mountable filesystem, but a crypto_LUKS encrypted volume. Like a locked safe, we cannot open it directly; we must first unlock it with a key.

Second Attempt: Unlocking the Encryption Layer

Based on the clue, we used the correct ‘key’, the cryptsetup tool, to unlock the volume:

sudo cryptsetup luksOpen /dev/nvme0n1p3 unlocked_root

After entering the password, we confidently tried to mount the newly appeared virtual device /dev/mapper/unlocked_root, but received the second clue:

mount: /mnt: unknown filesystem type 'LVM2_member'.

Clue Interpretation: This error reveals a deeper structure. The unlocked device is still not the final filesystem, but an LVM2_member (LVM physical volume). This indicates that what is inside the ‘safe’ is not directly usable files, but another ‘filing cabinet system’ (LVM).

Final Solution: Activate LVM and Mount

With this clue, we know we must first get the system to recognize and activate this ‘filing cabinet’ to access the final files.

# Activate LVM logical volumes
sudo vgchange -ay
# Mount the root partition logical volume in LVM
sudo mount /dev/mapper/data-root /mnt

This time, the mount finally succeeded. By following the clues like a detective, we successfully completed the entire process of ‘unlocking the safe -> activating the filing cabinet -> retrieving the files.’

Act Four: Ultimate Rescue – Live USB ‘Aseptic Surgery’

Since the system cannot rescue itself, we must rely on an external, healthy ‘operating room’: the Live USB environment. (For Pop!_OS users, if the system can still enter recovery mode, that environment can play the same role.)

3.1 Preparing the Surgical Environment

Create and Boot Live USB: Download the system ISO, use tools like BalenaEtcher to create a bootable drive, and boot the computer from the USB drive.
Connect to the Network: After entering the Live USB desktop, connect to Wi-Fi or a wired network, which is necessary for downloading software packages later.

3.2 Entering the ‘Aseptic Operating Zone’ (Chroot Environment)

Open the terminal, and we will ‘enter’ the ‘sick’ system on the hard drive through a series of commands.

Unlock the LUKS Encrypted Volume:

sudo cryptsetup luksOpen /dev/nvme0n1p3 cryptdata

Activate the LVM Logical Volumes:
```
sudo vgchange -ay
```

Mount the System Partition:

sudo mount /dev/mapper/data-root /mnt
sudo mount /dev/nvme0n1p1 /mnt/boot/efi

Bind System Directories and Enter Chroot:
```
for i in dev dev/pts proc sys run; do sudo mount -B /$i /mnt/$i; done
sudo chroot /mnt
```
Upon success, the terminal prompt will change to root@...:/#, and now all operations will directly affect your main system.
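
The mount sequence above can be collected into one helper that prints each command before anything is run for real. This is a sketch under stated assumptions: the names run and prepare_chroot and the DRY_RUN switch are hypothetical, and the device paths and the data-root volume name are the ones from this article, which may differ on your machine.

```shell
#!/bin/bash
# Sketch: prepare the chroot target, with a dry-run mode that only prints.
# run, prepare_chroot and DRY_RUN are hypothetical names for illustration.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

prepare_chroot() {
    local luks_dev="$1" esp_dev="$2" target="${3:-/mnt}" d
    run cryptsetup luksOpen "$luks_dev" cryptdata   # unlock the LUKS layer
    run vgchange -ay                                # activate the LVM volumes
    run mount /dev/mapper/data-root "$target"       # mount the root volume
    run mount "$esp_dev" "$target/boot/efi"         # mount the EFI partition
    for d in dev dev/pts proc sys run; do
        run mount -B "/$d" "$target/$d"             # bind system directories
    done
}

# Usage: review the printed plan first, then rerun as root with DRY_RUN=0:
#   prepare_chroot /dev/nvme0n1p3 /dev/nvme0n1p1
```

Printing the plan first makes it easier to catch a wrong device path before it reaches cryptsetup or mount.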

3.3 ‘Debridement’ and ‘Transplant’: Fixing Core Issues

In the chroot environment, we will perform a thorough ‘surgery.’

Completely Remove the Source of Infection (Remove All NVIDIA and CUDA Packages): This is the only reliable way to resolve version conflicts and residual configurations.
```
apt-get purge --auto-remove -y '*nvidia*' '*cuda*' 'libcuda1*' 'libxnvctrl*'
```

Transplant ‘Healthy Organs’ (Install CUDA and Drivers from NVIDIA Official Source): Add NVIDIA’s official apt repository and install the latest CUDA and drivers from it, so that everything comes from a single source with compatible versions. In my case, the command ubuntu-drivers install nvidia-drivers-580 had no visible effect; the system still seemed to be using the open-source driver. If the driver installation does not work, get the graphical interface working first and tinker with the driver afterwards.

# (These commands are executed in chroot)
# First, install necessary tools
apt install software-properties-common

# Download and add NVIDIA's keyring
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb

# Update apt cache
apt-get update

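# Install the required version of the CUDA toolkit.
# Note: cuda-toolkit-* packages contain only the toolkit; the proprietary driver
# comes from a separate package such as cuda-drivers (or the full cuda metapackage).
apt-get -y install cuda-toolkit-13-0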

Generate a Brand New ‘Immune System’ (Rebuild initramfs): This is the most critical step, which will package the freshly installed drivers and all correct configurations into a new boot environment.
```
update-initramfs -u -k all
```
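
After the rebuild, it is worth confirming that every installed kernel actually has a non-empty initramfs image next to it. The check below is a minimal sketch: the function name check_initrds is hypothetical, and it assumes the Ubuntu-style naming of vmlinuz-<version> and initrd.img-<version> in /boot.

```shell
#!/bin/bash
# Sketch: verify that each kernel in a boot directory has a matching initrd.
# check_initrds is a hypothetical helper name.
check_initrds() {
    local boot_dir="${1:-/boot}" kernel version initrd status=0
    for kernel in "$boot_dir"/vmlinuz-*; do
        [ -e "$kernel" ] || continue          # the glob matched nothing
        version="${kernel##*/vmlinuz-}"
        initrd="$boot_dir/initrd.img-$version"
        if [ -s "$initrd" ]; then             # exists and is not empty
            echo "ok: $version"
        else
            echo "MISSING: $version"
            status=1
        fi
    done
    return "$status"
}

# Usage inside the chroot:
#   check_initrds /boot
```

A MISSING line means update-initramfs did not produce an image for that kernel, so its output deserves a closer look before rebooting.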

3.4 ‘Wake the Patient’: Wrap Up and Reboot

Exit the chroot Environment: exit
Reinstall the Bootloader: sudo bootctl --path=/mnt/boot/efi install
Reboot the Computer: sudo reboot (at this point, remove the USB drive)

Act Five: Post-Operative Recovery – The ‘Last Mile’ of the Boot Menu

Sometimes, after completing the above repairs, the system reboot may default to entering Recovery Mode. This indicates that the boot entry for the main system has been repaired, but the default boot order still points to the recovery partition.

When to No Longer Rely on Live USB? When you can successfully enter the recovery@recovery:~# command line, you no longer need the Live USB. You can directly execute the final boot fixes in the Recovery environment.

Fixing the Boot Menu

Mount the EFI Partition: sudo mount /dev/nvme0n1p1 /boot/efi
Manually Edit loader.conf: sudo nano /boot/efi/loader/loader.conf
Force Specify the Default Entry: Ensure the file content is as follows, where the default line’s filename must exactly match the main system configuration filename (including the .conf suffix) in the /boot/efi/loader/entries/ directory.
```
default Pop_OS-current.conf
timeout 3
```
Reboot to complete.
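
Before rebooting, the match between loader.conf and the entries directory can be checked mechanically. The following is a sketch only: check_default_entry is a hypothetical name, and it assumes the default line carries the full filename with the .conf suffix, as described above.

```shell
#!/bin/bash
# Sketch: confirm that the default entry named in loader.conf exists on the ESP.
# check_default_entry is a hypothetical helper name.
check_default_entry() {
    local esp="${1:-/boot/efi}" conf entry
    conf="$esp/loader/loader.conf"
    # Take the second field of the first line that starts with "default".
    entry=$(awk '$1 == "default" { print $2; exit }' "$conf")
    if [ -z "$entry" ]; then
        echo "no default line in $conf"
        return 1
    fi
    if [ -f "$esp/loader/entries/$entry" ]; then
        echo "ok: $entry"
    else
        echo "missing: $entry"
        return 1
    fi
}

# Usage with the EFI partition mounted:
#   check_default_entry /boot/efi
```

A "missing" result usually means a typo in the default line or a different entry filename than expected; listing /boot/efi/loader/entries/ shows the available names.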

Note that on UEFI systems Pop!_OS uses systemd-boot rather than GRUB, which is different from Ubuntu.

Act Six: Troubleshooting (Q&A)

Q: Error when mounting unknown filesystem type 'crypto_LUKS' or LVM2_member? A: This is because you did not execute cryptsetup luksOpen and vgchange -ay in order to unlock and activate the encrypted volume and LVM.
Q: Error in chroot when running update-initramfs Failed to retrieve NVRAM data? A: This is normal; the chroot environment cannot access the motherboard firmware. You can temporarily move the /etc/initramfs/post-update.d/zz-kernelstub script, run the command, and then move it back.
Q: Error in chroot when running nvidia-smi Driver/library version mismatch? A: This is normal. The chroot shares the kernel of the Live USB, and it is inevitable that the driver version does not match your main system. To determine if the driver is successfully installed, check for errors with the apt and update-initramfs commands.
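
The "move the script away, run the command, move it back" workaround from the second answer can be wrapped so that the hook is restored even when the command fails. This is a minimal sketch: with_hook_disabled is a hypothetical helper name, and the hook path in the usage line is the one mentioned above.

```shell
#!/bin/bash
# Sketch: run a command while a hook script is temporarily moved out of the way.
# with_hook_disabled is a hypothetical helper name.
with_hook_disabled() {
    local hook="$1" stash status=0
    shift
    stash=$(mktemp -d)                       # temporary parking place
    if [ -e "$hook" ]; then
        mv "$hook" "$stash/"
    fi
    "$@" || status=$?                        # remember the command's exit code
    if [ -e "$stash/${hook##*/}" ]; then
        mv "$stash/${hook##*/}" "$hook"      # always put the hook back
    fi
    rmdir "$stash"
    return "$status"
}

# Usage inside the chroot:
#   with_hook_disabled /etc/initramfs/post-update.d/zz-kernelstub \
#       update-initramfs -u -k all
```

Returning the wrapped command's exit code keeps a failed update-initramfs visible instead of hiding it behind the restore step.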

A driver installation should not affect /etc/crypttab or /etc/fstab, so this guide leaves both files unmodified; see the official documentation for details of their formats.

Conclusion

The Linux boot process is intricate, and a single accident can trigger a chain reaction of ‘system avalanches.’ Fortunately, a Live USB and chroot let us regain full control over the system. I hope this detailed review can serve as a reliable ‘first aid manual’ for your future explorations in Linux.

References

System76 Official Bootloader Repair Guide: https://support.system76.com/articles/bootloader/

Finally, here is a collection of commands to check various versions when installing drivers:

#!/bin/bash
# Comprehensive Diagnostic Command Set for NVIDIA & CUDA Environment (Concise Version)

echo "=============== HARDWARE ==============="
# Check GPU hardware, driver, and kernel module usage
lspci -k | grep -E -A 3 -i "VGA|3D|Display"

echo
echo "=============== KERNEL & OS ==============="
# View current running kernel, installed kernels, and system version
uname -r
ls /boot/vmlinuz-*
lsb_release -a

echo
echo "=============== DRIVER MODULES ==============="
# Check NVIDIA kernel module loading status
lsmod | grep nvidia
# Check DKMS compilation status (very critical)
dkms status
# View loaded driver version (if the module is loaded)
cat /proc/driver/nvidia/version

echo
echo "=============== PACKAGES (APT) ==============="
# View all installed NVIDIA and CUDA related packages
dpkg -l | grep -i nvidia
echo "---"
dpkg -l | grep -i cuda
# View the source policy of key packages
echo "---"
apt-cache policy nvidia-dkms-$(dpkg -l | grep -o 'nvidia-dkms-[0-9]\+' | head -n 1 | cut -d- -f3)
apt-cache policy cuda-toolkit

echo
echo "=============== NVIDIA & CUDA STATUS ==============="
# Check NVIDIA driver communication status
nvidia-smi
# Check CUDA compiler version
nvcc --version
# Check OpenGL renderer
glxinfo | grep "OpenGL renderer"

echo
echo "=============== SYSTEM LOGS (LAST 20) ==============="
# Filter the latest NVIDIA related errors from kernel logs and system logs
dmesg | grep -i -E "nvidia|nvrm" | tail -n 20
echo "---"
journalctl -b | grep -i -E "nvidia|nvrm" | tail -n 20

echo
echo "Diagnosis complete."
