
🚨 Linux System Boot Emergency Guide | Server Won’t Boot? These Skills Will Help You Fix It in 3 Minutes!
📖 Introduction: Do You Experience This Too?
Have you ever encountered this nightmare scenario? In the middle of the night, the server suddenly crashes, and after a reboot, the screen is completely black, only showing “GRUB>” or simply failing to boot into the system. You frantically press F2, F12, or Delete, but the system just ignores you. Worse yet, the boss is already calling you relentlessly: “Why is the system down? When will it be restored? Who will bear the losses of our business?”
If you have also experienced such a moment of “despair at booting up,” then today FYC has good news for you: system boot issues actually follow a pattern. Once you master the correct methods, you can bring the server “back to life” like a magician!
Moreover, we will also discuss how to identify hardware failures. After all, if the hardware itself is faulty, no software configuration, however good, will help. By learning these skills, you can upgrade from a “firefighter” to a “system doctor”!
Today we will talk about diagnosing Linux system boot issues and identifying hardware failures, a survival skill that every operations engineer must master. We will start with the principles of BIOS/UEFI booting, then walk step by step through fixing boot loaders, resolving service dependency issues, recovering a lost root password, and quickly locating hardware failures.
This article will bring you:
- 🎯 Three Steps to Diagnose Boot Issues: BIOS/UEFI boot repair, service dependency troubleshooting, root password recovery
- 🔧 Hardware Identification Tools: Comprehensive identification tools for CPU, memory, disk, PCI, USB
- 📊 Hardware Error Monitoring: mcelog and rasdaemon help you detect hardware issues in advance
- 🚀 Virtualization Troubleshooting: KVM support checks, resource over-allocation, network configuration issues
- 💡 Practical Case Analysis: The complete process of boot failure and hardware failure troubleshooting in real scenarios
Follow FYC, and say goodbye to the era of “despair at booting up”!
5-Dimensional Scoring Table
| Dimension | Score | Description |
|---|---|---|
| Difficulty Level | ⭐⭐⭐⭐ | Requires in-depth understanding of the boot process and hardware principles |
| Practical Value | ⭐⭐⭐⭐⭐ | A lifesaver during critical moments in daily work |
| Technical Depth | ⭐⭐⭐⭐ | Comprehensive coverage from boot processes to hardware diagnostics |
| Operability | ⭐⭐⭐⭐⭐ | All commands and tools can be used directly |
| Urgency | ⭐⭐⭐⭐⭐ | System boot issues are usually the highest priority failures |
📚 Main Content: Packed with Useful Information, Served Step by Step!
🚀 1. Troubleshooting Boot Issues: Bringing the System “Back to Life”
When a server won’t boot, it can be the most frustrating problem for operations engineers. But don’t panic, FYC is here to teach you how to troubleshoot and fix it step by step!
📋 Reviewing the Boot Process: Understanding Each Step of the Boot
Before we start fixing, we need to understand how the system boots. On a traditional BIOS system, a complete boot process includes 10 key steps:
1. BIOS firmware starts → Executes Power-On Self-Test (POST)
2. BIOS scans boot devices → Orders by priority
3. Looks for boot records → MBR or bootable partition
4. Loads the first stage Boot Loader → Reads from MBR
5. Loads the second stage Boot Loader → grub2 configuration file
6. Parses grub.cfg → Selects boot option
7. Loads kernel and initrd → Prepares to boot
8. Kernel initializes hardware → Loads drivers
9. Mounts root filesystem → Switches to real root
10. Starts systemd → Loads services
If any of these 10 steps fails, the system won’t boot! Most boot issues occur in steps 4-6, which is the Boot Loader phase.
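Steps 6 and 7 only work if the files under /boot are actually there. Here is a minimal sketch of a quick check, assuming the RHEL naming convention of vmlinuz-VERSION paired with initramfs-VERSION.img. The function name kernels_missing_initramfs is hypothetical, and the optional directory argument is only there so the check can be pointed at a mounted rescue image such as /mnt/sysimage/boot:

```shell
#!/bin/sh
# kernels_missing_initramfs [BOOT_DIR]
# Print the version of every kernel image that has no matching initramfs.
# Assumes the vmlinuz-VERSION / initramfs-VERSION.img naming used on RHEL.
kernels_missing_initramfs() {
    boot=${1:-/boot}
    for k in "$boot"/vmlinuz-*; do
        [ -e "$k" ] || continue          # glob matched nothing
        ver=${k##*/vmlinuz-}             # strip directory and prefix
        [ -e "$boot/initramfs-$ver.img" ] || echo "$ver"
    done
}
```

An empty result means every installed kernel has its initramfs; any version printed is a candidate for rebuilding with dracut.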
🔧 2. Fixing Boot Issues on Traditional BIOS Systems
If your server is still using traditional BIOS, then the GRUB2 repair method is your lifesaver!
GRUB2 File Structure: Knowing Where the Files Are to Fix Them
The key files of GRUB2 are distributed in the following locations:
/boot/ # Kernel and initial RAM disks (initramfs)
/boot/grub2/ # Configuration files, extended modules, themes
/boot/grub2/grub.cfg # Main configuration file (auto-generated, do not edit manually!)
/etc/grub.d/ # Scripts that generate configuration files
/etc/default/grub # Configuration file variables (should edit this!)
/boot/grub2/grubenv # Stores environment variables
⚠️ Important Note: /boot/grub2/grub.cfg is auto-generated, do not edit it manually! To modify the GRUB2 configuration, edit /etc/default/grub, then run grub2-mkconfig to regenerate it.
Configuring GRUB2: Just Modify These Parameters
Common configuration parameters:
# Edit GRUB configuration
vim /etc/default/grub
# Main parameter descriptions:
GRUB_TIMEOUT=5 # Boot menu display time (seconds)
GRUB_DEFAULT=0 # Default boot option (counting starts from 0)
GRUB_DEFAULT=saved # Use the last saved selection
GRUB_CMDLINE_LINUX="..." # Kernel command line parameters
# Regenerate configuration file
grub2-mkconfig -o /boot/grub2/grub.cfg
Practical Example: Modifying Boot Timeout
# 1. Edit configuration file
vim /etc/default/grub
# Modify: GRUB_TIMEOUT=10 (change to 10 seconds)
# 2. Regenerate grub.cfg
grub2-mkconfig -o /boot/grub2/grub.cfg
# 3. Verify configuration
grep timeout /boot/grub2/grub.cfg
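If you make this kind of change often, the edit step can be scripted. The sketch below is one possible helper, not an official GRUB tool: set_grub_var is a hypothetical name, and it assumes simple values that do not contain the characters | or &, since sed treats those specially here. It only rewrites the defaults file; you still need to run grub2-mkconfig afterwards.

```shell
#!/bin/sh
# set_grub_var FILE KEY VALUE
# Replace the KEY=... line in a GRUB defaults file, or append it if missing.
# Assumption: VALUE contains no '|' or '&' (both are special to sed here).
set_grub_var() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    if grep -q "^${key}=" "$file"; then
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp"
    else
        cat "$file" > "$tmp"
        printf '%s=%s\n' "$key" "$value" >> "$tmp"
    fi
    # Write back through cat so the original file keeps its owner and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example, set_grub_var /etc/default/grub GRUB_TIMEOUT 10 followed by grub2-mkconfig -o /boot/grub2/grub.cfg reproduces the three manual steps above.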
Reinstalling GRUB2 in MBR: The Ultimate Method to Fix Boot Loaders
If GRUB2 is corrupted or the MBR is damaged, you need to reinstall it in rescue mode:
Method 1: Using Rescue Mode (Recommended)
# 1. Boot from installation media, select "Rescue an installed system"
# 2. Choose option 1 (Continue), the system will mount to /mnt/sysimage
# 3. Press Enter to get a shell
# 4. chroot into the system
chroot /mnt/sysimage
# 5. Verify if /boot is mounted
ls -l /boot
# 6. Reinstall GRUB2 to MBR
grub2-install /dev/vda # Adjust according to your disk device name
# 7. Regenerate configuration file
grub2-mkconfig -o /boot/grub2/grub.cfg
# 8. Exit and reboot
exit
reboot
Method 2: Fixing on a Running System (If You Can Still Log In)
# If the system can still boot, run directly
grub2-install /dev/vda
grub2-mkconfig -o /boot/grub2/grub.cfg
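Before reinstalling, it can help to confirm that the first sector really is damaged. A minimal sketch, assuming a BIOS/MBR disk: a valid boot sector ends with the two bytes 0x55 0xAA at offset 510. The function name mbr_has_boot_signature is hypothetical, and a present signature does not prove GRUB2 itself is intact; it only rules out a wiped sector.

```shell
#!/bin/sh
# mbr_has_boot_signature DEVICE_OR_IMAGE
# Succeed if bytes 510-511 of the first sector are 0x55 0xAA.
mbr_has_boot_signature() {
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

Typical use in rescue mode would be: mbr_has_boot_signature /dev/vda || echo "boot signature missing". Reading the device requires root.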
🔌 3. Resolving Boot Loader Issues on UEFI Systems
If your server uses UEFI (most new servers do), the repair method will be different.
UEFI vs BIOS: Key Differences to Know
| Feature | BIOS | UEFI |
|---|---|---|
| Boot Record | MBR (first 512-byte sector) | EFI System Partition on a GPT disk |
| Disk Size | Max 2 TiB (MBR limit) | Over 2 TiB (GPT) |
| Boot Registration | Firmware scans devices | OS registers boot entries in firmware (NVRAM) |
| Configuration File | /boot/grub2/grub.cfg | /boot/efi/EFI/redhat/grub.cfg |
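Not sure which column of the table applies to a given machine? On a running Linux system, the kernel exposes /sys/firmware/efi only when it was booted through UEFI. A minimal sketch (boot_mode is a hypothetical helper name; the optional root prefix exists only for testing against a different tree):

```shell
#!/bin/sh
# boot_mode [ROOT_PREFIX]
# Print UEFI if the kernel exposes the EFI firmware directory, else BIOS.
boot_mode() {
    if [ -d "${1:-}/sys/firmware/efi" ]; then
        echo UEFI
    else
        echo BIOS
    fi
}
```

Note that this reports how the currently running system was booted, so in rescue mode it reflects how the rescue media was started.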
UEFI Boot Chain: shim → grub → kernel
UEFI systems use shim as a bridge for Secure Boot:
UEFI firmware → shim.efi → grubx64.efi → kernel
Role of shim:
- Signed with keys trusted by UEFI firmware
- Verifies and loads grubx64.efi
- Supports Secure Boot
Fixing UEFI Boot Issues: Three Steps to Success
# 1. Reinstall grub2-efi and shim
yum reinstall grub2-efi shim
# 2. If the configuration file is deleted, regenerate it
grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
# 3. If the entry was removed from the UEFI boot menu, booting from this disk once adds it back automatically
Managing UEFI Boot Targets: The efibootmgr Tool
efibootmgr is a dedicated tool for managing UEFI boot entries:
# Install the tool
yum install -y efibootmgr
# View current boot targets
efibootmgr
# Delete a boot entry
efibootmgr -B -b 001E
# Set a one-time boot target (used for the next boot only)
efibootmgr -n 002C
# Add a new boot entry
efibootmgr -c -d /dev/sda -p 2 -L "MyLinux" -l "\EFI\redhat\grubx64.efi"
# Note: Use backslashes in the path!
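When the entry list is long, it is easy to misread which entry the firmware will try first. The sketch below reads plain efibootmgr output on standard input and prints the entries in BootOrder sequence. The function name efi_entries_in_order is hypothetical, and the parsing assumes the usual "BootOrder: ..." and "BootXXXX* label" line layout:

```shell
#!/bin/sh
# efi_entries_in_order: read efibootmgr output on stdin and print
# "NUMBER: label" for each entry, in the order given by BootOrder.
efi_entries_in_order() {
    awk '
        /^BootOrder:/ { n = split($2, order, ",") }
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
            num = substr($1, 5, 4)                 # the four hex digits
            label = $0
            sub(/^Boot[0-9A-Fa-f]+\*? /, "", label)
            sub(/\t.*/, "", label)                 # drop device path, if printed
            labels[num] = label
        }
        END { for (i = 1; i <= n; i++) print order[i] ": " labels[order[i]] }
    '
}
```

Usage would be efibootmgr | efi_entries_in_order; the first line printed is the entry tried first.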
⚙️ 4. Handling Failed Services: Troubleshooting systemd Dependencies
The system can boot, but services won’t start? This is usually due to service dependency issues!
Types of systemd Dependencies: Six Relationships to Know
systemd supports six types of dependencies:
- Requires=: Strong dependency on required units
  - If the required unit fails to start, the current unit will also fail
- Wants=: Weak dependency on optional units
  - If the wanted unit fails to start, the current unit can still start
- Requisite=: Required and must already be running
  - The listed unit must already be active; otherwise the current unit fails immediately
- Conflicts=: Conflict relationship
  - Starting the current unit stops the conflicting unit
- Before= / After=: Start order
  - Specifies only the order in which units start; it does not pull the other unit in
- RequiresOverridable=: Overridable dependency
  - When explicitly started by the administrator, failure will not cause the unit to fail
  - Available on RHEL 7 (systemd 219); removed from newer systemd releases
Checking Service Dependencies: Three Methods to Choose From
Method 1: View Service Details
systemctl show httpd.service
Method 2: View Dependency Tree (Most Common!)
# View all dependencies of the service (tree view)
systemctl list-dependencies httpd.service
# Recursively view all dependencies
systemctl list-dependencies --all httpd.service
# View reverse dependencies (who depends on this service)
systemctl list-dependencies --reverse httpd.service
Method 3: Graphical View of Dependencies
# Install graphviz
yum install -y graphviz
# Generate dependency graph
systemd-analyze dot httpd.service | dot -Tsvg > httpd-deps.svg
Resolving Service Dependency Issues: Practical Case Analysis
Scenario: Both httpd and vsftpd fail to start due to mutual dependencies, forming a circular dependency.
# 1. Check dependency relationships
systemctl list-dependencies httpd.service
systemctl list-dependencies vsftpd.service
# 2. Check service status
systemctl status httpd.service
systemctl status vsftpd.service
# 3. View service unit files
systemctl cat httpd.service
systemctl cat vsftpd.service
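# 4. Modify dependency relationships (if necessary)
# Drop-ins (override.conf) can only add dependencies, so edit a full copy of the unit instead
systemctl edit --full httpd.service
# Remove or modify the dependency that closes the cycle (e.g. After=, Requires=)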
# 5. Reload systemd configuration
systemctl daemon-reload
# 6. Restart services
systemctl restart httpd.service
systemctl restart vsftpd.service
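The "do these two units point at each other?" check in step 1 can also be scripted. This is a minimal sketch: list_has, unit_prop, and mutual_dependency are hypothetical helper names, and the property to compare (After, Requires, and so on) is passed in by the caller. It only detects a direct two-unit loop, not longer chains.

```shell
#!/bin/sh
# list_has LIST UNIT: succeed if UNIT is one of the space-separated words in LIST.
list_has() {
    for u in $1; do
        [ "$u" = "$2" ] && return 0
    done
    return 1
}

# unit_prop UNIT PROP: print the value of one property, e.g. After or Requires.
unit_prop() {
    systemctl show -p "$2" "$1" | cut -d= -f2-
}

# mutual_dependency A B PROP: succeed if A lists B and B lists A under PROP.
mutual_dependency() {
    list_has "$(unit_prop "$1" "$3")" "$2" &&
        list_has "$(unit_prop "$2" "$3")" "$1"
}
```

For example, mutual_dependency httpd.service vsftpd.service After && echo "ordering loop" flags the scenario described above.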
Debugging Tips: Use debug-shell to Obtain Early Root Shell
If the system hangs on a service during boot, you can enable debug shell for troubleshooting:
# Enable debug shell (⚠️ Must disable after debugging!)
systemctl enable debug-shell.service
# After rebooting, there will be a root shell on tty9
# Press Ctrl+Alt+F9 to switch to tty9
# Disable debug shell after debugging
systemctl disable debug-shell.service
⚠️ Security Warning: debug-shell gives a root shell, without any authentication, to anyone who can reach the console; it must be disabled after debugging!
🔑 5. Recovering Lost Root Password: The rd.break Method
Forgot the root password? Don’t panic! FYC will teach you how to “break in” and reset the password.
Method 1: Interrupting the Boot Process with rd.break (Recommended)
This is the fastest method and does not require external media:
# 1. Reboot the system, when the GRUB menu appears, select the boot option and press 'e' to edit
# 2. Find the line starting with linux16 or linuxefi (linux on RHEL 8 and later)
# 3. Add rd.break at the end of the line (separated by a space)
# For example: linux16 /vmlinuz ... rd.break
# 4. Press Ctrl+X to boot
# 5. The system will stop at the initramfs stage, and you will see a switch_root:/# shell prompt
# 6. Remount the root filesystem as writable
mount -o remount,rw /sysroot
# 7. chroot into the system
chroot /sysroot
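# 8. Reset the root password (interactive, so the new password is not left on the screen)
passwd root
# 9. Fix the SELinux context of /etc/shadow (Important! Otherwise logins fail after reboot)
load_policy -i
restorecon -Rv /etc
# Alternative to step 9: touch /.autorelabel (relabels the whole filesystem on next boot, slower)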
# 10. Exit and reboot
exit
exit
Method 2: Using Rescue Mode
If more comprehensive repairs are needed, you can use the rescue mode from the installation media:
# 1. Boot from installation media, select "Rescue an installed system"
# 2. Choose option 1 (Continue), mount to /mnt/sysimage
# 3. chroot into the system
chroot /mnt/sysimage
# 4. Reset password or edit /etc/shadow file
passwd root
# or
vim /etc/shadow
# 5. Exit and reboot
exit
reboot
🔍 6. Identifying Hardware Issues: Letting Hardware “Speak”
The system can boot, but frequently has issues? It is likely that there are hardware failures! FYC will teach you how to let the hardware “speak”.
Hardware Identification Toolset: Six Essential Tools
1. Identifying CPU: lscpu Command
# View CPU information
lscpu
# View CPU supported flags (important!)
grep flags /proc/cpuinfo
# Check virtualization support
grep -E 'vmx|svm' /proc/cpuinfo
# vmx: Intel CPU virtualization support
# svm: AMD CPU virtualization support
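The plain grep above prints every matching flags line in full, which is hard to read on a many-core server. A minimal sketch that reduces it to one word (cpu_virt_flag is a hypothetical name; the optional file argument is only there so the function can be tried on a saved copy of cpuinfo):

```shell
#!/bin/sh
# cpu_virt_flag [CPUINFO_FILE]
# Print vmx (Intel), svm (AMD), or none, based on the first flags line found.
cpu_virt_flag() {
    awk '
        /^flags/ {
            for (i = 1; i <= NF; i++)
                if ($i == "vmx" || $i == "svm") { print $i; found = 1; exit }
        }
        END { if (!found) print "none" }
    ' "${1:-/proc/cpuinfo}"
}
```

A result of none can also mean that virtualization is switched off in the firmware setup, so check there before concluding the CPU lacks support.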
2. Identifying Memory: dmidecode Tool
# Install the tool
yum install -y dmidecode
# View memory information
dmidecode -t memory
# View per-module details (DMI type 17 = Memory Device)
dmidecode -t 17
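The type 17 output runs to dozens of lines per slot. To see at a glance which slots are populated, the output can be condensed. This sketch assumes the usual dmidecode layout, where each Memory Device block has a "Size:" line followed later by a "Locator:" line; dimm_summary is a hypothetical name.

```shell
#!/bin/sh
# dimm_summary: read `dmidecode -t 17` output on stdin and print
# "LOCATOR: SIZE" for each memory slot.
dimm_summary() {
    awk -F': *' '
        /^[ \t]*Size:/    { size = $2 }            # remembered until Locator
        /^[ \t]*Locator:/ { print $2 ": " size }   # "Bank Locator:" is skipped
    '
}
```

Run it as dmidecode -t 17 | dimm_summary; empty slots show up as "No Module Installed".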
3. Identifying Disks: lsscsi and hdparm
# Install the tools
yum install -y lsscsi hdparm
# View SCSI devices
lsscsi
# View detailed disk information
hdparm -I /dev/sda
4. Identifying PCI Hardware: lspci Command
# View PCI devices
lspci
# View detailed information (-v increases detail level)
lspci -v
lspci -vv # More detailed
lspci -vvv # Most detailed
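# View specific devices (wired NICs are listed as "Ethernet controller", graphics adapters as "VGA compatible controller")
lspci | grep -i -E 'ethernet|network'
lspci | grep -i -E 'vga|display|3d'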
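A device that shows up in lspci but has no driver bound is a common cause of "the hardware is there but does not work". lspci -k prints a "Kernel driver in use:" line for each bound device, and the sketch below condenses that into one line per device. pci_drivers is a hypothetical name, and the parsing assumes the standard indented lspci -k layout.

```shell
#!/bin/sh
# pci_drivers: read `lspci -k` output on stdin and print "SLOT DRIVER"
# per device, with "(none)" when no kernel driver is bound.
pci_drivers() {
    awk '
        /^[0-9a-f]/ {                       # a new device line starts here
            if (slot != "") print slot, (drv == "" ? "(none)" : drv)
            slot = $1; drv = ""
        }
        /Kernel driver in use:/ { drv = $NF }
        END { if (slot != "") print slot, (drv == "" ? "(none)" : drv) }
    '
}
```

Run it as lspci -k | pci_drivers | grep '(none)' to list only the devices without a driver.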
5. Identifying USB Hardware: lsusb Command
# Install the tool
yum install -y usbutils
# View USB devices
lsusb
# View detailed information
lsusb -v
Hardware Error Monitoring: The Eyes to Detect Problems Early
Tool 1: mcelog – Machine Check Exception Log
# Install the tool
yum install -y mcelog
# Start the service
systemctl start mcelog.service
systemctl enable mcelog.service
# View logs
journalctl -u mcelog.service
# View /var/log/mcelog file (if configured with cron)
tail -f /var/log/mcelog
Tool 2: rasdaemon – Reliability, Availability, Serviceability Daemon
# Install the tool
yum install -y rasdaemon
# Start the service
systemctl start rasdaemon.service
systemctl enable rasdaemon.service
# View status
ras-mc-ctl --status
# View errors
ras-mc-ctl --errors
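The same memory error counters are also exposed by the kernel's EDAC subsystem in sysfs, which is handy when you want a number for a monitoring script. A minimal sketch, assuming an EDAC driver is loaded for your memory controller so that the mc0, mc1, ... directories exist; edac_totals is a hypothetical name and the optional directory argument is for testing only.

```shell
#!/bin/sh
# edac_totals [EDAC_MC_DIR]
# Sum corrected (ce_count) and uncorrected (ue_count) memory errors
# across all memory controllers known to EDAC.
edac_totals() {
    base=${1:-/sys/devices/system/edac/mc}
    ce=0 ue=0
    for mc in "$base"/mc*; do
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$((ce + $(cat "$mc/ce_count")))
        ue=$((ue + $(cat "$mc/ue_count")))
    done
    echo "corrected=$ce uncorrected=$ue"
}
```

A steadily growing corrected count usually points at a DIMM that is starting to fail; any uncorrected error deserves immediate attention.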
Memory Testing: memtest86+ Helps Identify Memory Failures
If you suspect there are memory issues, you can use memtest86+ for testing:
# Install the tool
yum install -y memtest86+
# Run memtest-setup
memtest-setup
# Regenerate GRUB2 configuration
grub2-mkconfig -o /boot/grub2/grub.cfg
# Reboot the system, and the memtest86+ option will appear in the GRUB menu
# Select it for memory testing
🧩 7. Managing Kernel Modules: Making Hardware Drivers Work
After identifying the hardware, the next step is to ensure the drivers are functioning properly. Kernel module management is key!
Viewing and Managing Kernel Modules
# View loaded modules
lsmod
# View module details
modinfo megaraid_sas
# View parameters supported by the module
modinfo -p megaraid_sas
# Load module
modprobe megaraid_sas
# Unload module
modprobe -r megaraid_sas
# View module dependencies
modprobe --show-depends megaraid_sas
Configuring Module Parameters: Making Drivers Work as Needed
Many kernel modules support parameters to adjust behavior:
Method 1: Temporary Setting (Effective Immediately, Lost After Reboot)
# Set parameters when loading the module (msix_disable=1 turns MSI-X off)
modprobe megaraid_sas msix_disable=1
# View current parameter values
cat /sys/module/megaraid_sas/parameters/msix_disable
Method 2: Permanent Setting (Recommended)
# Create module configuration file
vim /etc/modprobe.d/megaraid_sas.conf
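# Add configuration
options megaraid_sas msix_disable=1
# If the module is already loaded, unload and reload it (only possible when no disk in use depends on it)
modprobe -r megaraid_sas
modprobe megaraid_sas
# If the driver is loaded from the initramfs, rebuild it with dracut -f so the option applies at boot
# Verify parameters
cat /sys/module/megaraid_sas/parameters/msix_disable
# Should display: 1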
Practical Example: Disabling MSI-X Interrupts for SAS RAID Card
# Problem: Frequent MSI-X interrupt errors in logs
# Solution: Disable MSI-X interrupt handling in the driver
# 1. Create configuration file
cat > /etc/modprobe.d/megaraid_sas.conf << EOF
options megaraid_sas msix_disable=1
EOF
# 2. Reload module
modprobe -r megaraid_sas
modprobe megaraid_sas
# 3. Verify
cat /sys/module/megaraid_sas/parameters/msix_disable
# Should display: 1
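To check every parameter of a loaded module at once instead of one cat per file, the parameters directory in sysfs can be walked. A minimal sketch (module_params is a hypothetical name; the optional second argument overrides /sys/module for testing):

```shell
#!/bin/sh
# module_params MODULE [SYS_MODULE_DIR]
# Print name=value for every readable parameter of a loaded module.
module_params() {
    dir=${2:-/sys/module}/$1/parameters
    if [ ! -d "$dir" ]; then
        echo "module $1 is not loaded or exposes no parameters" >&2
        return 1
    fi
    for p in "$dir"/*; do
        [ -r "$p" ] && printf '%s=%s\n' "${p##*/}" "$(cat "$p")"
    done
    return 0
}
```

Calling module_params megaraid_sas before and after a reload makes it easy to confirm that the options line in /etc/modprobe.d actually took effect.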
🖥️ 8. Handling Virtualization Issues: KVM Troubleshooting
If your environment uses KVM virtualization, then troubleshooting virtual machine issues is also an essential skill!
Checking Hardware Virtualization Support
KVM requires both CPU and firmware to support hardware virtualization:
# 1. Check if the CPU supports virtualization
grep -E 'vmx|svm' /proc/cpuinfo
# Intel CPU shows vmx, AMD CPU shows svm
# 2. Try loading KVM module
modprobe kvm-intel # Intel CPU
# or
modprobe kvm-amd # AMD CPU
# 3. Check using virsh
virsh capabilities | grep -A 5 kvm
# ⚠️ Important: If hardware virtualization is unavailable, VMs will run on emulated processors, which will be significantly slower!
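After step 2 it is worth confirming that the vendor module really stayed loaded, since modprobe can succeed on the generic kvm module while the hardware-specific one is refused. A minimal sketch that reads /proc/modules (kvm_module_loaded is a hypothetical name; the optional file argument is for testing):

```shell
#!/bin/sh
# kvm_module_loaded [MODULES_FILE]
# Print kvm_intel or kvm_amd if loaded; exit non-zero if neither is present.
kvm_module_loaded() {
    awk '
        $1 == "kvm_intel" || $1 == "kvm_amd" { print $1; found = 1 }
        END { exit !found }
    ' "${1:-/proc/modules}"
}
```

For example: kvm_module_loaded || echo "no KVM vendor module loaded, VMs will fall back to emulation".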
Checking Resource Over-Allocation
libvirt allows you to allocate more virtual resources to VMs than the actual resources of the host, which can lead to performance issues:
# View host resources
virsh nodecpustats
virsh nodememstats
# View virtual machine resources
virsh dommemstat vm1
virsh vcpuinfo vm1
# Use top to view VM processes (VMs appear as regular processes on the host)
# pgrep -d, joins the PIDs with commas, which is the list format top -p expects
top -p "$(pgrep -d, -f qemu)"
Resolving Resource Over-Allocation Issues:
- Add more physical resources (the simplest solution)
- Limit VM resource usage (using cgroups)
- Stop unnecessary VMs (to free up resources)
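To put a number on "over-allocated", you can compare the vCPUs handed out to running VMs with the host's CPU count. The sketch below is one way to do that: overcommit_ratio and running_vcpus are hypothetical helper names, and what ratio counts as too high depends entirely on your workload.

```shell
#!/bin/sh
# overcommit_ratio ALLOCATED_VCPUS HOST_CPUS
# Print allocated / physical with two decimals; fail if HOST_CPUS <= 0.
overcommit_ratio() {
    awk -v v="$1" -v h="$2" 'BEGIN { if (h <= 0) exit 1; printf "%.2f\n", v / h }'
}

# running_vcpus: sum the "CPU(s)" field of `virsh dominfo` over all
# running domains reported by `virsh list --name`.
running_vcpus() {
    virsh list --name | while read -r dom; do
        [ -n "$dom" ] && virsh dominfo "$dom" </dev/null
    done | awk '/^CPU\(s\):/ { total += $2 } END { print total + 0 }'
}
```

On the hypervisor, overcommit_ratio "$(running_vcpus)" "$(nproc)" prints the current ratio; a value of 1.00 or below means no vCPU over-allocation.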
Validating libvirt XML Configuration
libvirt’s virtual machine configuration is stored in XML format, and configuration errors can prevent VMs from starting:
# Validate XML file syntax
xmllint --noout /etc/libvirt/qemu/vm1.xml
# Validate XML configuration correctness (stricter checks against libvirt's RelaxNG schema)
# The optional second argument is the schema name (e.g. domain, network), not a file path
virt-xml-validate /etc/libvirt/qemu/vm1.xml domain
# ⚠️ Note: Do not manually edit files under /etc/libvirt/; use virsh or virt-manager instead!
Troubleshooting Virtual Network Issues
libvirt uses software bridges to implement virtual networks, and network issues are also common failures:
Common Issue 1: VM Cannot Be Accessed Externally
# Check the type of virtual network (a NAT network cannot be reached from outside)
virsh net-list
virsh net-dumpxml default | grep '<forward'
# Check firewall rules
iptables -L -n | grep virbr
# Check the bridge
brctl show
Common Issue 2: VM Cannot Reach the Outside World
# Check if the virtual network is isolated (no <forward> element in its definition)
virsh net-dumpxml default
# Check hypervisor firewall rules
firewall-cmd --list-all
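Both of the issues above come down to the network's forward mode, which is recorded in the XML printed by virsh net-dumpxml. A minimal sketch that extracts it, assuming libvirt's documented behavior that a network with no forward element is isolated and that a forward element without a mode attribute defaults to nat; net_forward_mode is a hypothetical name.

```shell
#!/bin/sh
# net_forward_mode: read libvirt network XML on stdin and print its
# forward mode (nat, route, bridge, ...) or "isolated" if there is none.
net_forward_mode() {
    xml=$(cat)
    case $xml in
        *"<forward"*) ;;                       # has a forward element
        *) echo isolated; return 0 ;;
    esac
    mode=$(printf '%s\n' "$xml" |
        sed -n "s/.*<forward[^>]*mode=['\"]\([a-z]*\)['\"].*/\1/p" | head -n 1)
    echo "${mode:-nat}"                        # no mode attribute: assume nat
}
```

Run it as virsh net-dumpxml default | net_forward_mode. A result of nat explains issue 1, and isolated explains issue 2.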
Common Issue 3: Complete Network Disruption
# If all iptables rules are cleared, libvirt's network may be disrupted
# Solution: Restart libvirtd or the virtual network
systemctl restart libvirtd.service
# or
virsh net-destroy default
virsh net-start default
💼 9. Practical Case: Complete Troubleshooting Process
Having discussed so much theory, FYC will provide you with a practical case to see how these skills are combined!
Case 1: System Fails to Boot (BIOS System)
Scenario: The server cannot boot after a restart, and the GRUB menu is not visible.
Troubleshooting Steps:
# 1. Boot from installation media, enter rescue mode
# Select "Rescue an installed system" → Option 1 (Continue)
# 2. chroot into the system
chroot /mnt/sysimage
# 3. Check /boot directory
ls -l /boot
# Found: Kernel files exist, but grub2 directory may have issues
# 4. Check disk device name
lsblk
# 5. Reinstall GRUB2
grub2-install /dev/vda
# 6. Regenerate configuration file
grub2-mkconfig -o /boot/grub2/grub.cfg
# 7. Verify configuration
ls -l /boot/grub2/grub.cfg
# 8. Exit and reboot
exit
exit
reboot
Case 2: Service Start Failure (Dependency Issue)
Scenario: Both httpd and vsftpd fail to start; the system can boot, but services cannot run.
Troubleshooting Steps:
# 1. Check service status
systemctl status httpd.service
systemctl status vsftpd.service
# 2. Check dependency relationships
systemctl list-dependencies httpd.service
systemctl list-dependencies vsftpd.service
# 3. Check service logs
journalctl -u httpd.service -n 50
journalctl -u vsftpd.service -n 50
# 4. Check service unit files
systemctl cat httpd.service
systemctl cat vsftpd.service
# 5. Identify the issue: the two services depend on each other, forming a circular dependency
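# 6. Modify service configuration (remove the dependency that closes the cycle)
# Drop-ins can only add dependencies, so edit a full copy of the unit under /etc/systemd/system/
systemctl edit --full httpd.service
# [Unit]
# Delete vsftpd.service from After= / Requires=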
# 7. Reload configuration
systemctl daemon-reload
# 8. Restart services
systemctl restart httpd.service
systemctl restart vsftpd.service
# 9. Verify
systemctl status httpd.service
systemctl status vsftpd.service
Case 3: Hardware Failure Identification (Memory Error)
Scenario: The system frequently experiences memory errors, suspected to be hardware failures.
Troubleshooting Steps:
# 1. Check hardware error logs
journalctl -u mcelog.service
ras-mc-ctl --errors
# 2. View memory information
dmidecode -t memory
# 3. Run memory test
yum install -y memtest86+
memtest-setup
grub2-mkconfig -o /boot/grub2/grub.cfg
# 4. Reboot the system, select the memtest86+ option for testing
# After testing, check the results
# 5. Replace faulty memory modules based on test results
🎁 Conclusion!
📋 Value Summary
Today, FYC has brought you a complete guide to Linux system boot and hardware failure troubleshooting:
✅ Three Steps to Diagnose Boot Issues:
- BIOS/UEFI Boot Repair: Reinstalling GRUB2, Rebuilding Configuration Files
- systemd Service Dependency Troubleshooting: Checking Dependencies, Resolving Circular Dependencies
- Root Password Recovery: Interrupting Boot Process with rd.break, Resetting in Rescue Mode
✅ Hardware Identification Tools:
- Comprehensive identification tools for CPU, memory, disk, PCI, USB
- Hardware error monitoring: mcelog, rasdaemon to detect failures in advance
- Memory testing: memtest86+ to identify memory failures
✅ Kernel Modules and Virtualization:
- Kernel module parameter configuration: Making drivers work as needed
- KVM Virtualization Support Check: Ensuring VM Performance
- libvirt Network Troubleshooting: Resolving VM Network Issues
By mastering these skills, you can bring the system “back to life” in critical moments, upgrading from a “firefighter” to a “system doctor”!
🎯 Call to Action
Want more than this article covers? Looking for a more detailed interpretation of GRUB2 configuration files, in-depth analysis of systemd dependencies, and more practical case studies for hardware failure troubleshooting?
👉 Click “Read the Original” below to get:
- 📚 Complete Boot Failure Troubleshooting Checklist
- 🔧 Complete Interpretation of GRUB2 Configuration Files (All Parameter Descriptions)
- 📊 Usage Guide to Visualizing systemd Dependencies
- 🎯 Quick Reference for Hardware Failure Diagnosis Tools (All Commands and Parameters)
- 💡 More Real Case Analyses (Covering BIOS/UEFI/Virtualization)
FYC’s Mission: To enable every operations engineer to become a system doctor! Technology should be hardcore, and the writing should be engaging! 🔥
#Operations #Linux #SystemBoot #HardwareFailure #Troubleshooting #GRUB2 #systemd #KVM #TechnicalContent #RedHat #RCA