Linux Troubleshooting Series 2 – Diagnosing System Boot Issues & Identifying Hardware Failures

🚨 Linux System Boot Emergency Guide | Server Won’t Boot? These Skills Will Help You Fix It in 3 Minutes!

📖 Introduction: Do You Experience This Too?

First Paragraph: Pain Point Scenario

Have you ever encountered this nightmare scenario? In the middle of the night, the server suddenly crashes, and after a reboot, the screen is completely black, only showing “GRUB>” or simply failing to boot into the system. You frantically press F2, F12, or Delete, but the system just ignores you. Worse yet, the boss is already calling you relentlessly: “Why is the system down? When will it be restored? Who will bear the losses of our business?”

If you have also experienced such a moment of “despair at booting up,” then today FYC has good news for you: system boot issues actually follow a pattern; once you master the correct methods, you can make the server “come back to life” like a magician!

Moreover, we will also discuss how to identify hardware failures. After all, if the hardware itself is faulty, no software configuration, however good, will help. By learning these skills, you can upgrade from a “firefighter” to a “system doctor”!

Second Paragraph: Overview of Solutions

Today we will talk about diagnosing Linux system boot issues and identifying hardware failures, which is a survival skill that every operations engineer must master. We will start with the principles of BIOS/UEFI booting, guiding you step by step on how to fix boot loaders, resolve service dependency issues, recover lost root passwords, and quickly identify and locate hardware failures.

This article will bring you:

  • 🎯 Three Steps to Diagnose Boot Issues: BIOS/UEFI boot repair, service dependency troubleshooting, root password recovery
  • 🔧 Hardware Identification Tools: Comprehensive identification tools for CPU, memory, disk, PCI, USB
  • 📊 Hardware Error Monitoring: mcelog and rasdaemon help you detect hardware issues in advance
  • 🚀 Virtualization Troubleshooting: KVM support checks, resource over-allocation, network configuration issues
  • 💡 Practical Case Analysis: The complete process of boot failure and hardware failure troubleshooting in real scenarios

Follow FYC, and say goodbye to the era of “despair at booting up”!

Third Paragraph: 5-Dimensional Scoring Table

Dimension | Score | Description
Difficulty Level | ⭐⭐⭐⭐ | Requires in-depth understanding of the boot process and hardware principles
Practical Value | ⭐⭐⭐⭐⭐ | A lifesaver during critical moments in daily work
Technical Depth | ⭐⭐⭐⭐ | Comprehensive coverage from boot processes to hardware diagnostics
Operability | ⭐⭐⭐⭐⭐ | All commands and tools can be used directly
Urgency | ⭐⭐⭐⭐⭐ | System boot issues are usually the highest priority failures

📚 Main Content: Packed with Useful Information, But Needs to Be “Fed to You”!

🚀 1. Troubleshooting Boot Issues: Bringing the System “Back to Life”

When a server won’t boot, it can be the most frustrating problem for operations engineers. But don’t panic, FYC is here to teach you how to troubleshoot and fix it step by step!

📋 Reviewing the Boot Process: Understanding Each Step of the Boot

Before we start fixing, we need to understand how the system boots. A complete boot process includes 10 key steps:

1. BIOS firmware starts → Executes Power-On Self-Test (POST)
2. BIOS scans boot devices → Orders by priority
3. Looks for boot records → MBR or bootable partition
4. Loads the first stage Boot Loader → Reads from MBR
5. Loads the second stage Boot Loader → grub2 configuration file
6. Parses grub.cfg → Selects boot option
7. Loads kernel and initrd → Prepares to boot
8. Kernel initializes hardware → Loads drivers
9. Mounts root filesystem → Switches to real root
10. Starts systemd → Loads services

If any of these 10 steps fails, the system won’t boot! Most boot issues occur in steps 4-6, which is the Boot Loader phase.

🔧 2. Fixing Boot Issues on Traditional BIOS Systems

If your server is still using traditional BIOS, then the GRUB2 repair method is your lifesaver!

GRUB2 File Structure: Knowing Where the Files Are to Fix Them

The key files of GRUB2 are distributed in the following locations:

/boot/                      # Kernel and initial RAMDISKS
/boot/grub2/                  # Configuration files, extended modules, themes
/boot/grub2/grub.cfg             # Main configuration file (auto-generated, do not edit manually!)
/etc/grub.d/                  # Scripts that generate configuration files
/etc/default/grub              # Configuration file variables (should edit this!)
/boot/grub2/grubenv             # Stores environment variables

⚠️ Important Note: /boot/grub2/grub.cfg is auto-generated, do not edit it manually! To modify GRUB2 configuration, you should edit /etc/default/grub, then run grub2-mkconfig to regenerate it.

Configuring GRUB2: Just Modify These Parameters

Common configuration parameters:

# Edit GRUB configuration
vim /etc/default/grub

# Main parameter descriptions:
GRUB_TIMEOUT=5              # Boot menu display time (seconds)
GRUB_DEFAULT=0              # Default boot option (counting starts from 0)
GRUB_DEFAULT=saved          # Use the last saved selection
GRUB_CMDLINE_LINUX="..."    # Kernel command line parameters

# Regenerate configuration file
grub2-mkconfig -o /boot/grub2/grub.cfg

Practical Example: Modifying Boot Timeout

# 1. Edit configuration file
vim /etc/default/grub
# Modify: GRUB_TIMEOUT=10  (change to 10 seconds)

# 2. Regenerate grub.cfg
grub2-mkconfig -o /boot/grub2/grub.cfg

# 3. Verify configuration
grep timeout /boot/grub2/grub.cfg
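
When the same change has to be applied on several servers, opening vim on each one gets tedious. The sketch below shows one way to script the edit; the function name set_grub_timeout is made up for this illustration, and it assumes the file uses plain KEY=value lines as /etc/default/grub normally does.

```shell
# set_grub_timeout FILE SECONDS   (hypothetical helper, not part of GRUB2)
# Replaces an existing GRUB_TIMEOUT= line, or appends one if the key is missing.
set_grub_timeout() {
    file=$1
    seconds=$2
    if grep -q '^GRUB_TIMEOUT=' "$file"; then
        # Write the result to a temp file, then copy it back so the original
        # file keeps its owner, mode and SELinux context.
        sed "s/^GRUB_TIMEOUT=.*/GRUB_TIMEOUT=$seconds/" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" &&
            rm -f "$file.tmp"
    else
        printf 'GRUB_TIMEOUT=%s\n' "$seconds" >> "$file"
    fi
}

# Typical use as root, followed by regenerating grub.cfg:
#   cp /etc/default/grub /etc/default/grub.bak
#   set_grub_timeout /etc/default/grub 10
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

Keeping a backup copy first, as in the usage comment, makes it easy to roll back if the regenerated menu is not what you expected.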

Reinstalling GRUB2 in MBR: The Ultimate Method to Fix Boot Loaders

If GRUB2 is corrupted or the MBR is damaged, you need to reinstall it in rescue mode:

Method 1: Using Rescue Mode (Recommended)

# 1. Boot from installation media, select "Rescue an installed system"
# 2. Choose option 1 (Continue), the system will mount to /mnt/sysimage
# 3. Press Enter to get a shell

# 4. chroot into the system
chroot /mnt/sysimage

# 5. Verify if /boot is mounted
ls -l /boot

# 6. Reinstall GRUB2 to MBR
grub2-install /dev/vda    # Adjust according to your disk device name

# 7. Regenerate configuration file
grub2-mkconfig -o /boot/grub2/grub.cfg

# 8. Exit and reboot
exit
reboot

Method 2: Fixing on a Running System (If You Can Still Log In)

# If the system can still boot, run directly

grub2-install /dev/vda
grub2-mkconfig -o /boot/grub2/grub.cfg
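
Before reinstalling, it can be useful to see whether the first sector still looks like a boot sector at all. A valid MBR ends with the two signature bytes 0x55 0xAA at offsets 510 and 511. The following is a minimal sketch with a made-up function name; a present signature does not prove that the GRUB2 boot code is intact, it only shows the sector was not wiped completely.

```shell
# has_mbr_signature DEVICE_OR_IMAGE   (hypothetical helper)
# Succeeds if bytes 510-511 of the first sector are 0x55 0xAA.
has_mbr_signature() {
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Example (as root, device name is an assumption taken from the steps above):
#   has_mbr_signature /dev/vda && echo "signature present" || echo "signature missing"
```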

🔌 3. Resolving Boot Loader Issues on UEFI Systems

If your server uses UEFI (most new servers do), the repair method will be different.

UEFI vs BIOS: Key Differences to Know

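Feature | BIOS | UEFI
Boot Record | MBR (first 512-byte sector) | EFI System Partition on a GPT disk
Disk Size | Up to 2 TiB (MBR partition table) | Over 2 TiB (GPT)
Boot Registration | Firmware scans devices | OS registers boot entries with the firmware
Configuration File | /boot/grub2/grub.cfg | /boot/efi/EFI/redhat/grub.cfg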
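
Which repair path applies depends on how the running system was booted. On Linux, the kernel exposes the directory /sys/firmware/efi only when it was started through UEFI, so a quick check can look like the sketch below. The function name and the optional sysfs-root argument are made up for illustration.

```shell
# firmware_mode [SYSFS_ROOT]   (hypothetical helper)
# Prints "UEFI" if the kernel exposes EFI data under sysfs, otherwise "BIOS".
firmware_mode() {
    sysfs=${1:-/sys}
    if [ -d "$sysfs/firmware/efi" ]; then
        echo UEFI
    else
        echo BIOS
    fi
}

# Example:
#   firmware_mode
```

Note that in rescue mode this reports how the rescue media was booted, which may differ from how the installed system normally boots.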

UEFI Boot Chain: shim → grub → kernel

UEFI systems use shim as a bridge for secure boot:

UEFI firmware → shim.efi → grubx64.efi → kernel

Role of shim:

  • Signed with keys trusted by UEFI firmware
  • Verifies and loads grubx64.efi
  • Supports Secure Boot

Fixing UEFI Boot Issues: Three Steps to Success

# 1. Reinstall grub2-efi and shim
# (package names vary by release; newer ones may use grub2-efi-x64 and shim-x64)
yum reinstall grub2-efi shim

# 2. If the configuration file is deleted, regenerate it
grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

# 3. If the UEFI boot menu is deleted, it will be automatically added back on reboot

Managing UEFI Boot Targets: The efibootmgr Tool

efibootmgr is a dedicated tool for managing UEFI boot entries:

# Install the tool
yum install -y efibootmgr

# View current boot targets
efibootmgr

# Delete a boot entry
efibootmgr -B -b 001E

# Set a temporary boot target (BootNext, effective for the next boot only)
efibootmgr -n 002C

# Add a new boot entry
efibootmgr -c -d /dev/sda -p 2 -L "MyLinux" -l "\EFI\redhat\grubx64.efi"
# Note: Use backslashes in the path!
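
Before deleting or reordering entries, it helps to know which entry the machine actually booted from. efibootmgr prints a BootCurrent line followed by one BootXXXX line per entry, so the matching entry can be picked out with awk. This is a sketch only; the function name is made up, and the sample labels in the usage note are illustrative.

```shell
# current_boot_entry   (hypothetical helper)
# Reads `efibootmgr` output on stdin and prints the BootXXXX line that
# BootCurrent points to.
current_boot_entry() {
    awk '
        /^BootCurrent:/ { current = $2 }
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
            entries[substr($0, 5, 4)] = $0
        }
        END { if (current in entries) print entries[current] }
    '
}

# Example (as root):
#   efibootmgr | current_boot_entry
```

If the line printed is not the entry you expected, check BootOrder before removing anything with -B.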

⚙️ 4. Handling Failed Services: Troubleshooting systemd Dependencies

The system can boot, but services won’t start? This is usually due to service dependency issues!

Types of systemd Dependencies: Six Relationships to Know

systemd supports six types of dependencies:

  1. Requires=: Strong dependency, required services
    • If the required service fails to start, the current service will also fail
  2. Wants=: Weak dependency, optional services
    • If the wanted service fails to start, the current service can still start
  3. Requisite=: Required and must already be running
    • The required service must already be active; otherwise the current service fails immediately instead of starting it
  4. Conflicts=: Conflict relationship
    • Starting the current service will stop the conflicting service
  5. Before= / After=: Start order
    • Specifies the order in which services start; ordering alone does not pull a service in
  6. RequiresOverridable=: Overridable dependency
    • Behaves like Requires=, except that when the unit is started explicitly by the administrator, a failed dependency will not cause the unit to fail (older systemd releases only; the directive has been removed from newer versions)

    Checking Service Dependencies: Three Methods to Choose From

    Method 1: View Service Details

    systemctl show httpd.service
    

    Method 2: View Dependency Tree (Most Common!)

    # View all dependencies of the service (tree view)
    systemctl list-dependencies httpd.service
    
    # Recursively view all dependencies
    systemctl list-dependencies --all httpd.service
    
    # View reverse dependencies (who depends on this service)
    systemctl list-dependencies --reverse httpd.service
    

    Method 3: Graphical View of Dependencies

    # Install graphviz
    yum install -y graphviz
    
    # Generate dependency graph
    systemd-analyze dot httpd.service | dot -Tsvg > httpd-deps.svg
    

    Resolving Service Dependency Issues: Practical Case Analysis

    Scenario: Both httpd and vsftpd fail to start due to mutual dependencies, forming a circular dependency.

    # 1. Check dependency relationships
    systemctl list-dependencies httpd.service
    systemctl list-dependencies vsftpd.service
    
    # 2. Check service status
    systemctl status httpd.service
    systemctl status vsftpd.service
    
    # 3. View service unit files
    systemctl cat httpd.service
    systemctl cat vsftpd.service
    
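    # 4. Modify dependency relationships (if necessary)
    # If the dependency was added in a drop-in file, edit that file:
    vim /etc/systemd/system/httpd.service.d/override.conf
    # Remove or correct the Requires=/After= line that creates the loop.
    # A drop-in can only add dependencies; to remove one defined in the unit
    # file itself, edit a full copy with: systemctl edit --full httpd.service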
    
    # 5. Reload systemd configuration
    systemctl daemon-reload
    
    # 6. Restart services
    systemctl restart httpd.service
    systemctl restart vsftpd.service
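
Reading two dependency trees side by side to spot a loop is error-prone. Since systemctl show prints properties as Key=value lines with space-separated unit names, the check for "does A list B" can be scripted. The helper below is a sketch with a made-up name; it only tests one property of one unit at a time, so a mutual dependency needs two calls.

```shell
# has_dep PROPERTY UNIT   (hypothetical helper)
# Reads `systemctl show` output on stdin and succeeds if UNIT is listed
# in the given PROPERTY (for example Requires or After).
has_dep() {
    awk -F= -v p="$1" -v u="$2" '
        $1 == p {
            n = split($2, deps, " ")
            for (i = 1; i <= n; i++) if (deps[i] == u) found = 1
        }
        END { exit !found }
    '
}

# Example: report a mutual Requires= between the two services from the scenario
#   if systemctl show -p Requires httpd.service  | has_dep Requires vsftpd.service &&
#      systemctl show -p Requires vsftpd.service | has_dep Requires httpd.service; then
#       echo "httpd and vsftpd require each other"
#   fi
```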
    

    Debugging Tips: Use debug-shell to Obtain Early Root Shell

    If the system hangs on a service during boot, you can enable debug shell for troubleshooting:

    # Enable debug shell (⚠️ Must disable after debugging!)
    systemctl enable debug-shell.service
    
    # After rebooting, there will be a root shell on tty9
    # Press Ctrl+Alt+F9 to switch to tty9
    
    # Disable debug shell after debugging
    systemctl disable debug-shell.service
    

    ⚠️ Security Warning: debug-shell will provide root access to anyone; it must be disabled after debugging!

    🔑 5. Recovering Lost Root Password: The rd.break Method

    Forgot the root password? Don’t panic! FYC will teach you how to “break in” and reset the password.

    Method 1: Interrupting the Boot Process with rd.break (Recommended)

    This is the fastest method and does not require external media:

    # 1. Reboot the system, when the GRUB menu appears, select the boot option and press 'e' to edit
    
    # 2. Find the line starting with linux16 or linuxefi (plain "linux" on newer releases)
    
    # 3. Add rd.break at the end of the line (separated by a space)
    # For example: linux16 /vmlinuz ... rd.break
    
    # 4. Press Ctrl+X to boot
    
    # 5. The system will stop at the initramfs stage, and you will see a switch_root shell
    
    # 6. Remount the root filesystem as writable
    mount -o remount,rw /sysroot
    
    # 7. chroot into the system
    chroot /sysroot
    
    # 8. Reset the root password
    echo redhat | passwd --stdin root
    
    # 9. Fix SELinux context (Important!)
    load_policy -i
    restorecon -Rv /etc
    
    # 10. Exit and reboot
    exit
    exit
    

    Method 2: Using Rescue Mode

    If more comprehensive repairs are needed, you can use the rescue mode from the installation media:

    # 1. Boot from installation media, select "Rescue an installed system"
    # 2. Choose option 1 (Continue), mount to /mnt/sysimage
    # 3. chroot into the system
    chroot /mnt/sysimage
    
    # 4. Reset password or edit /etc/shadow file
    passwd root
    # or
    vim /etc/shadow
    
    # 5. Exit and reboot
    exit
    reboot
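
If you edit /etc/shadow by hand, it is worth confirming what state the root entry is in before rebooting. In the shadow format, the second colon-separated field holds the password hash; a field starting with ! or * means the account cannot log in with a password, and an empty field means no password at all. The sketch below uses a made-up function name and takes the file path as an optional argument so it can be pointed at /mnt/sysimage/etc/shadow from rescue mode.

```shell
# root_password_state [SHADOW_FILE]   (hypothetical helper)
# Prints "set", "locked" or "empty" for the root entry.
root_password_state() {
    shadow=${1:-/etc/shadow}
    awk -F: '$1 == "root" {
        if ($2 == "")            print "empty"
        else if ($2 ~ /^[!*]/)   print "locked"
        else                     print "set"
    }' "$shadow"
}

# Example (as root, from the rescue shell before chroot):
#   root_password_state /mnt/sysimage/etc/shadow
```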
    

    🔍 6. Identifying Hardware Issues: Letting Hardware “Speak”

    The system can boot, but frequently has issues? It is likely that there are hardware failures! FYC will teach you how to let the hardware “speak”.

    Hardware Identification Toolset: Six Essential Tools

    1. Identifying CPU: lscpu Command

    # View CPU information
    lscpu
    
    # View CPU supported flags (important!)
    grep flags /proc/cpuinfo
    
    # Check virtualization support
    grep -E 'vmx|svm' /proc/cpuinfo
    # vmx: Intel CPU virtualization support
    # svm: AMD CPU virtualization support
    

    2. Identifying Memory: dmidecode Tool

    # Install the tool
    yum install -y dmidecode
    
    # View memory information
    dmidecode -t memory
    
    # View detailed memory information
    dmidecode -t 17
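
The type 17 output is long on servers with many slots. When the question is simply "which slots are populated, and with what size", the Size and Locator fields of each Memory Device block are enough. The filter below is a sketch with a made-up name; it assumes the usual dmidecode layout where Size appears before Locator inside each block.

```shell
# populated_dimms   (hypothetical helper)
# Reads `dmidecode -t 17` output on stdin and prints "LOCATOR: SIZE"
# for every slot that has a module installed.
populated_dimms() {
    awk '
        /^Memory Device/ { size = "" }
        /^[[:space:]]*Size:/ {
            sub(/^[[:space:]]*Size:[[:space:]]*/, "")
            size = $0
        }
        /^[[:space:]]*Locator:/ {
            sub(/^[[:space:]]*Locator:[[:space:]]*/, "")
            if (size != "" && size != "No Module Installed") print $0 ": " size
        }
    '
}

# Example (as root):
#   dmidecode -t 17 | populated_dimms
```

Comparing this list with the slot named in an error report makes it easier to find the physical module to replace.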
    

    3. Identifying Disks: lsscsi and hdparm

    # Install the tools
    yum install -y lsscsi hdparm
    
    # View SCSI devices
    lsscsi
    
    # View detailed disk information
    hdparm -I /dev/sda
    

    4. Identifying PCI Hardware: lspci Command

    # View PCI devices
    lspci
    
    # View detailed information (-v increases detail level)
    lspci -v
    lspci -vv    # More detailed
    lspci -vvv   # Most detailed
    
    # View specific devices
    lspci | grep -i network
    lspci | grep -i video
    

    5. Identifying USB Hardware: lsusb Command

    # Install the tool
    yum install -y usbutils
    
    # View USB devices
    lsusb
    
    # View detailed information
    lsusb -v
    

    Hardware Error Monitoring: The Eyes to Detect Problems Early

    Tool 1: mcelog – Machine Check Exception Log

    # Install the tool
    yum install -y mcelog
    
    # Start the service
    systemctl start mcelog.service
    systemctl enable mcelog.service
    
    # View logs
    journalctl -u mcelog.service
    
    # View /var/log/mcelog file (if configured with cron)
    tail -f /var/log/mcelog
    

    Tool 2: rasdaemon – Reliability, Availability, Serviceability Daemon

    # Install the tool
    yum install -y rasdaemon
    
    # Start the service
    systemctl start rasdaemon.service
    systemctl enable rasdaemon.service
    
    # View status
    ras-mc-ctl --status
    
    # View errors
    ras-mc-ctl --errors
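
On a machine where neither daemon was running when the fault happened, the kernel log is often the only record left. The filter below is a rough sketch that pulls out lines mentioning machine checks, hardware errors or EDAC. The function name is made up and the patterns are a heuristic, since kernel message wording differs between versions and platforms.

```shell
# hw_error_lines   (hypothetical helper)
# Reads log text on stdin and keeps lines that look like hardware error reports.
hw_error_lines() {
    grep -i -E 'machine check|hardware error|EDAC'
}

# Examples:
#   dmesg | hw_error_lines
#   journalctl -k -b -1 | hw_error_lines    # kernel log of the previous boot
```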
    

    Memory Testing: memtest86+ Helps Identify Memory Failures

    If you suspect there are memory issues, you can use memtest86+ for testing:

    # Install the tool
    yum install -y memtest86+
    
    # Run memtest-setup
    memtest-setup
    
    # Regenerate GRUB2 configuration
    grub2-mkconfig -o /boot/grub2/grub.cfg
    
    # Reboot the system, and the memtest86+ option will appear in the GRUB menu
    # Select it for memory testing
    

    🧩 7. Managing Kernel Modules: Making Hardware Drivers Work

    After identifying the hardware, the next step is to ensure the drivers are functioning properly. Kernel module management is key!

    Viewing and Managing Kernel Modules

    # View loaded modules
    lsmod
    
    # View module details
    modinfo megaraid_sas
    
    # View parameters supported by the module
    modinfo -p megaraid_sas
    
    # Load module
    modprobe megaraid_sas
    
    # Unload module
    modprobe -r megaraid_sas
    
    # View module dependencies
    modprobe --show-depends megaraid_sas
    

    Configuring Module Parameters: Making Drivers Work as Needed

    Many kernel modules support parameters to adjust behavior:

    Method 1: Temporary Setting (Effective Immediately, Lost After Reboot)

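    # Set parameters when loading the module
    # (the megaraid_sas parameter is msix_disable; confirm with: modinfo -p megaraid_sas)
    modprobe megaraid_sas msix_disable=1
    
    # View current parameter values
    cat /sys/module/megaraid_sas/parameters/msix_disable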
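
Checking parameters one file at a time is slow when a driver exposes many of them. Every readable parameter of a loaded module appears as a file under /sys/module/NAME/parameters, so a short loop can print them all. This is a sketch; the function name and the optional sysfs-root argument are made up for illustration.

```shell
# module_params MODULE [SYSFS_ROOT]   (hypothetical helper)
# Prints name=value for each readable parameter of a loaded module.
module_params() {
    dir=${2:-/sys}/module/$1/parameters
    [ -d "$dir" ] || { echo "no parameters exposed for $1" >&2; return 1; }
    for f in "$dir"/*; do
        [ -r "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
}

# Example:
#   module_params megaraid_sas
```

Some parameters are write-only or hidden from sysfs, so treat the output as the current readable values rather than the full list that modinfo -p reports.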
    

    Method 2: Permanent Setting (Recommended)

    # Create module configuration file
    vim /etc/modprobe.d/megaraid_sas.conf
    
    # Add configuration (msix_disable=1 turns MSI-X interrupt handling off)
    options megaraid_sas msix_disable=1
    
    # If the module is already loaded, unload and reload it
    # (unloading fails while disks behind the controller are in use)
    modprobe -r megaraid_sas
    modprobe megaraid_sas
    
    # Verify parameters
    cat /sys/module/megaraid_sas/parameters/msix_disable
    # Should display: 1
    

    Practical Example: Disabling MSI-X Interrupts for SAS RAID Card

    # Problem: Frequent MSI-X interrupt errors in logs
    # Solution: Disable MSI-X interrupt handling in the driver
    
    # 1. Create configuration file
    cat > /etc/modprobe.d/megaraid_sas.conf << EOF
    options megaraid_sas msix_disable=1
    EOF
    
    # 2. Reload module (fails while disks behind the controller are in use)
    modprobe -r megaraid_sas
    modprobe megaraid_sas
    
    # 3. Verify
    cat /sys/module/megaraid_sas/parameters/msix_disable
    # Should display: 1
    

    🖥️ 8. Handling Virtualization Issues: KVM Troubleshooting

    If your environment uses KVM virtualization, then troubleshooting virtual machine issues is also an essential skill!

    Checking Hardware Virtualization Support

    KVM requires both CPU and firmware to support hardware virtualization:

    # 1. Check if the CPU supports virtualization
    grep -E 'vmx|svm' /proc/cpuinfo
    # Intel CPU shows vmx, AMD CPU shows svm
    
    # 2. Try loading KVM module
    modprobe kvm-intel    # Intel CPU
    # or
    modprobe kvm-amd      # AMD CPU
    
    # 3. Check using virsh
    virsh capabilities | grep -A 5 kvm
    
    # ⚠️ Important: If hardware virtualization is unavailable, VMs will run on emulated processors, which will be significantly slower!
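
The CPU flag and the loaded module are two separate conditions, and it is easy to check only one of them. The sketch below combines both: it looks for vmx or svm in the CPU flags and then for the /dev/kvm device node that appears once the KVM module has loaded successfully. The function name and the status strings are made up for this illustration.

```shell
# kvm_status [CPUINFO_FILE] [KVM_DEVICE]   (hypothetical helper)
kvm_status() {
    cpuinfo=${1:-/proc/cpuinfo}
    dev=${2:-/dev/kvm}
    if ! grep -qwE 'vmx|svm' "$cpuinfo"; then
        echo "no-cpu-flag"              # CPU does not advertise virtualization
    elif [ ! -e "$dev" ]; then
        echo "flag-but-no-kvm-device"   # module not loaded, or disabled in firmware
    else
        echo "kvm-available"
    fi
}

# Example:
#   kvm_status
```

When the flag is present but the device is missing, the kernel log from the modprobe attempt usually says whether the firmware has the feature switched off.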
    

    Checking Resource Over-Allocation

    libvirt allows you to allocate more virtual resources to VMs than the actual resources of the host, which can lead to performance issues:

    # View host resources
    virsh nodecpustats
    virsh nodememstats
    
    # View virtual machine resources
    virsh dommemstat vm1
    virsh vcpuinfo vm1
    
    # Use top to view VM processes (VMs appear as regular processes on the host)
    # top -p expects a comma-separated PID list, so let pgrep join the PIDs
    top -p "$(pgrep -d, -f qemu)"
    

    Resolving Resource Over-Allocation Issues:

    1. Add more physical resources (the simplest solution)
    2. Limit VM resource usage (using cgroups)
    3. Stop unnecessary VMs (to free up resources)
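
To decide which of these options is needed, it helps to put a number on the over-allocation. A simple measure is the ratio of vCPUs assigned to running VMs to the CPUs of the host. The sketch below only does the arithmetic; the function name is made up, and what ratio is acceptable depends on the workload, so no threshold is assumed here.

```shell
# vcpu_overcommit HOST_CPUS VCPUS...   (hypothetical helper)
# Prints total assigned vCPUs divided by host CPUs, with two decimals.
vcpu_overcommit() {
    host=$1
    shift
    total=0
    for n in "$@"; do
        total=$((total + n))
    done
    awk -v t="$total" -v h="$host" 'BEGIN { printf "%.2f\n", t / h }'
}

# Example: gather the vCPU count of each running VM, then compare with the host
#   counts=$(for vm in $(virsh list --name); do
#                virsh dominfo "$vm" | awk '/^CPU\(s\)/ { print $2 }'
#            done)
#   vcpu_overcommit "$(nproc)" $counts
```

A value above 1.00 means more vCPUs are assigned than the host has CPUs; whether that hurts depends on how busy the guests are at the same time.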

    Validating libvirt XML Configuration

    libvirt’s virtual machine configuration is stored in XML format, and configuration errors can prevent VMs from starting:

    # Validate XML file syntax
    xmllint --noout /etc/libvirt/qemu/vm1.xml
    
    # Validate XML configuration correctness (more strict checks)
    # The optional second argument is a schema name such as "domain", not a file path
    virt-xml-validate /etc/libvirt/qemu/vm1.xml domain
    
    # ⚠️ Note: Do not manually edit files under /etc/libvirt/; use virsh or virt-manager instead!
    

    Troubleshooting Virtual Network Issues

    libvirt uses software bridges to implement virtual networks, and network issues are also common failures:

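    Common Issue 1: VM Cannot Be Reached from Outside

    # Check the type of virtual network (a NAT network is not reachable from outside)
    virsh net-list
    virsh net-info default
    # net-info does not show the type; the forward mode is in the network XML
    virsh net-dumpxml default | grep -i forward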
    
    # Check firewall rules
    iptables -L -n | grep virbr
    
    # Check the bridge
    brctl show
    

    Common Issue 2: VM Cannot Reach External Networks

    # Check if the virtual network is isolated type (no <forward> element in its XML)
    virsh net-dumpxml default | grep -i forward
    
    # Check hypervisor firewall rules
    firewall-cmd --list-all
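
Both issues above come down to the forward mode of the virtual network. In libvirt's network XML the mode is the mode attribute of the forward element, and a network without a forward element is isolated. The sketch below extracts it; the function name is made up, and it assumes the single-quoted attribute style that virsh net-dumpxml prints.

```shell
# net_forward_mode   (hypothetical helper)
# Reads `virsh net-dumpxml NAME` output on stdin and prints the forward mode,
# or "isolated" when the network has no <forward> element.
net_forward_mode() {
    mode=$(sed -n "s/.*<forward[^>]*mode='\([^']*\)'.*/\1/p" | head -n 1)
    echo "${mode:-isolated}"
}

# Example:
#   virsh net-dumpxml default | net_forward_mode
```

A result of nat explains why the VM cannot be reached from outside, while isolated explains why the VM cannot reach outside at all.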
    

    Common Issue 3: Complete Network Disruption

    # If all iptables rules are cleared, libvirt's network may be disrupted
    # Solution: Restart libvirtd or the virtual network
    systemctl restart libvirtd.service
    # or
    virsh net-destroy default
    virsh net-start default
    

    💼 9. Practical Case: Complete Troubleshooting Process

    Having discussed so much theory, FYC will provide you with a practical case to see how these skills are combined!

    Case 1: System Fails to Boot (BIOS System)

    Scenario: The server cannot boot after a restart, and the GRUB menu is not visible.

    Troubleshooting Steps:

    # 1. Boot from installation media, enter rescue mode
    # Select "Rescue an installed system" → Option 1 (Continue)
    
    # 2. chroot into the system
    chroot /mnt/sysimage
    
    # 3. Check /boot directory
    ls -l /boot
    # Found: Kernel files exist, but grub2 directory may have issues
    
    # 4. Check disk device name
    lsblk
    
    # 5. Reinstall GRUB2
    grub2-install /dev/vda
    
    # 6. Regenerate configuration file
    grub2-mkconfig -o /boot/grub2/grub.cfg
    
    # 7. Verify configuration
    ls -l /boot/grub2/grub.cfg
    
    # 8. Exit and reboot
    exit
    exit
    reboot
    

    Case 2: Service Start Failure (Dependency Issue)

    Scenario: Both httpd and vsftpd fail to start; the system can boot, but services cannot run.

    Troubleshooting Steps:

    # 1. Check service status
    systemctl status httpd.service
    systemctl status vsftpd.service
    
    # 2. Check dependency relationships
    systemctl list-dependencies httpd.service
    systemctl list-dependencies vsftpd.service
    
    # 3. Check service logs
    journalctl -u httpd.service -n 50
    journalctl -u vsftpd.service -n 50
    
    # 4. Check service unit files
    systemctl cat httpd.service
    systemctl cat vsftpd.service
    
    # 5. Identify the issue: the two services depend on each other, forming a circular dependency
    
    # 6. Modify service configuration (remove conflicting dependencies)
    vim /etc/systemd/system/httpd.service.d/override.conf
    # [Unit]
    # After=vsftpd.service
    
    # 7. Reload configuration
    systemctl daemon-reload
    
    # 8. Restart services
    systemctl restart httpd.service
    systemctl restart vsftpd.service
    
    # 9. Verify
    systemctl status httpd.service
    systemctl status vsftpd.service
    

    Case 3: Hardware Failure Identification (Memory Error)

    Scenario: The system frequently experiences memory errors, suspected to be hardware failures.

    Troubleshooting Steps:

    # 1. Check hardware error logs
    journalctl -u mcelog.service
    ras-mc-ctl --errors
    
    # 2. View memory information
    dmidecode -t memory
    
    # 3. Run memory test
    yum install -y memtest86+
    memtest-setup
    grub2-mkconfig -o /boot/grub2/grub.cfg
    
    # 4. Reboot the system, select the memtest86+ option for testing
    # After testing, check the results
    
    # 5. Replace faulty memory modules based on test results
    

    🎁 Conclusion!

    📋 Value Summary

    Today, FYC has brought you a complete guide to Linux system boot and hardware failure troubleshooting:

    Three Steps to Diagnose Boot Issues:

    • BIOS/UEFI Boot Repair: Reinstalling GRUB2, Rebuilding Configuration Files
    • systemd Service Dependency Troubleshooting: Checking Dependencies, Resolving Circular Dependencies
    • Root Password Recovery: Interrupting Boot Process with rd.break, Resetting in Rescue Mode

    Hardware Identification Tools:

    • Comprehensive identification tools for CPU, memory, disk, PCI, USB
    • Hardware error monitoring: mcelog, rasdaemon to detect failures in advance
    • Memory testing: memtest86+ to identify memory failures

    Kernel Modules and Virtualization:

    • Kernel module parameter configuration: Making drivers work as needed
    • KVM Virtualization Support Check: Ensuring VM Performance
    • libvirt Network Troubleshooting: Resolving VM Network Issues

    By mastering these skills, you can bring the system “back to life” in critical moments, upgrading from a “firefighter” to a “system doctor”!

    🎯 Call to Action

    Want more than this article covers? Looking for a more detailed walkthrough of GRUB2 configuration files, an in-depth analysis of systemd dependencies, and more practical case studies for hardware failure troubleshooting?

    👉 Click “Read the Original” below to get:

    • 📚 Complete Boot Failure Troubleshooting Checklist
    • 🔧 Complete Interpretation of GRUB2 Configuration Files (All Parameter Descriptions)
    • 📊 Usage Guide for Visualizing systemd Dependencies
    • 🎯 Quick Reference for Hardware Failure Diagnosis Tools (All Commands and Parameters)
    • 💡 More Real Case Analyses (Covering BIOS/UEFI/Virtualization)

    FYC’s Mission: To enable every operations engineer to become a system doctor! Technology should be hardcore, and the writing should be engaging! 🔥

    #Operations #Linux #SystemBoot #HardwareFailure #Troubleshooting #GRUB2 #systemd #KVM #TechnicalContent #RedHat #RCA
