Ansible Firefighting Hotline Series – (18) Automated Network Fault Diagnosis

🔍 Ansible Firefighting Hotline | Tired of Troubleshooting Network Latency? One-Click Automated Diagnosis Turns You into a Network Expert!

Still chasing network latency problems by trial and error? This article walks through an Ansible-based solution that automates network latency analysis on RHEL8, so you can stop typing the same diagnostic commands by hand.

🎯 Pain Points Addressed

The daily routine of an operations engineer: network latency alarm → manual ping tests → check network interface status → review system logs → packet capture analysis → compare historical data… After a series of actions, several hours may pass, and the problem could still be elusive.

Worse, manual troubleshooting easily misses key information, follows no systematic procedure, and is slow to locate the root cause. An automated network diagnosis workflow is meant to address all three problems.

✨ Solution Preview

This article shares an Ansible project that automates the analysis of RHEL8 network latency issues. It is built from four core roles (three diagnostic roles plus report generation) and makes network troubleshooting standardized, automated, and repeatable.

Here is a sample report:

=====================================
RHEL8 Network Latency Issue Automated Analysis Report
=====================================

Analysis Time: 2025-08-29T02:55:17Z
Host Name: 10.66.208.231
Network Interface: ens33
Issue Severity: Normal

=====================================
1. Network Interface Status Analysis
=====================================

Interface Name: ens33
Interface Status: UP
IP Address: 10.66.208.231
Subnet Mask: 24
MTU Setting: 1500

Statistics:
- Number of Error Packets: 0
- Number of Dropped Packets: 0
- Number of Overruns: 0

Status Assessment:
- Error Status: Normal
- Dropped Packet Status: Normal
- MTU Status: Normal

=====================================
2. System-Level Network Diagnosis
=====================================

TCP Connection Statistics:
- TCP Retransmission Count: 0
- Retransmission Status: Normal
- Total Connections: 14
- Connection Status: Normal

System Resources:
- System Load: load average: 0.07, 0.02, 0.00
- Total Memory: 7.8G
- Memory Usage: None
- Number of Log Errors: 0

TCP Configuration Optimization:
- TCP Window Scaling: Disabled
- TCP Timestamps: Disabled
- TCP SACK: Disabled

=====================================
3. Real-Time Traffic Capture Analysis
=====================================

Capture File: packet_capture_10.66.208.231_1756436117.pcap
File Size: 2.3M
Packet Statistics:
- Total Packets: 11290
- Number of SYN Packets: 290
- SYN Flood Suspected: No

Traffic Analysis Results:
- Capture Status: Successful
- File Path: /var/tmp/packet_capture_10.66.208.231_1756436117.pcap

=====================================
4. Problem Classification and Severity Assessment
=====================================

Detected Problem Types:
- Configuration Issues

Problem Severity: Normal

=====================================
5. Repair Suggestions
=====================================

Recommended Repair Measures:
1. Adjust MTU setting to standard value 1500
2. Enable TCP window scaling and timestamp optimization

=====================================
6. Technical Details
=====================================

Analysis Configuration:
- Network Interface: ens33
- Capture Duration: 60 seconds
- Error Packet Threshold: 10
- TCP Retransmission Threshold: 5
- Interface Dropped Packet Threshold: 5

Execution Environment:
- Ansible Version: 2.16.14
- Python Version: 3.9.21
- System Architecture: x86_64

=====================================
7. Follow-Up Action Recommendations
=====================================

Maintenance Recommendations:
1. Continue monitoring network performance
2. Regularly perform network diagnostics
3. Keep the system updated
4. Record normal baseline data

=====================================
Report Generation Completed
======================================

Report File: network_analysis_10.66.208.231_1756436117.txt
Generation Time: 2025-08-29T02:55:17Z
Analysis Duration: N/A seconds

For technical support, please contact the system administrator.

⭐ Automation Scenario Rating

Rating Dimension | Rating | Description
Ease of Use | ⭐⭐⭐⭐⭐ | One-click execution, detailed comments, beginner-friendly
Reusability | ⭐⭐ | Variable-driven configuration, supports multi-host parallel execution
Stability | ⭐⭐⭐⭐⭐ | Idempotent design, comprehensive error handling
Scalability | ⭐⭐⭐⭐⭐ | Modular roles, easy to extend functionality
Best Practice Compliance | ⭐⭐⭐⭐⭐ | Follows Ansible best practices and coding standards

🗂️ Project Directory Structure

03_RHEL8_Network_Latency_Automated_Analysis/
├── inventory                    # Host inventory configuration
├── group_vars/
│   └── all.yml                 # Global variable configuration
├── playbook.yml                # Main playbook file
├── roles/                      # Four core diagnostic roles
│   ├── network_interface/      # Network interface check
│   ├── system_diagnosis/       # System diagnosis
│   ├── traffic_capture/        # Traffic capture
│   └── report_generation/      # Report generation
├── README.md                   # Project documentation
├── repair_instructions.md       # Troubleshooting guide
└── public_account_article.md    # This document

📄 Core File Content Overview

🎯 Main Playbook File (playbook.yml)

---
- name: RHEL8 Network Latency Automated Analysis
  hosts: RHEL8  # must match the group name defined in the inventory
  gather_facts: yes
  become: yes

  pre_tasks:
    - name: Record analysis start time
      ansible.builtin.set_fact:
        analysis_start_time: "{{ ansible_date_time.iso8601 }}"

    - name: Display analysis start information
      ansible.builtin.debug:
        msg: "Starting network latency analysis: {{ inventory_hostname }} ({{ ansible_date_time.iso8601 }})"

  roles:
    - role: network_interface
      tags: network_interface
    - role: system_diagnosis
      tags: system_diagnosis
    - role: traffic_capture
      tags: traffic_capture
    - role: report_generation
      tags: report_generation

  post_tasks:
    # ansible_date_time is captured once, when facts are gathered, so it
    # cannot supply the end time; read the target host's clock again instead.
    - name: Record analysis end time
      ansible.builtin.command: date +%s
      register: analysis_end_epoch
      changed_when: false

    - name: Calculate analysis duration
      ansible.builtin.set_fact:
        analysis_duration: "{{ (analysis_end_epoch.stdout | int) - (ansible_date_time.epoch | int) }}"

    - name: Display analysis completion information
      ansible.builtin.debug:
        msg: "Network latency analysis completed! Duration: {{ analysis_duration }} seconds"

🔧 Host Inventory Configuration (inventory)

[RHEL8]
10.66.208.232

[all:vars]
ansible_user=root
ansible_ssh_private_key_file=~/.ssh/id_rsa
ansible_become=yes
ansible_become_method=sudo
ansible_ssh_common_args='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'

⚙️ Global Variable Configuration (group_vars/all.yml)

# Network interface configuration
network_interface: "ens33"
network_interface_backup: "eth0"

# Traffic capture configuration
traffic_capture:
  enabled: true
  duration: 30
  packet_count: 1000
  output_dir: "/var/tmp"
  filename_prefix: "packet_capture"

# Report configuration
report_output_dir: "/tmp/network_analysis_reports"
report_filename: "network_analysis_{{ inventory_hostname }}_{{ ansible_date_time.epoch }}.txt"

# Diagnosis threshold configuration
diagnosis_thresholds:
  error_packets: 10
  dropped_packets: 5
  overrun_packets: 3
  tcp_retransmit_rate: 0.1

# System log keywords
log_keywords:
  - "bnx2x"
  - "e1000"
  - "igb"
  - "ixgbe"
  - "network"
  - "ethtool"
  - "link"
  - "carrier"

# Performance configuration
performance:
  max_log_lines: 1000
  timeout_seconds: 30
  retry_attempts: 3

# Cleanup configuration
cleanup_temp_files: true
backup_original_files: true

🔍 Network Interface Check Role (roles/network_interface/tasks/main.yml)

---
- name: Check network interface status
  ansible.builtin.command: ip link show {{ network_interface }}
  register: interface_status
  changed_when: false
  failed_when: false

# Backslashes are doubled below because the expressions sit inside double-quoted YAML.
- name: Parse interface status information
  ansible.builtin.set_fact:
    # Match the operational state; a bare 'UP' test would also hit the LOWER_UP flag.
    interface_up: "{{ 'state UP' in interface_status.stdout }}"
    interface_mtu: "{{ interface_status.stdout | regex_search('mtu (\\d+)', '\\1') | default(['unknown'], true) | first }}"

# Counter names in `ethtool -S` output vary by driver; a missing counter falls back to 0.
- name: Get interface statistics
  ansible.builtin.command: ethtool -S {{ network_interface }}
  register: interface_stats
  changed_when: false
  failed_when: false

- name: Parse error packet statistics
  ansible.builtin.set_fact:
    rx_errors: "{{ interface_stats.stdout | regex_search('rx_errors:\\s*(\\d+)', '\\1') | default(['0'], true) | first | int }}"
    tx_errors: "{{ interface_stats.stdout | regex_search('tx_errors:\\s*(\\d+)', '\\1') | default(['0'], true) | first | int }}"
    rx_dropped: "{{ interface_stats.stdout | regex_search('rx_dropped:\\s*(\\d+)', '\\1') | default(['0'], true) | first | int }}"
    tx_dropped: "{{ interface_stats.stdout | regex_search('tx_dropped:\\s*(\\d+)', '\\1') | default(['0'], true) | first | int }}"

- name: Get IP address information
  ansible.builtin.command: ip addr show {{ network_interface }}
  register: ip_info
  changed_when: false
  failed_when: false

- name: Parse IP address
  ansible.builtin.set_fact:
    interface_ip: "{{ ip_info.stdout | regex_search('inet (\\d+\\.\\d+\\.\\d+\\.\\d+)', '\\1') | default(['N/A'], true) | first }}"

- name: Summarize network interface check results
  ansible.builtin.debug:
    msg: |
      Network Interface Check Results:
      Interface: {{ network_interface }}
      Status: {{ 'UP' if interface_up else 'DOWN' }}
      MTU: {{ interface_mtu }}
      IP Address: {{ interface_ip }}
      Received Errors: {{ rx_errors }}
      Transmitted Errors: {{ tx_errors }}
      Received Dropped: {{ rx_dropped }}
      Transmitted Dropped: {{ tx_dropped }}
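
The counter names printed by `ethtool -S` depend on the NIC driver, so the lookups above can come back empty on some hardware. As a hedged alternative, the kernel also publishes driver-independent counters under /sys/class/net/<interface>/statistics/. The sketch below reads them from the shell. The function name `read_iface_counter` and the `SYSFS_NET` override are illustrative assumptions, not part of the project.

```shell
#!/bin/sh
# Print one kernel counter for an interface, or 0 if it cannot be read.
# SYSFS_NET is a hypothetical override that exists only to make the sketch testable.
read_iface_counter() {
    iface=$1
    counter=$2
    file="${SYSFS_NET:-/sys/class/net}/$iface/statistics/$counter"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo 0
    fi
}

# Example: print the four counters the role reports for ens33.
for name in rx_errors tx_errors rx_dropped tx_dropped; do
    echo "$name=$(read_iface_counter ens33 "$name")"
done
```

A task could call the same paths with `cat` and register the output, which avoids regex parsing altogether.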

🩺 System Diagnosis Role (roles/system_diagnosis/tasks/main.yml)

---
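- name: Check TCP connection statistics
  ansible.builtin.command: ss -s
  register: tcp_stats
  changed_when: false
  failed_when: false

# `ss -s` summarizes socket counts but does not report retransmissions, so the
# cumulative TcpRetransSegs counter is read with nstat (shipped with iproute2).
- name: Read TCP retransmission counter
  ansible.builtin.command: nstat -az TcpRetransSegs
  register: tcp_retrans_stats
  changed_when: false
  failed_when: false

- name: Parse TCP retransmission statistics
  ansible.builtin.set_fact:
    tcp_retransmit: "{{ tcp_retrans_stats.stdout | default('') | regex_search('TcpRetransSegs\\s+(\\d+)', '\\1') | default(['0'], true) | first | int }}"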

- name: Check network-related errors in system logs
  ansible.builtin.shell: |
    journalctl --since "1 hour ago" | grep -i "{{ item }}" | tail -{{ performance.max_log_lines }}
  register: system_logs
  loop: "{{ log_keywords }}"
  failed_when: false

- name: Summarize system log information
  ansible.builtin.set_fact:
    network_log_entries: "{{ system_logs.results | map(attribute='stdout') | list | join('\n') }}"

- name: Check system load
  ansible.builtin.shell: uptime
  register: system_load
  failed_when: false

- name: Check memory usage
  ansible.builtin.shell: free -h
  register: memory_usage
  failed_when: false

- name: Summarize system diagnosis results
  ansible.builtin.debug:
    msg: |
      System Diagnosis Results:
      TCP Retransmissions: {{ tcp_retransmit }}
      System Load: {{ system_load.stdout }}
      Memory Usage: {{ memory_usage.stdout_lines[1] if memory_usage.stdout_lines | length > 1 else 'N/A' }}
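
group_vars/all.yml defines `tcp_retransmit_rate: 0.1`, but the tasks above only produce an absolute count. One way to obtain a rate, assuming the threshold is meant as the fraction of sent segments that were retransmitted, is to divide RetransSegs by OutSegs from /proc/net/snmp. The helper name `tcp_retrans_ratio` is illustrative. Both counters are cumulative since boot, so the result is a long-run average rather than a current rate.

```shell
#!/bin/sh
# Print RetransSegs / OutSegs from /proc/net/snmp-style input on stdin.
# The file carries two "Tcp:" lines: a header row of names, then a row of values.
tcp_retrans_ratio() {
    awk '
        $1 == "Tcp:" && !seen_header {
            # Header row: remember which columns hold the two counters.
            for (i = 2; i <= NF; i++) {
                if ($i == "OutSegs") out_col = i
                if ($i == "RetransSegs") retrans_col = i
            }
            seen_header = 1
            next
        }
        $1 == "Tcp:" && out_col && retrans_col {
            # Value row: guard against division by zero on an idle host.
            if ($out_col > 0) printf "%.4f\n", $retrans_col / $out_col
            else print "0.0000"
            exit
        }
    '
}

# Usage on a live host:
if [ -r /proc/net/snmp ]; then
    tcp_retrans_ratio < /proc/net/snmp
fi
```

Comparing two samples taken a minute apart would give a rate for that interval instead of an average since boot.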

📹 Traffic Capture Role (roles/traffic_capture/tasks/main.yml)

---
- name: Check if tcpdump is available
  ansible.builtin.command: which tcpdump
  register: tcpdump_check
  failed_when: false

- name: Install tcpdump (if not installed)
  ansible.builtin.dnf:
    name: tcpdump
    state: present
  when: tcpdump_check.rc != 0

- name: Create traffic capture directory
  ansible.builtin.file:
    path: "{{ traffic_capture.output_dir }}"
    state: directory
    mode: '0755'
  when: traffic_capture.enabled

- name: Start traffic capture
  ansible.builtin.shell: |
    timeout {{ traffic_capture.duration }} tcpdump -i {{ network_interface }} -w {{ traffic_capture.output_dir }}/{{ traffic_capture.filename_prefix }}_{{ inventory_hostname }}_{{ ansible_date_time.epoch }}.pcap -c {{ traffic_capture.packet_count }}
  register: capture_result
  async: "{{ traffic_capture.duration + 10 }}"
  poll: 0
  when: traffic_capture.enabled

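- name: Wait for traffic capture to complete
  ansible.builtin.async_status:
    jid: "{{ capture_result.ansible_job_id }}"
  register: capture_status
  until: capture_status.finished
  # Poll long enough to cover the whole capture window (duration / delay, plus a margin).
  retries: "{{ (traffic_capture.duration // 5) + performance.retry_attempts }}"
  delay: 5
  # `timeout` exits with 124 when it stops tcpdump at the time limit, which is expected here.
  failed_when: capture_status.rc | default(0) not in [0, 124]
  when: traffic_capture.enabled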

- name: Display traffic capture results
  ansible.builtin.debug:
    msg: "Traffic capture completed: {{ traffic_capture.output_dir }}/{{ traffic_capture.filename_prefix }}_{{ inventory_hostname }}_{{ ansible_date_time.epoch }}.pcap"
  when: traffic_capture.enabled
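
The sample report lists a SYN packet count, which the tasks above do not compute. A minimal sketch of how such a count could be taken from the capture file follows, using tcpdump's read mode and the usual pcap filter for SYN-only segments. `count_syn_packets` is a hypothetical helper, not part of the role.

```shell
#!/bin/sh
# Count TCP segments with SYN set and ACK clear (new connection attempts) in a pcap.
count_syn_packets() {
    pcap=$1
    if [ ! -r "$pcap" ]; then
        echo "cannot read capture file: $pcap" >&2
        return 1
    fi
    # -nn skips name resolution; -r reads packets from the file instead of a NIC.
    # The tcp[tcpflags] form is documented for IPv4; IPv6 support depends on the libpcap version.
    tcpdump -nn -r "$pcap" 'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' 2>/dev/null | wc -l
}

# Usage (file name pattern follows the role's variables):
# count_syn_packets /var/tmp/packet_capture_<host>_<epoch>.pcap
```

Whether a given count indicates a SYN flood depends on the host's normal traffic, so any threshold should come from a recorded baseline.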

📊 Report Generation Role (roles/report_generation/tasks/main.yml)

---
- name: Create report directory
  ansible.builtin.file:
    path: "{{ report_output_dir }}"
    state: directory
    mode: '0755'

- name: Generate network latency analysis report
  ansible.builtin.template:
    src: network_analysis_report.j2
    dest: "{{ report_output_dir }}/{{ report_filename }}"
    mode: '0644'

- name: Display report generation results
  ansible.builtin.debug:
    msg: "Network latency analysis report generated: {{ report_output_dir }}/{{ report_filename }}"

- name: Display report content preview
  ansible.builtin.shell: head -20 "{{ report_output_dir }}/{{ report_filename }}"
  register: report_preview
  failed_when: false

- name: Output report preview
  ansible.builtin.debug:
    msg: "{{ report_preview.stdout_lines }}"

📋 Report Template (roles/report_generation/templates/network_analysis_report.j2)

=====================================
RHEL8 Network Latency Issue Diagnosis Report
=====================================
Analysis Time: {{ analysis_start_time }}
Host Name: {{ inventory_hostname }}
Network Interface: {{ network_interface }}
Analysis Duration: {{ analysis_duration | default('N/A') }} seconds

1. Network Interface Status Check
=====================================
Interface Name: {{ network_interface }}
Interface Status: {{ 'UP' if interface_up else 'DOWN' }}
MTU Setting: {{ interface_mtu }}
IP Address: {{ interface_ip }}

2. Hardware Statistics
=====================================
Received Error Packets: {{ rx_errors }}
Transmitted Error Packets: {{ tx_errors }}
Received Dropped Packets: {{ rx_dropped }}
Transmitted Dropped Packets: {{ tx_dropped }}

3. System Diagnosis Results
=====================================
TCP Retransmission Count: {{ tcp_retransmit }}
System Load: {{ system_load.stdout if system_load.stdout else 'N/A' }}

4. Problem Diagnosis Conclusion
=====================================
{% if rx_errors | int > diagnosis_thresholds.error_packets or tx_errors | int > diagnosis_thresholds.error_packets %}
[Severe Issue] Detected network interface error packets exceeding threshold
{% endif %}
{% if rx_dropped | int > diagnosis_thresholds.dropped_packets or tx_dropped | int > diagnosis_thresholds.dropped_packets %}
[Potential Issue] Detected network interface dropped packets exceeding threshold
{% endif %}
{% if tcp_retransmit | int > 0 %}
[Notice] Detected TCP retransmissions, indicating possible network instability
{% endif %}

5. Suggested Measures
=====================================
{% if rx_errors | int > diagnosis_thresholds.error_packets or tx_errors | int > diagnosis_thresholds.error_packets %}
- Check physical network connections
- Verify cable quality
- Consider replacing network card drivers
{% endif %}
{% if rx_dropped | int > diagnosis_thresholds.dropped_packets or tx_dropped | int > diagnosis_thresholds.dropped_packets %}
- Check for network congestion
- Adjust network buffer sizes
- Optimize application network usage
{% endif %}
{% if tcp_retransmit | int > 0 %}
- Check network latency
- Optimize TCP parameters
- Consider using a more stable network path
{% endif %}

6. Traffic Capture File
=====================================
{% if traffic_capture.enabled %}
Capture File: {{ traffic_capture.output_dir }}/{{ traffic_capture.filename_prefix }}_{{ inventory_hostname }}_{{ ansible_date_time.epoch }}.pcap
Use tools like Wireshark for in-depth analysis
{% else %}
Traffic capture not enabled
{% endif %}

=====================================
Report Generation Completed
======================================
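
The template prints each finding on its own line, while the sample report also carries a single "Issue Severity" value. One possible way to derive a single label is sketched below in shell, mirroring the template's three checks and reporting the worst one first. The ordering, the function name `classify_severity`, and the `ERROR_PACKETS` / `DROPPED_PACKETS` overrides are assumptions for illustration; the defaults copy the thresholds from group_vars/all.yml.

```shell
#!/bin/sh
# Collapse the template's three checks into one label, worst finding first.
# Arguments: rx_errors tx_errors rx_dropped tx_dropped tcp_retransmit
classify_severity() {
    rx_err=$1 tx_err=$2 rx_drop=$3 tx_drop=$4 retrans=$5
    err_max=${ERROR_PACKETS:-10}
    drop_max=${DROPPED_PACKETS:-5}
    if [ "$rx_err" -gt "$err_max" ] || [ "$tx_err" -gt "$err_max" ]; then
        echo "Severe Issue"
    elif [ "$rx_drop" -gt "$drop_max" ] || [ "$tx_drop" -gt "$drop_max" ]; then
        echo "Potential Issue"
    elif [ "$retrans" -gt 0 ]; then
        echo "Notice"
    else
        echo "Normal"
    fi
}

classify_severity 0 0 7 0 0   # prints "Potential Issue"
```

As in the template, the comparisons are strict, so a counter exactly equal to its threshold still counts as normal.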

🚀 How to Use?

🛠️ Foolproof Deployment Guide

1️⃣ Environment Preparation

# Ensure Ansible is installed
ansible --version

# Configure SSH key authentication
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id root@target_server_IP

2️⃣ Download the Project

# Enter the project directory
cd 03_RHEL8_Network_Latency_Automated_Analysis

3️⃣ Configure Host Inventory

Edit the inventory file to add your target server:

[RHEL8]
Your_Server_IP_Address

4️⃣ Configure Network Interface

Edit the group_vars/all.yml file to specify the network interface to analyze:

network_interface: "ens33"  # Change to your network card name

5️⃣ One-Click Execution

# Execute full analysis
ansible-playbook playbook.yml -i inventory -v

# Or execute step by step
ansible-playbook playbook.yml -i inventory --tags network_interface
ansible-playbook playbook.yml -i inventory --tags system_diagnosis
ansible-playbook playbook.yml -i inventory --tags traffic_capture
ansible-playbook playbook.yml -i inventory --tags report_generation

6️⃣ View Analysis Results

# View the generated report
cat /tmp/network_analysis_reports/network_analysis_*.txt

# View traffic capture files (if enabled)
ls -la /var/tmp/packet_capture_*.pcap
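
When many hosts have been analyzed, reading each report with cat gets tedious. Because the report consists of "Key: value" lines, a single field can be pulled out with awk. The helper `report_field` below is a hypothetical sketch, and it assumes the key names used in the report template shown earlier.

```shell
#!/bin/sh
# Print the value of the first "Key: value" line whose key matches exactly.
report_field() {
    key=$1
    file=$2
    awk -v key="$key" -F': ' '$1 == key { sub(/^[^:]*: /, ""); print; exit }' "$file"
}

# Summarize every report in the default output directory.
for report in /tmp/network_analysis_reports/network_analysis_*.txt; do
    [ -e "$report" ] || continue
    echo "$(report_field 'Host Name' "$report"): $(report_field 'Interface Status' "$report")"
done
```

Only the text up to the first ": " is treated as the key, so values that themselves contain colons, such as the system load line, are returned whole.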

🎯 Step-by-Step Execution Guide

If you only want to execute specific diagnostic functions, you can use tags:

# Only check network interface
ansible-playbook playbook.yml -i inventory --tags network_interface

# Only perform system diagnosis
ansible-playbook playbook.yml -i inventory --tags system_diagnosis

# Only capture traffic
ansible-playbook playbook.yml -i inventory --tags traffic_capture

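# Only generate report
# (the template reads variables that the pre_tasks and the other roles set during
# the same run, so on its own this only works if those variables are already defined)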
ansible-playbook playbook.yml -i inventory --tags report_generation

🔥 Core Feature Highlights

✅ Comprehensive Network Interface Check

- Automatically check interface status (UP/DOWN)
- Obtain MTU configuration information
- Parse IP address configuration
- Count error and dropped packets

✅ In-Depth System-Level Diagnosis

- Analyze TCP connection status
- Search system logs for keywords
- Check system load
- Monitor memory usage

✅ Intelligent Traffic Capture

- Automatically install the tcpdump tool
- Configurable capture duration and packet count
- Generate standard pcap format files
- Support in-depth analysis with Wireshark

✅ Professional Report Generation

- Structured diagnostic report
- Severity classification of issues
- Targeted solution suggestions
- Timestamp and host information

✅ Highly Configurable

- Variable network interface configuration
- Adjustable diagnosis thresholds
- Flexible system log keywords
- Optional traffic capture feature

✅ Comprehensive Error Handling

- Idempotent design
- Fault tolerance for failed tasks
- Timeout control mechanism
- Retry mechanism support

💡 Tips for Use

🎯 Batch Diagnosis

# Add multiple servers in inventory
[RHEL8]
server1 ansible_host=192.168.1.100
server2 ansible_host=192.168.1.101
server3 ansible_host=192.168.1.102

# Run against up to 10 hosts in parallel (Ansible's default is 5 forks)
ansible-playbook playbook.yml -i inventory --forks 10

🔧 Custom Configuration

Edit the group_vars/all.yml file to adjust according to your environment:

- Change network interface name
- Adjust diagnosis thresholds
- Add custom log keywords
- Configure traffic capture parameters

🐛 Troubleshooting

If you encounter issues, check the repair_instructions.md file, which contains solutions and repair records for common problems.

🎁 Summary

This RHEL8 automated analysis solution for network latency truly achieves:

- 🔍 Comprehensive Diagnosis: Analyzing everything from hardware to system, from interfaces to traffic
- 🚀 One-Click Execution: Automating all diagnostic steps without manual intervention
- 📊 Professional Reports: Generating structured diagnostic reports that make issues clear
- 🔧 Highly Customizable: Variable configuration to adapt to different network environments
- 📈 Batch Processing: Supporting multi-host parallel diagnosis

What are you waiting for? Download this automated diagnosis solution now and boost your network troubleshooting efficiency by 10 times!

👉 Do you find the Playbook in the article not detailed enough? Want to see super detailed Chinese comments for every step and understand the meaning behind each line of code?

👉 If interested, please message me to get it 👈

Tags: #Ansible #Automation #NetworkDiagnosis #RHEL8 #NetworkLatency #OperationalEfficiency
