Ansible Emergency Hotline Series (24): Automated Diagnosis of Memory OOM

💥 Ansible Emergency Hotline | Tired of Troubleshooting Out of Memory (OOM) Errors? One-Click Automated Diagnosis Turns You into a Memory Expert!

Are you still struggling with tedious Out of Memory (OOM) troubleshooting? Today, we bring you an automated analysis solution for OOM issues on RHEL 8/9 and CentOS 8/9, so you can say goodbye to the nightmare of typing every command by hand!

🎯 Directly Addressing Pain Points

The daily routine of an operations engineer: the system suddenly freezes → manually checking OOM logs → checking memory usage → analyzing process memory consumption → checking swap usage → troubleshooting memory hardware → analyzing system load… After a series of actions, several hours have passed, and the problem may still be unclear.

Worse still, OOM issues often lead to system crashes and service interruptions, and manual troubleshooting easily overlooks key information: without a systematic analysis, it is hard to locate the root cause quickly. What if an automated OOM diagnosis could collect all of this evidence for you in a single run?

✨ Solution Preview

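Today, we share an Ansible-based automated analysis solution for OOM issues on RHEL 8/9 and CentOS 8/9. It runs nine collection tasks (OOM killer logs, killed-process memory usage, SAR memory, swap and CPU history, vmstat, /proc/meminfo, free, and dmidecode) and assembles the results into one report, making your memory troubleshooting standardized and automated!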

Results Preview

🧾 Sample Original Diagnosis Report (results only)

======== OOM Diagnosis Report ======
Hostname: server.example.com
Date: 2025-09-22 20:13:01

--- OOM Killer Logs ---
Sep 22 15:30:15 server kernel: Out of memory: Kill process 12345 (java) score 789 or sacrifice child
Sep 22 15:30:15 server kernel: Killed process 12345 (java) total-vm:2048576kB, anon-rss:1024000kB, file-rss:0kB, shmem-rss:0kB

--- Processes Killed Memory Usage ---
Sep 22 15:30:15 server kernel: Killed process 12345 (java) total-vm:2048576kB, anon-rss:1024000kB, file-rss:0kB, shmem-rss:0kB
Sep 22 15:30:15 server kernel: Killed process 12346 (mysql) total-vm:1048576kB, anon-rss:512000kB, file-rss:0kB, shmem-rss:0kB

--- SAR Memory Usage (%commit) ---
Linux 5.14.0-427.13.1.el9_4.x86_64 (server)  09/22/2025  _x86_64_  (4 CPU)

15:25:01    kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
15:30:01      102400    204800   3072000     96.8     51200    102400   4096000    128.5   2048000    512000     1024
15:35:01       51200    153600   3123200     98.4     51200    102400   4096000    128.5   2048000    512000     1024

--- SAR Swap Usage ---
Linux 5.14.0-427.13.1.el9_4.x86_64 (server)  09/22/2025  _x86_64_  (4 CPU)

15:25:01    kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
15:30:01     1024000    512000     33.3     256000     50.0
15:35:01      512000   1024000     66.7     512000     50.0

--- vmstat Output ---
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  1 1024000  51200  51200 102400   0    0     0     0  100  200 15  5 75  5  0
 1  2 1024000  25600  51200 102400   0    0     0     0  120  250 20  8 65  7  0

--- Total Memory from /proc/meminfo ---
MemTotal:        4194304 kB

--- Free Memory from free -m ---
              total        used        free      shared  buff/cache   available
Mem:           4096        3072         100         200         924         800
Swap:          2048        1024        1024

--- Memory Hardware Info (dmidecode -t memory) ---
Memory Device
	Array Handle: 0x0001
	Error Information Handle: Not Provided
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 2048 MB
	Form Factor: DIMM
	Locator: DIMM_A1
	Bank Locator: BANK_A
=======================================
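Because every section of the report starts with a "--- Title ---" header, a single section can be pulled out for closer inspection. The following is a minimal sketch under that assumption; the helper name report_section is hypothetical and not part of the playbook.

```shell
# Hypothetical helper: print the body of one report section.
# Reads the report on stdin; $1 is the title between the "--- ... ---" markers.
report_section() {
  awk -v want="--- $1 ---" '
    $0 == want   { show = 1; next }   # start printing after the requested header
    /^(---|===)/ { show = 0 }         # any other header or the closing rule ends the section
    show && NF   { print }            # print non-empty lines inside the section
  '
}

# Demo on a two-section excerpt of the sample report: only the swap row is printed.
report_section "SAR Swap Usage" <<'EOF'
--- SAR Swap Usage ---
15:30:01     1024000    512000     33.3     256000     50.0

--- vmstat Output ---
 2  1 1024000  51200  51200 102400   0    0     0     0  100  200 15  5 75  5  0
EOF
```

On a target host the same function could be fed the real file, for example: report_section "OOM Killer Logs" < /var/tmp/oom_diagnosis/oom_report.txt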

🤔 Design Philosophy: Why is Our Playbook a Best Practice?

A professional automation solution is not just a simple pile of commands. Our design philosophy incorporates the core practices advocated by Red Hat, allowing your automation solution to leap from “it works” to “professional and reliable”!

1. Comprehensive Memory Diagnosis, No Issues Escape ✨

We adopt a multi-dimensional memory analysis strategy, covering all aspects of memory issues from OOM logs to memory hardware, from real-time status to historical trends. This includes OOM killer logs, process memory usage, SAR historical data, vmstat real-time status, memory hardware information, etc., ensuring no memory issue escapes!

2. Variable-Driven, Flexible Adaptation 💻

We centralize all configurable parameters (the report directory and the report filename) in the vars section at the top of the Playbook. This means that when you need to change where the report is written, you only need to modify these variables without touching any core automation task logic.

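3. Idempotency Assurance, Safe and Worry-Free ✅

Our Playbook follows Ansible's core principle of idempotency wherever it manages state: the report directory is created only if it is missing. The diagnostic tasks themselves run read-only commands, so you can confidently execute this Playbook repeatedly without altering the system. Keep in mind that ansible.builtin.shell tasks always report "changed" unless changed_when: false is set on them, and that the report file is rewritten on every run because it embeds the current timestamp.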

4. Closed-Loop Verification, Visible Results 🎯

The final tasks of the Playbook generate a complete diagnostic report and print its location. This forms a check-analyze-report closed loop. Not only do you execute automation, but you can also immediately see the diagnostic results, ensuring that the problem is under control!

⭐ Automation Scenario Scoring

Scoring Dimension | Score | Description
Ease of Use | ⭐⭐⭐⭐ | One-click execution, detailed comments, beginner-friendly
Reusability | ⭐⭐⭐⭐⭐ | Variable configuration, supports multi-host parallel execution
Stability | ⭐⭐⭐⭐⭐ | Idempotent design, comprehensive error handling
Scalability | ⭐⭐⭐⭐ | Modular design, easy to extend functionality
Best Practice Compliance | ⭐⭐⭐⭐⭐ | Follows Ansible best practices, code standards

🗂️ Project Directory Structure

10_Memory Out of Memory (OOM) Automated Diagnosis/
├── troubleshooting03_oom.yml    # Main diagnostic Playbook

📄 Core File Content Overview

🎯 Main Diagnostic Playbook (troubleshooting03_oom.yml)

---
- name: "Diagnose OOM Issues on RHEL Servers"
  hosts: all
  become: yes
  gather_facts: yes

  vars:
    report_dir: "/var/tmp/oom_diagnosis"
    report_file: "oom_report.txt"

  tasks:

    - name: "Ensure report directory exists"
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        owner: root
        group: root
        mode: '0755'

    # The collection tasks below only read system state, so changed_when: false
    # keeps them from being reported as changes on every run.
    - name: "Collect OOM killer logs"
      ansible.builtin.shell: |
        grep -E 'Out of memory:' /var/log/messages || true
      register: oom_killer_logs
      changed_when: false
      ignore_errors: yes

    - name: "Collect killed process memory usage logs"
      ansible.builtin.shell: |
        grep -E 'total-vm' /var/log/messages || true
      register: oom_process_logs
      changed_when: false
      ignore_errors: yes

    # sa$(date +%d) is today's sysstat data file, e.g. /var/log/sa/sa22
    - name: "Check SAR memory usage (%commit)"
      ansible.builtin.shell: |
        sar -r -f /var/log/sa/sa$(date +%d) 2>/dev/null || true
      register: sar_memory
      changed_when: false
      ignore_errors: yes

    - name: "Check SAR swap usage"
      ansible.builtin.shell: |
        sar -S -f /var/log/sa/sa$(date +%d) 2>/dev/null || true
      register: sar_swap
      changed_when: false
      ignore_errors: yes

    - name: "Check SAR CPU usage"
      ansible.builtin.shell: |
        sar -f /var/log/sa/sa$(date +%d) 2>/dev/null || true
      register: sar_cpu
      changed_when: false
      ignore_errors: yes

    # Plain commands without pipes or redirection do not need a shell.
    - name: "Check vmstat (short run)"
      ansible.builtin.command: vmstat 1 5
      register: vmstat_out
      changed_when: false
      ignore_errors: yes

    - name: "Collect total memory from /proc/meminfo"
      ansible.builtin.command: grep MemTotal /proc/meminfo
      register: mem_total
      changed_when: false
      ignore_errors: yes

    - name: "Collect free memory using free -m"
      ansible.builtin.command: free -m
      register: free_mem
      changed_when: false
      ignore_errors: yes

    - name: "Collect memory hardware info using dmidecode"
      ansible.builtin.command: dmidecode -t memory
      register: dmidecode_mem
      changed_when: false
      ignore_errors: yes

    - name: "Assemble OOM Diagnosis Report"
      ansible.builtin.copy:
        dest: "{{ report_dir }}/{{ report_file }}"
        content: |
          ======== OOM Diagnosis Report ======
          Hostname: {{ inventory_hostname }}
          Date: {{ ansible_date_time.date }} {{ ansible_date_time.time }}

          --- OOM Killer Logs ---
          {{ oom_killer_logs.stdout | default('N/A', true) }}

          --- Processes Killed Memory Usage ---
          {{ oom_process_logs.stdout | default('N/A', true) }}

          --- SAR Memory Usage (%commit) ---
          {{ sar_memory.stdout | default('N/A', true) }}

          --- SAR Swap Usage ---
          {{ sar_swap.stdout | default('N/A', true) }}

          --- SAR CPU Usage ---
          {{ sar_cpu.stdout | default('N/A', true) }}

          --- vmstat Output ---
          {{ vmstat_out.stdout | default('N/A', true) }}

          --- Total Memory from /proc/meminfo ---
          {{ mem_total.stdout | default('N/A', true) }}

          --- Free Memory from free -m ---
          {{ free_mem.stdout | default('N/A', true) }}

          --- Memory Hardware Info (dmidecode -t memory) ---
          {{ dmidecode_mem.stdout | default('N/A', true) }}

          =======================================

    - name: "Debug - Report file location"
      ansible.builtin.debug:
        msg: "OOM diagnosis report generated at {{ report_dir }}/{{ report_file }}"
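The three SAR tasks above read only today's data file (sa followed by the day of the month), so an OOM event from an earlier day will not appear in the SAR sections. The following is a minimal sketch of how the file name for another day could be derived, assuming GNU date and the default sysstat layout under /var/log/sa; the function name sar_file_for is hypothetical.

```shell
# Hypothetical helper: print the sysstat data file for a given day.
# $1 is any date string GNU date understands (default: today).
sar_file_for() {
  local day
  day="$(date -d "${1:-today}" +%d)" || return 1
  printf '/var/log/sa/sa%s\n' "$day"
}

sar_file_for 2025-09-22   # prints /var/log/sa/sa22
```

On a host this could be combined with sar, for example: sar -r -f "$(sar_file_for yesterday)". How far back the files reach depends on the sysstat retention settings of the host.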

🛠️ Foolproof Deployment Guide

Seeing it a thousand times in theory is not as good as doing it once!

Prerequisites

1. One Ansible control node.
2. The target servers are reachable over SSH from the control node (passwordless login configured), and the user executing Ansible has sudo permissions.
3. The control node has Ansible installed.

Project Directory Structure

This is a very simple project; you only need two files!

10_Memory Out of Memory (OOM) Automated Diagnosis/
├── troubleshooting03_oom.yml    # Main diagnostic Playbook
└── inventory                    # Host inventory (needs to be created)

How to Use?

1. Create Host Inventory 📝: Create an inventory file and fill in your server hostnames or IP addresses.

[all]
server1.example.com
server2.example.com
# or
# 192.168.1.100
# 192.168.1.101

2. Modify Variables ✏️: Open the troubleshooting03_oom.yml file and modify the variable section according to your needs, such as report directory, report filename, etc.

3. Execute Automation ▶️: Run the following command, then go make a cup of coffee ☕️!

ansible-playbook -i inventory troubleshooting03_oom.yml

🔍 Diagnostic Coverage

✅ OOM Log Analysis

• OOM Killer Logs: Collect detailed logs triggered by the system OOM killer
• Process Memory Usage: Analyze the memory usage of killed processes
• Timestamp Analysis: Determine the exact time the OOM event occurred
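The "Killed process" lines already carry the timestamp, PID, process name, and resident memory, so they can be condensed into one short line per kill. The following is a minimal sketch, assuming the classic three-field syslog timestamp shown in the sample report and process names without spaces; the function name summarize_oom_kills is hypothetical.

```shell
# Hypothetical helper: condense "Killed process" kernel lines (stdin) into
# timestamp, PID, process name and anonymous RSS in MiB.
summarize_oom_kills() {
  awk '/Killed process [0-9]+/ {
    pid = ""; name = ""; rss = 0
    for (i = 2; i <= NF; i++) {
      if ($(i - 1) == "Killed" && $i == "process") { pid = $(i + 1); name = $(i + 2) }
      if ($i ~ /^anon-rss:/) { rss = $i }
    }
    gsub(/[()]/, "", name)      # "(java)" -> "java"
    gsub(/[^0-9]/, "", rss)     # "anon-rss:1024000kB," -> "1024000"
    printf "%s %s %s pid=%s name=%s anon_rss_mib=%d\n", $1, $2, $3, pid, name, rss / 1024
  }'
}

# Demo with the two lines from the sample report.
summarize_oom_kills <<'EOF'
Sep 22 15:30:15 server kernel: Killed process 12345 (java) total-vm:2048576kB, anon-rss:1024000kB, file-rss:0kB, shmem-rss:0kB
Sep 22 15:30:15 server kernel: Killed process 12346 (mysql) total-vm:1048576kB, anon-rss:512000kB, file-rss:0kB, shmem-rss:0kB
EOF
```

For the sample lines this reports 1000 MiB for java and 500 MiB for mysql, which makes the largest victim easy to spot.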

✅ Historical Memory Usage Analysis

• SAR Memory Usage: Analyze historical memory usage trends and %commit metrics
• SAR Swap Usage: Analyze swap usage and trends
• SAR CPU Usage: Correlate CPU usage to analyze system load
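The %commit column of sar -r is the memory needed for the current workload as a percentage of RAM plus swap, so values above 100 mean the system has promised more memory than it has. The following is a minimal sketch that lists such samples, assuming 24-hour timestamps in the first column; the function name flag_overcommit is hypothetical, and the column is located by its header because the layout differs between sysstat versions.

```shell
# Hypothetical helper: list sar -r samples whose %commit exceeds a limit.
# $1 is the limit in percent (default 100); sar output is read from stdin.
flag_overcommit() {
  awk -v limit="${1:-100}" '
    $1 == "Average:" { next }                      # skip the summary row
    {
      for (i = 1; i <= NF; i++)
        if ($i == "%commit") { col = i; next }     # header row: remember the column
    }
    col && NF >= col && $col ~ /^[0-9]+(\.[0-9]+)?$/ && $col + 0 > limit + 0 {
      print $1, "%commit=" $col
    }
  '
}

# Demo with the rows from the sample report: both samples are above 100.
flag_overcommit 100 <<'EOF'
15:25:01    kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
15:30:01      102400    204800   3072000     96.8     51200    102400   4096000    128.5   2048000    512000     1024
15:35:01       51200    153600   3123200     98.4     51200    102400   4096000    128.5   2048000    512000     1024
EOF
```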

✅ Real-Time Memory Status Check

• vmstat Output: Real-time memory, swap, and I/O status check
• Total Memory: Get total memory from /proc/meminfo
• Available Memory: Get current memory usage using free -m
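When reading the free -m section, the "available" column is usually a better indicator of remaining headroom than "free", because it also counts memory the kernel can reclaim. The following is a minimal sketch that turns it into a percentage, assuming the six-column layout shown in the sample report and untranslated (C locale) labels; the function name mem_available_pct is hypothetical.

```shell
# Hypothetical helper: percentage of RAM that free reports as "available".
# Reads free -m output on stdin; column 7 of the Mem: row is "available".
mem_available_pct() {
  awk '$1 == "Mem:" && $2 > 0 { printf "%.1f\n", $7 * 100 / $2 }'
}

# Demo with the values from the sample report: 800 of 4096 MiB.
mem_available_pct <<'EOF'
              total        used        free      shared  buff/cache   available
Mem:           4096        3072         100         200         924         800
Swap:          2048        1024        1024
EOF
# prints 19.5
```

On a live host the same function could be used as: free -m | mem_available_pct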

✅ Memory Hardware Information

• Hardware Details: Get detailed memory hardware information using dmidecode
• Memory Configuration: Analyze memory module configuration and capacity information
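One quick use of the dmidecode section is to add up the installed modules and compare the total with MemTotal. The following is a minimal sketch, assuming module sizes are reported in MB or GB on lines whose first word is "Size:"; the function name total_dimm_mib is hypothetical.

```shell
# Hypothetical helper: count populated memory modules and sum their sizes in MiB.
# Reads dmidecode -t memory output on stdin; empty slots ("No Module Installed")
# and other *Size fields (e.g. "Volatile Size:") are ignored.
total_dimm_mib() {
  awk '
    $1 == "Size:" && $2 ~ /^[0-9]+$/ {
      if ($3 == "GB") total += $2 * 1024
      else if ($3 == "MB") total += $2
      count++
    }
    END { printf "modules=%d total_mib=%d\n", count, total }
  '
}

# Demo with the excerpt from the sample report.
total_dimm_mib <<'EOF'
Memory Device
    Total Width: 64 bits
    Size: 2048 MB
    Locator: DIMM_A1
EOF
# prints modules=1 total_mib=2048
```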

✅ Intelligent Report Generation

• Structured Report: Generate a diagnostic report containing all key information
• Timestamp Logging: Log the execution time of the diagnosis
• Host Information: Include hostname and system information

✅ Comprehensive Error Handling

• Idempotent design
• Error tolerance for failed tasks (ignore_errors: yes)
• Automatic creation of report directory
• Default value handling (the default('N/A') filter)

💡 Usage Tips

🎯 Batch Diagnosis

# Add multiple servers in the inventory
[all]
server1 ansible_host=192.168.1.100
server2 ansible_host=192.168.1.101
server3 ansible_host=192.168.1.102

# Run against up to 10 hosts in parallel (Ansible's default is 5 forks)
ansible-playbook troubleshooting03_oom.yml -i inventory --forks 10

🔧 Custom Configuration

Edit the troubleshooting03_oom.yml file to adjust according to your environment:

• Modify report output directory
• Adjust report filename
• Customize diagnosis scope

🐛 Troubleshooting

If you encounter issues, check the generated diagnostic report:

• Report location (on each target host): /var/tmp/oom_diagnosis/oom_report.txt
• Contains complete OOM diagnostic information
• Provides clues for memory issue analysis

⚠️ Reminder on the Importance of OOM Issues

OOM issues often lead to system crashes and service interruptions; it is recommended to:

• Regularly check memory usage
• Set memory usage alerts
• Establish standard procedures for handling OOM issues
• Monitor memory usage of critical processes
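As an illustration of the alerting recommendation, the following is a minimal sketch of a check that could be run from cron or a monitoring agent. It assumes /proc/meminfo provides MemAvailable (as it does on RHEL 8/9 kernels); the function name check_mem_available, the 10 percent default, and the exit codes are hypothetical choices, not part of the playbook.

```shell
# Hypothetical alert check: succeed while MemAvailable is at least $1 percent
# of MemTotal (default 10), otherwise print CRITICAL and return 2.
# Reads /proc/meminfo-style input on stdin.
check_mem_available() {
  awk -v min="${1:-10}" '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { print "UNKNOWN: MemTotal not found"; exit 3 }
      pct = avail * 100 / total
      if (pct < min + 0) { printf "CRITICAL: %.1f%% of memory available\n", pct; exit 2 }
      printf "OK: %.1f%% of memory available\n", pct
    }
  '
}

# Demo with sample values; on a real host use: check_mem_available 10 < /proc/meminfo
check_mem_available 10 <<'EOF'
MemTotal:        4194304 kB
MemAvailable:    1048576 kB
EOF
# prints OK: 25.0% of memory available
```

Returning a non-zero status instead of only printing a message lets the caller decide how to raise the alert.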

🎯 Advanced Usage

Custom Diagnosis Scope

# Check only specific hosts
ansible-playbook troubleshooting03_oom.yml -i inventory --limit server1

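# Skip certain checks (this only has an effect once the matching task carries a tag,
# e.g. add "tags: hardware_check" to the dmidecode task; the Playbook as listed defines no tags)
ansible-playbook troubleshooting03_oom.yml -i inventory --skip-tags "hardware_check"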

Output Format Customization

# Detailed output mode
ansible-playbook troubleshooting03_oom.yml -i inventory -v

# More verbose output mode
ansible-playbook troubleshooting03_oom.yml -i inventory -vvv

Custom Variable Override

# Override default variables
ansible-playbook troubleshooting03_oom.yml -i inventory -e "report_dir=/tmp/custom_oom"


🎁 Summary

This automated analysis solution for OOM issues on RHEL8/9 & CentOS8/9 truly achieves:

• 🔍 Comprehensive Diagnosis: Analyzing all aspects from OOM logs to memory hardware, from real-time status to historical trends
• 🚀 One-Click Execution: Automating all diagnostic steps without manual intervention
• 📊 Intelligent Reporting: Generating structured diagnostic reports, making memory issues clear at a glance
• 🔧 Highly Customizable: Variable configuration to adapt to different memory environments
• 📈 Batch Processing: Supporting parallel diagnosis across multiple hosts, doubling efficiency
• 🛡️ Safe and Reliable: Read-only analysis, no modification of system configurations
• ⏰ Historical Analysis: Providing historical trend analysis combined with SAR data

What are you waiting for? Try this automated diagnosis solution and cut down the time you spend on OOM troubleshooting!

Tags:#Ansible #Automation Operations #OOM Diagnosis #Memory Management #RHEL8 #CentOS8 #System Failures #Operational Efficiency
