Ansible Emergency Hotline | Struggling with Out of Memory (OOM) Troubleshooting? One-Click Automated Diagnosis Turns You into a Memory Expert!
Are you still struggling with tedious Out of Memory (OOM) troubleshooting? Today, we bring you a comprehensive automated analysis solution for OOM issues on RHEL8/9 & CentOS8/9, so you can say goodbye to the nightmare of typing commands by hand!
Directly Addressing Pain Points
The daily routine of an operations engineer: the system suddenly freezes → manually check the OOM logs → check memory usage → analyze process memory consumption → check swap usage → troubleshoot memory hardware → analyze system load... After all of that, several hours have passed, and the problem may still be unclear.
Worse still, OOM issues often lead to system crashes and service interruptions, and manual troubleshooting easily overlooks key information, lacks systematic analysis, and fails to locate the root cause quickly. What if an automated OOM diagnosis solution could take these steps off your hands?
Solution Preview
Today, we share an automated analysis solution for OOM issues on RHEL8/9 & CentOS8/9 using Ansible. It includes 8 core diagnostic modules and makes your memory troubleshooting standardized, automated, and intelligent!
Results Preview
Sample Original Diagnosis Report (results only)
======== OOM Diagnosis Report ======
Hostname: server.example.com
Date: 2025-09-22 20:13:01
--- OOM Killer Logs ---
Sep 22 15:30:15 server kernel: Out of memory: Kill process 12345 (java) score 789 or sacrifice child
Sep 22 15:30:15 server kernel: Killed process 12345 (java) total-vm:2048576kB, anon-rss:1024000kB, file-rss:0kB, shmem-rss:0kB
--- Processes Killed Memory Usage ---
Sep 22 15:30:15 server kernel: Killed process 12345 (java) total-vm:2048576kB, anon-rss:1024000kB, file-rss:0kB, shmem-rss:0kB
Sep 22 15:30:15 server kernel: Killed process 12346 (mysql) total-vm:1048576kB, anon-rss:512000kB, file-rss:0kB, shmem-rss:0kB
--- SAR Memory Usage (%commit) ---
Linux 5.14.0-427.13.1.el9_4.x86_64 (server) 09/22/2025 _x86_64_ (4 CPU)
15:25:01 kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
15:30:01 102400 204800 3072000 96.8 51200 102400 4096000 128.5 2048000 512000 1024
15:35:01 51200 153600 3123200 98.4 51200 102400 4096000 128.5 2048000 512000 1024
--- SAR Swap Usage ---
Linux 5.14.0-427.13.1.el9_4.x86_64 (server) 09/22/2025 _x86_64_ (4 CPU)
15:25:01 kbswpfree kbswpused %swpused kbswpcad %swpcad
15:30:01 1024000 512000 33.3 256000 50.0
15:35:01 512000 1024000 66.7 512000 50.0
--- vmstat Output ---
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 1 1024000 51200 51200 102400 0 0 0 0 100 200 15 5 75 5 0
1 2 1024000 25600 51200 102400 0 0 0 0 120 250 20 8 65 7 0
--- Total Memory from /proc/meminfo ---
MemTotal: 4194304 kB
--- Free Memory from free -m ---
total used free shared buff/cache available
Mem: 4096 3072 100 200 924 800
Swap: 2048 1024 1024
--- Memory Hardware Info (dmidecode -t memory) ---
Memory Device
Array Handle: 0x0001
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Locator: DIMM_A1
Bank Locator: BANK_A
=======================================
Design Philosophy: Why is Our Playbook a Best Practice?
A professional automation solution is not just a simple pile of commands. Our design philosophy incorporates the core practices advocated by Red Hat, allowing your automation solution to leap from "it works" to "professional and reliable"!
1. Comprehensive Memory Diagnosis, No Issues Escape: We adopt a multi-dimensional memory analysis strategy that covers memory issues from OOM logs to memory hardware and from real-time status to historical trends. It includes OOM killer logs, process memory usage, SAR historical data, vmstat real-time status, and memory hardware information.
2. Variable-Driven, Flexible Adaptation: We centralize all configurable parameters (such as report directory and report filename) in the vars section at the top of the Playbook. When you need to change where the report is written, you only modify these variables without touching any core automation task logic.
3. Idempotency Assurance, Safe and Worry-Free: The Playbook follows Ansible's core principle of idempotency. The collection tasks only read system state, and the report directory and file are managed by the file and copy modules, so you can run the Playbook repeatedly and each run simply regenerates the report.
4. Closed-Loop Verification, Visible Results: The last step of the Playbook generates a complete diagnostic report, which closes the check-analyze-report loop. You not only run the automation but also see the diagnostic results immediately.
Automation Scenario Scoring
| Scoring Dimension | Score | Description |
|---|---|---|
| Ease of Use | ★★★★ | One-click execution, detailed comments, beginner-friendly |
| Reusability | ★★★★★ | Variable configuration, supports multi-host parallel execution |
| Stability | ★★★★★ | Idempotent design, comprehensive error handling |
| Scalability | ★★★★ | Modular design, easy to extend functionality |
| Best Practice Compliance | ★★★★★ | Follows Ansible best practices, code standards |
Project Directory Structure
10_Memory Out of Memory (OOM) Automated Diagnosis/
└── troubleshooting03_oom.yml    # Main diagnostic Playbook
Core File Content Overview
Main Diagnostic Playbook (troubleshooting03_oom.yml)
---
- name: "Diagnose OOM Issues on RHEL Servers"
  hosts: all
  become: yes
  gather_facts: yes
  vars:
    report_dir: "/var/tmp/oom_diagnosis"
    report_file: "oom_report.txt"
  tasks:
    - name: "Ensure report directory exists"
      ansible.builtin.file:
        path: "{{ report_dir }}"
        state: directory
        owner: root
        group: root
        mode: '0755'
    # The collection tasks below only read system state, so they never report "changed".
    - name: "Collect OOM killer logs"
      ansible.builtin.shell: |
        egrep 'Out of memory:' /var/log/messages || true
      register: oom_killer_logs
      changed_when: false
      ignore_errors: yes
    - name: "Collect killed process memory usage logs"
      ansible.builtin.shell: |
        egrep 'total-vm' /var/log/messages || true
      register: oom_process_logs
      changed_when: false
      ignore_errors: yes
    # sa$(date +%d) is today's sysstat data file.
    - name: "Check SAR memory usage (%commit)"
      ansible.builtin.shell: |
        sar -r -f /var/log/sa/sa$(date +%d) 2>/dev/null || true
      register: sar_memory
      changed_when: false
      ignore_errors: yes
    - name: "Check SAR swap usage"
      ansible.builtin.shell: |
        sar -S -f /var/log/sa/sa$(date +%d) 2>/dev/null || true
      register: sar_swap
      changed_when: false
      ignore_errors: yes
    - name: "Check SAR CPU usage"
      ansible.builtin.shell: |
        sar -f /var/log/sa/sa$(date +%d) 2>/dev/null || true
      register: sar_cpu
      changed_when: false
      ignore_errors: yes
    - name: "Check vmstat (short run)"
      ansible.builtin.shell: |
        vmstat 1 5
      register: vmstat_out
      changed_when: false
      ignore_errors: yes
    - name: "Collect total memory from /proc/meminfo"
      ansible.builtin.shell: |
        grep MemTotal /proc/meminfo
      register: mem_total
      changed_when: false
      ignore_errors: yes
    - name: "Collect free memory using free -m"
      ansible.builtin.shell: |
        free -m
      register: free_mem
      changed_when: false
      ignore_errors: yes
    # Tagged so that --skip-tags "hardware_check" can leave this task out.
    - name: "Collect memory hardware info using dmidecode"
      ansible.builtin.shell: |
        dmidecode -t memory
      register: dmidecode_mem
      changed_when: false
      ignore_errors: yes
      tags: hardware_check
    # default('N/A', true) also replaces empty output, not only undefined values.
    - name: "Assemble OOM Diagnosis Report"
      ansible.builtin.copy:
        dest: "{{ report_dir }}/{{ report_file }}"
        content: |
          ======== OOM Diagnosis Report ======
          Hostname: {{ inventory_hostname }}
          Date: {{ ansible_date_time.date }} {{ ansible_date_time.time }}
          --- OOM Killer Logs ---
          {{ oom_killer_logs.stdout | default('N/A', true) }}
          --- Processes Killed Memory Usage ---
          {{ oom_process_logs.stdout | default('N/A', true) }}
          --- SAR Memory Usage (%commit) ---
          {{ sar_memory.stdout | default('N/A', true) }}
          --- SAR Swap Usage ---
          {{ sar_swap.stdout | default('N/A', true) }}
          --- SAR CPU Usage ---
          {{ sar_cpu.stdout | default('N/A', true) }}
          --- vmstat Output ---
          {{ vmstat_out.stdout | default('N/A', true) }}
          --- Total Memory from /proc/meminfo ---
          {{ mem_total.stdout | default('N/A', true) }}
          --- Free Memory from free -m ---
          {{ free_mem.stdout | default('N/A', true) }}
          --- Memory Hardware Info (dmidecode -t memory) ---
          {{ dmidecode_mem.stdout | default('N/A', true) }}
          =======================================
    - name: "Debug - Report file location"
      ansible.builtin.debug:
        msg: "OOM diagnosis report generated at {{ report_dir }}/{{ report_file }}"
Foolproof Deployment Guide
Seeing it a thousand times in theory is not as good as doing it once!
Prerequisites
1. One Ansible control node.
2. The target servers are configured with SSH trust, and the user executing Ansible has sudo permissions.
3. The control node has Ansible installed.
Project Directory Structure
This is a very simple project; you only need a few files!
10_Memory Out of Memory (OOM) Automated Diagnosis/
├── troubleshooting03_oom.yml    # Main diagnostic Playbook
└── inventory                    # Host inventory (needs to be created)
How to Use?
1. Create Host Inventory: Create an inventory file and fill in your server hostnames or IP addresses.
[all]
server1.example.com
server2.example.com
# or
# 192.168.1.100
# 192.168.1.101
2. Modify Variables: Open the troubleshooting03_oom.yml file and modify the variable section according to your needs, such as report directory and report filename.
3. Execute Automation: Run the following command, then go make a cup of coffee!
ansible-playbook -i inventory troubleshooting03_oom.yml
Diagnostic Coverage
OOM Log Analysis
• OOM Killer Logs: Collect detailed logs triggered by the system OOM killer
• Process Memory Usage: Analyze the memory usage of killed processes
• Timestamp Analysis: Determine the exact time the OOM event occurred
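
To make the killed-process lines easier to scan, the log analysis above could be post-processed with a small shell helper. This is only a minimal sketch: the function name summarize_oom_kills is hypothetical (it is not part of the Playbook), and it assumes the traditional syslog timestamp layout and process names without spaces, as in the sample report.

```shell
# Summarize "Killed process" lines from a kernel log read on stdin.
# Prints: timestamp, PID, process name, anonymous RSS in MiB.
# Assumption: "Mon DD HH:MM:SS" syslog timestamps and no spaces in process names.
summarize_oom_kills() {
  awk '
    /Killed process/ {
      pid = ""; name = ""; rss_kb = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "process") { pid = $(i + 1); name = $(i + 2) }
        if ($i ~ /^anon-rss:/) {
          rss_kb = $i
          gsub(/[^0-9]/, "", rss_kb)   # keep digits only, e.g. "anon-rss:1024000kB," becomes "1024000"
        }
      }
      gsub(/[()]/, "", name)           # "(java)" becomes "java"
      printf "%s %s %s pid=%s name=%s anon_rss_mib=%d\n", $1, $2, $3, pid, name, rss_kb / 1024
    }
  '
}
```

On a target host it could be fed with `summarize_oom_kills < /var/log/messages`; each kill then shows up as a single line with its time, PID, name, and resident memory.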
Historical Memory Usage Analysis
• SAR Memory Usage: Analyze historical memory usage trends and %commit metrics
• SAR Swap Usage: Analyze swap usage and trends
• SAR CPU Usage: Correlate CPU usage to analyze system load
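
The %commit metric is the one worth watching in the SAR memory data: a value above 100 means the kernel has committed more memory than RAM plus swap can back. The sketch below shows one way to pull such samples out of `sar -r` output. The function name flag_overcommit and the default limit of 100 are assumptions made for illustration.

```shell
# Print sar -r samples whose %commit exceeds a limit (default 100).
# The %commit column is located by name in the header row, so the
# function does not rely on a fixed column position.
flag_overcommit() {
  local limit="${1:-100}"
  awk -v limit="$limit" '
    col == 0 {
      # Still looking for the header row that names the columns.
      for (i = 1; i <= NF; i++) if ($i == "%commit") col = i
      next
    }
    # Data rows only: skip repeated headers and blank lines.
    $col ~ /^[0-9.]+$/ && $col + 0 > limit + 0 { print $1, "%commit=" $col }
  '
}
```

A possible invocation on a target host is `sar -r -f /var/log/sa/sa$(date +%d) | flag_overcommit 100`, which lists only the sampling times at which the system was overcommitted.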
Real-Time Memory Status Check
• vmstat Output: Real-time memory, swap, and I/O status check
• Total Memory: Get total memory from /proc/meminfo
• Available Memory: Get current memory usage using free -m
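
In the vmstat output, sustained non-zero si (swap-in) and so (swap-out) values are a common sign of memory pressure. As a rough sketch, the helper below reports the peak of both columns; swap_activity is a hypothetical name, and the columns are found by their header labels rather than by position.

```shell
# Report the peak swap-in (si) and swap-out (so) rates found in vmstat output on stdin.
swap_activity() {
  awk '
    si == 0 {
      # Still looking for the header row that contains the column labels.
      for (i = 1; i <= NF; i++) {
        if ($i == "si") si = i
        if ($i == "so") so = i
      }
      next
    }
    $si ~ /^[0-9]+$/ {
      if ($si + 0 > max_si) max_si = $si + 0
      if ($so + 0 > max_so) max_so = $so + 0
    }
    END { printf "max_si=%d max_so=%d\n", max_si, max_so }
  '
}
```

For example, `vmstat 1 5 | swap_activity` condenses the five samples into one line. Keep in mind that the first vmstat sample reports averages since boot, so it may differ noticeably from the later ones.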
Memory Hardware Information
• Hardware Details: Get detailed memory hardware information using dmidecode
• Memory Configuration: Analyze memory module configuration and capacity information
Intelligent Report Generation
• Structured Report: Generate a diagnostic report containing all key information
• Timestamp Logging: Log the execution time of the diagnosis
• Host Information: Include hostname and system information
Comprehensive Error Handling
• Idempotent design
• Error tolerance for failed tasks (ignore_errors: yes)
• Automatic creation of report directory
• Default value handling (default('N/A'))
Usage Tips
Batch Diagnosis
# Add multiple servers in the inventory
[all]
server1 ansible_host=192.168.1.100
server2 ansible_host=192.168.1.101
server3 ansible_host=192.168.1.102
# Execute in parallel, doubling efficiency
ansible-playbook troubleshooting03_oom.yml -i inventory --forks 10
Custom Configuration
Edit the troubleshooting03_oom.yml file to adjust according to your environment:
• Modify report output directory
• Adjust report filename
• Customize diagnosis scope
Troubleshooting
If you encounter issues, check the generated diagnostic report:
• Report location: /var/tmp/oom_diagnosis/oom_report.txt (written on each target host)
• Contains complete OOM diagnostic information
• Provides clues for memory issue analysis
Reminder on the Importance of OOM Issues
OOM issues often lead to system crashes and service interruptions; it is recommended to:
• Regularly check memory usage
• Set memory usage alerts
• Establish standard procedures for handling OOM issues
• Monitor memory usage of critical processes
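
As one possible starting point for the memory usage alert recommended above, the following sketch compares MemAvailable with MemTotal and returns a non-zero exit code when the available share drops below a threshold. The function name check_mem_available and the 10% default are assumptions, not values taken from the Playbook.

```shell
# Exit 1 when MemAvailable falls below a percentage of MemTotal, 0 otherwise.
# Arguments: minimum available percent (default 10), meminfo path (default /proc/meminfo).
check_mem_available() {
  local min_pct="${1:-10}" meminfo="${2:-/proc/meminfo}"
  awk -v min="$min_pct" '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { print "MemTotal not found"; exit 2 }
      pct = avail * 100 / total
      printf "available=%.1f%% threshold=%d%%\n", pct, min
      exit (pct < min ? 1 : 0)
    }
  ' "$meminfo"
}
```

Because the result is carried in the exit code, a call such as `check_mem_available 10 || echo "low memory on $(hostname)"` can be dropped into a cron job or wrapped by whatever alerting tool you already use.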
Advanced Usage
Custom Diagnosis Scope
# Check only specific hosts
ansible-playbook troubleshooting03_oom.yml -i inventory --limit server1
# Skip certain checks (only affects tasks that carry the given tag, e.g. a dmidecode task tagged hardware_check)
ansible-playbook troubleshooting03_oom.yml -i inventory --skip-tags "hardware_check"
Output Format Customization
# Detailed output mode
ansible-playbook troubleshooting03_oom.yml -i inventory -v
# Super detailed output mode
ansible-playbook troubleshooting03_oom.yml -i inventory -vvv
Custom Variable Override
# Override default variables
ansible-playbook troubleshooting03_oom.yml -i inventory -e "report_dir=/tmp/custom_oom"
Surprise Time! Get the Complete Annotated Version!
Do you find the Playbook above not detailed enough? Want to dig into the logic behind each line of code and the best practices recommended by the official documentation?
The annotated version helps you not only use the Playbook but also adapt it to other scenarios, so you can become the most outstanding Ansible automation expert in your team!
Click "Read the Original" below to get the packaged download of the Playbook project with complete annotations and syntax highlighting!
Summary
This automated analysis solution for OOM issues on RHEL8/9 & CentOS8/9 truly achieves:
• Comprehensive Diagnosis: Analyzing all aspects from OOM logs to memory hardware, from real-time status to historical trends
• One-Click Execution: Automating all diagnostic steps without manual intervention
• Intelligent Reporting: Generating structured diagnostic reports, making memory issues clear at a glance
• Highly Customizable: Variable configuration to adapt to different memory environments
• Batch Processing: Supporting parallel diagnosis across multiple hosts, doubling efficiency
• Safe and Reliable: Read-only analysis, no modification of system configurations
• Historical Analysis: Providing historical trend analysis combined with SAR data
What are you waiting for? Download this automated diagnosis solution and boost your OOM troubleshooting efficiency by 10 times!
Tags: #Ansible #Automation Operations #OOM Diagnosis #Memory Management #RHEL8 #CentOS8 #System Failures #Operational Efficiency