Ansible Firefighting Hotline Series (26) Automated OS Performance Analysis

💥 Ansible Firefighting Hotline | Is OS Performance Analysis Too Complicated? One-Click PCP Automated Inspection Turns You into a Performance Expert!

Are you still struggling with system performance analysis? Manually running a bunch of commands like top, iostat, vmstat, free… leads to scattered information that is hard to integrate, and you might miss key metrics. Today, we bring you an enterprise-level automated performance inspection solution for RHEL systems, allowing you to say goodbye to the nightmare of manual performance analysis!

🎯 Pain Points Addressed

The daily routine of an operations engineer: the system slows down → run top to check CPU → iostat to check disk I/O → vmstat to check system load → free to check memory → numastat to check NUMA → dig through historical data… Several hours and a long series of commands later, the root cause of the performance issue is still elusive.

Worse still, traditional performance analysis tools scatter information across separate outputs, offer no historical trend analysis, and never form a complete performance picture. Integrating the data by hand is slow and labor-intensive, which makes it hard to meet enterprise-level performance management needs. What if an automated performance analysis solution could take care of all this?

✨ Solution Preview

Today we share an Ansible-driven automated performance analysis solution for RHEL systems. It is built on Performance Co-Pilot (PCP), an enterprise-level monitoring tool, and includes 8 core analysis modules that make your system performance analysis standardized, automated, and intelligent!

Results Preview

🧾 PCP Performance Inspection Report Excerpt (Results Only)

# RHEL Performance Inspection Report (Automatically Generated by PCP and Ansible)

**Report Generation Time:** 2025-09-27T15:30:00+08:00

---

## 🎯 Host: rhel-server-01

### 1. System Overview (PCP)
This section displays the configuration of PCP, hardware summary, and running agents, which is the first step to understanding the monitoring environment.
```text
Performance Co-Pilot (PCP) Archive Logger
Copyright (c) 2012-2024 Red Hat.
Hostname: rhel-server-01
Archive: /var/log/pcp/pmlogger/rhel-server-01/20250927
Start: Fri Sep 27 15:00:00.000 2025
End:   Fri Sep 27 15:30:00.000 2025
Commencing PCP Archive Logger (pmlogger)...
```

### 2. System Load and Uptime

Displays the current time, system uptime, number of logged-in users, and average load over the past 1, 5, and 15 minutes.

 15:30:00 up 30 days,  4:15,  2 users,  load average: 1.25, 1.10, 0.95

### 3. Memory Usage

Provides detailed information on total, used, and available physical memory (Mem) and swap space (Swap), in MB.

              total        used        free      shared  buff/cache   available
Mem:          32168       18964        3208        1252        9996       11632
Swap:          8192         512        7680

### 4. NUMA Architecture Statistics

Displays Non-Uniform Memory Access statistics, which are crucial for performance tuning on servers with multiple physical CPUs.

node0   1048576   123456   56789   1234   567  123   45   6   0
node1  1048576   234567   67890   2345   678   234   56   7   1

### 5. Disk I/O Statistics

Displays key I/O metrics such as read/write rates, queue lengths, and average wait times for each block device.

Linux 5.14.0-427.13.1.el9_4.x86_64 (rhel-server-01)  09/27/2025  _x86_64_  (4 CPU)

Device       r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda         5.2    12.3     256.8     512.4      0.1      0.2   1.9   1.6    2.1    1.8   0.05    49.4    41.7   0.8   1.4
sdb         2.1     8.7     128.4     256.8      0.0      0.1   0.0   1.1    1.5    1.2   0.02    61.1    29.5   1.2   1.3

### 6. Top 5 CPU Consuming Processes

Lists the processes currently consuming the most CPU resources, helping to quickly identify performance bottlenecks.

Linux 5.14.0-427.13.1.el9_4.x86_64 (rhel-server-01)  09/27/2025  _x86_64_  (4 CPU)

15:30:00      UID       PID    %usr %system  %guest    %CPU   CPU  Command
15:30:00        0      1234     12.5     2.1     0.0    14.6     0  java
15:30:00        0      5678      8.3     1.2     0.0     9.5     1  mysql
15:30:00      1000      9012      5.1     0.8     0.0     5.9     2  nginx
15:30:00        0     12345      3.2     1.5     0.0     4.7     3  httpd
15:30:00        0     16789      2.1     0.9     0.0     3.0     0  kworker

### 7. Metric Description

Using the `pminfo` tool, we can look up the meaning of any PCP metric. For example, the system load:

kernel.all.load
    Data Type: float  (D)
    InDom: PM_INDOM_NULL  0xffffffff
    Semantics: instant
    Units: none
    Help: 1, 5, and 15 minute load averages

### 8. Historical Performance Summary

`pmlogger` continuously archives performance data. `pmlogsummary` can calculate the average of key metrics from the latest archive, reflecting overall trends over time.

kernel.all.cpu.user: 5.8
disk.all.total: 25.3
metric: kernel.all.cpu.user
  inst [0 or ""] value 5.8
metric: disk.all.total
  inst [0 or ""] value 25.3

🎯 Traditional vs PCP Performance Analysis Comparison

| Analysis Dimension | Traditional Tool Analysis | PCP Automated Analysis |
| --- | --- | --- |
| Data Integrity | ❌ Information is scattered, requires multiple tools | ✅ Unified platform, comprehensive coverage |
| Historical Analysis | ❌ Lacks historical data, difficult to analyze trends | ✅ Automatic archiving, supports historical backtracking |
| Analysis Efficiency | ❌ Manually executing multiple commands takes 2-3 hours | ✅ One-click execution, completed in 5 minutes |
| Standardization Level | ❌ Relies on personal experience, lacks standards | ✅ Standardized reports, enterprise-level specifications |
| Scalability | ❌ Difficult to integrate between tools | ✅ Unified architecture, easy to expand |

🤔 Design Philosophy: Why Is PCP + Ansible a Best Practice?

A professional performance analysis solution is not just a simple stack of commands. Our design philosophy incorporates the core practices of Red Hat enterprise-level monitoring:

1. Enterprise-level monitoring architecture, unified performance data platform ✨ We use Performance Co-Pilot (PCP) as the core monitoring platform. It is the enterprise-level monitoring solution officially recommended by Red Hat, and it provides a unified framework for collecting, storing, and analyzing performance data, which avoids the scattered output of traditional tools.

2. Multi-dimensional performance coverage, 360-degree analysis without blind spots 💻 PCP collects performance data from the system overview down to the process level, from real-time status to historical trends, and from hardware monitoring to application performance, truly achieving “one-click access to all performance metrics”.

3. Automated report generation, professional-level performance insights ✅ Ansible automatically integrates the data collected by PCP into a structured Markdown report that includes expert-level explanations, so that ordinary operations personnel can quickly understand the meaning behind the performance data.

4. Historical data archiving, supports trend analysis 🎯 The pmlogger service of PCP continuously archives performance data, which supports historical backtracking and trend analysis and provides data for capacity planning and problem prevention.

⭐ Automated Scenario Scoring

| Scoring Dimension | Score | Description |
| --- | --- | --- |
| Ease of Use | ⭐⭐⭐⭐ | PCP is powerful, but the learning curve is slightly steep |
| Reusability | ⭐⭐⭐⭐⭐ | Enterprise-level architecture, supports large-scale deployment |
| Stability | ⭐⭐⭐⭐⭐ | Official Red Hat components, validated in production environments |
| Scalability | ⭐⭐⭐⭐⭐ | Open architecture, supports custom metrics |
| Best Practice Compliance | ⭐⭐⭐⭐⭐ | Officially recommended enterprise-level solution by Red Hat |

🗂️ Project Directory Structure

13_OS Performance Automation Analysis/
├── OsPcpAnalysis.yml            # Main analysis Playbook

📄 Core File Content Overview

🎯 Main Analysis Playbook (OsPcpAnalysis.yml)

---
# ===========================================================================
# Integrated Ansible Playbook: RHEL Host Performance Inspection and Reporting
# Description: This is a standalone Ansible Playbook that contains all necessary logic and report templates.
#       You only need one file to accomplish the following tasks:
#       1. Install and configure Performance Co-Pilot (PCP) on the target host.
#       2. Use various PCP commands to collect comprehensive real-time and historical performance data.
#       3. Generate a detailed Markdown format report on the Ansible control node.
# =========================================================================

# -----------------------------------------------------------------------------
# Play 1: Install and configure PCP on the target host
# Goal: Ensure that the PCP-related packages are installed and core services are running.
# -----------------------------------------------------------------------------
- name: "Play 1: Install and configure PCP on the target host"
  hosts: pcp_servers
  become: true
  tasks:
    - name: "Ensure pcp and pcp-system-tools are installed"
      ansible.builtin.package:
        name:
          - pcp
          - pcp-system-tools
        state: present

    - name: "Ensure pmcd and pmlogger services are started and set to start on boot"
      ansible.builtin.service:
        name: "{{ item }}"
        state: started
        enabled: true
      loop:
        - pmcd
        - pmlogger

# -----------------------------------------------------------------------------
# Play 2: Collect comprehensive performance data from the target host
# Goal: Run a series of PCP commands to capture performance snapshots and historical summaries across various dimensions.
# -----------------------------------------------------------------------------
- name: "Play 2: Collect comprehensive performance data from the target host"
  hosts: pcp_servers
  become: true
  tasks:
    - name: "1. Get PCP system overview"
      ansible.builtin.command: pcp
      register: pcp_overview_result
      changed_when: false
      ignore_errors: true

    - name: "2. Get system load and uptime"
      ansible.builtin.command: pcp uptime
      register: pcp_uptime_result
      changed_when: false
      ignore_errors: true

    - name: "3. Get memory usage (in MB)"
      ansible.builtin.command: pcp free -m
      register: pcp_free_result
      changed_when: false
      ignore_errors: true

    - name: "4. Get NUMA architecture statistics"
      ansible.builtin.command: pcp numastat
      register: pcp_numastat_result
      changed_when: false
      ignore_errors: true

    - name: "5. Get disk I/O statistics (similar to iostat)"
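      # In pcp-iostat, -x expects an argument, and without -s the command keeps sampling indefinitely.
      # Two one-second samples are requested here so that the task always terminates.
      ansible.builtin.command: pcp iostat -t 1 -s 2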
      register: pcp_iostat_result
      changed_when: false
      ignore_errors: true

    - name: "6. Get top 5 CPU consuming processes (similar to pidstat)"
      ansible.builtin.shell: "pcp pidstat -u | head -n 8"
      register: pcp_pidstat_result
      changed_when: false
      ignore_errors: true

    - name: "7. Get detailed description of 'kernel.all.load' metric (pminfo)"
      # -d prints the metric descriptor (type, semantics, units); -T prints the help text
      ansible.builtin.command: pminfo -d -T kernel.all.load
      register: pminfo_load_desc_result
      changed_when: false
      ignore_errors: true

    - name: "8. Extract summary information from historical archives (pmlogsummary)"
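      # pmlogsummary expects an archive (or archive directory) before the metric names.
      # Assumption: pmlogger writes to its default location, /var/log/pcp/pmlogger/<hostname>.
      ansible.builtin.shell: pmlogsummary -l "/var/log/pcp/pmlogger/$(hostname)" kernel.all.cpu.user disk.all.total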
      register: pmlogsummary_result
      changed_when: false
      ignore_errors: true

# -----------------------------------------------------------------------------
# Play 3: Generate and display performance report on the Ansible control node
# Goal: Integrate all data collected in Play 2 into a report file using embedded templates.
# -----------------------------------------------------------------------------
- name: "Play 3: Generate and display performance report on the Ansible control node"
  hosts: localhost
  connection: local
  gather_facts: true
  vars:
    report_filename: "integrated_pcp_report_{{ ansible_date_time.date }}.md"

    report_template_content: |
      # RHEL Performance Inspection Report (Automatically Generated by PCP and Ansible)

      **Report Generation Time:** {{ ansible_date_time.iso8601 }}

      ---

      {% for host in groups['pcp_servers'] %}
      ## 🎯 Host: {{ host }}

      ### 1. System Overview (PCP)
      This section displays the configuration of PCP, hardware summary, and running agents, which is the first step to understanding the monitoring environment.
      ```text
      {% if not hostvars[host].pcp_overview_result.failed %}{{ hostvars[host].pcp_overview_result.stdout }}{% else %}Error: {{ hostvars[host].pcp_overview_result.stderr | default('Failed to execute pcp command') }}{% endif %}
      ```

      ### 2. System Load and Uptime
      Displays the current time, system uptime, number of logged-in users, and average load over the past 1, 5, and 15 minutes.
      ```text
      {% if not hostvars[host].pcp_uptime_result.failed %}{{ hostvars[host].pcp_uptime_result.stdout }}{% else %}Error: {{ hostvars[host].pcp_uptime_result.stderr | default('Failed to execute pcp uptime command') }}{% endif %}
      ```

      ### 3. Memory Usage
      Provides detailed information on total, used, and available physical memory (Mem) and swap space (Swap), in MB.
      ```text
      {% if not hostvars[host].pcp_free_result.failed %}{{ hostvars[host].pcp_free_result.stdout }}{% else %}Error: {{ hostvars[host].pcp_free_result.stderr | default('Failed to execute pcp free -m command') }}{% endif %}
      ```

      ### 4. NUMA Architecture Statistics
      Displays Non-Uniform Memory Access statistics, which are crucial for performance tuning on servers with multiple physical CPUs.
      ```text
      {% if not hostvars[host].pcp_numastat_result.failed %}{{ hostvars[host].pcp_numastat_result.stdout }}{% else %}Error: {{ hostvars[host].pcp_numastat_result.stderr | default('Failed to execute pcp numastat command') }}{% endif %}
      ```

      ### 5. Disk I/O Statistics
      Displays key I/O metrics such as read/write rates, queue lengths, and average wait times for each block device.
      ```text
      {% if not hostvars[host].pcp_iostat_result.failed %}{{ hostvars[host].pcp_iostat_result.stdout }}{% else %}Error: {{ hostvars[host].pcp_iostat_result.stderr | default('Failed to execute pcp iostat -x command') }}{% endif %}
      ```

      ### 6. Top 5 CPU Consuming Processes
      Lists the processes currently consuming the most CPU resources, helping to quickly identify performance bottlenecks.
      ```text
      {% if not hostvars[host].pcp_pidstat_result.failed %}{{ hostvars[host].pcp_pidstat_result.stdout }}{% else %}Error: {{ hostvars[host].pcp_pidstat_result.stderr | default('Failed to execute pcp pidstat command') }}{% endif %}
      ```

      ### 7. Metric Description
      Using `pminfo` tool, we can gain insights into the meaning of any PCP metric. For example, the system load:
      ```text
      {% if not hostvars[host].pminfo_load_desc_result.failed %}{{ hostvars[host].pminfo_load_desc_result.stdout }}{% else %}Error: {{ hostvars[host].pminfo_load_desc_result.stderr | default('Failed to execute pminfo command') }}{% endif %}
      ```

      ### 8. Historical Performance Summary
      `pmlogger` continuously archives performance data. `pmlogsummary` can calculate the average of key metrics from the latest archive, reflecting overall trends over time.
      ```text
      {% if not hostvars[host].pmlogsummary_result.failed %}{{ hostvars[host].pmlogsummary_result.stdout }}{% else %}Error: {{ hostvars[host].pmlogsummary_result.stderr | default('Failed to execute pmlogsummary command') }}{% endif %}
      ```
      ---
      {% endfor %}

  tasks:
    - name: "Generate Markdown format report using embedded template variables"
      ansible.builtin.copy:
        content: "{{ report_template_content }}"
        dest: "./{{ report_filename }}"
        mode: '0644'

    - name: "Output report path and content summary to console"
      ansible.builtin.debug:
        msg: |
          =================================================================
          ✅ Performance report has been successfully generated!

          Report saved to: {{ report_filename }}

          Please open this Markdown file to view the detailed report.
          ===================================================================

🛠️ Foolproof Deployment Guide

Reading about it a thousand times is no substitute for doing it once!

Prerequisites

1. One Ansible control node.
2. The target RHEL servers are configured for SSH trust, and the user executing Ansible has `sudo` permissions.
3. The control node has Ansible installed.
4. The target servers can access the Red Hat software repository (for installing PCP).
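
The prerequisites above can be checked from the control node before the first run. The following is only a minimal sketch and is not part of the original project: the file name `preflight.yml` is hypothetical, and it assumes the `pcp_servers` inventory group used throughout this article.

```yaml
---
# preflight.yml (hypothetical file name): verify the prerequisites before the analysis
- name: "Pre-flight: verify connectivity and privilege escalation"
  hosts: pcp_servers
  gather_facts: false
  tasks:
    - name: "Confirm SSH access and a usable Python interpreter on the target"
      ansible.builtin.ping:

    - name: "Confirm that privilege escalation yields root (UID 0)"
      ansible.builtin.command: id -u
      become: true
      register: become_check
      changed_when: false
      failed_when: become_check.stdout | trim != "0"
```

Run it with `ansible-playbook -i inventory preflight.yml`. A host that fails either task most likely needs its SSH trust or `sudo` configuration fixed first.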

Project Directory Structure

This is a very simple project: you only need two files!

13_OS Performance Automation Analysis/
├── OsPcpAnalysis.yml            # Main analysis Playbook
└── inventory                    # Host inventory (needs to be created)

How to Use?

1. Create Host Inventory 📝: Create an `inventory` file and fill in your server hostnames or IP addresses.

[pcp_servers]
rhel-server-01.example.com
rhel-server-02.example.com
# or
# 192.168.1.100
# 192.168.1.101

2. Execute Automation ▶️: Run the following command, then go make a cup of coffee ☕️!

ansible-playbook -i inventory OsPcpAnalysis.yml

3. View Report 📊: After execution, a performance report file with a name like `integrated_pcp_report_2025-09-27.md` is generated in the current directory.
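
If you prefer to keep everything in YAML, the INI inventory from step 1 can also be written as a YAML inventory. This is an equivalent sketch; the file name `inventory.yml` is an assumption, and the hostnames are the same placeholders used above.

```yaml
# inventory.yml (assumed file name): YAML equivalent of the INI inventory from step 1
all:
  children:
    pcp_servers:
      hosts:
        rhel-server-01.example.com:
        rhel-server-02.example.com:
```

Pass it to the same command: `ansible-playbook -i inventory.yml OsPcpAnalysis.yml`.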

🔍 Analysis Coverage

✅ PCP System Overview Analysis

• System Architecture: PCP service configuration, hardware information summary
• Agent Status: Running PCP agents and service status
• Archiving Information: Log archiving configuration and time range

✅ System Load and Uptime Analysis

• System Load: 1-minute, 5-minute, and 15-minute average load
• Uptime: System uptime and user login information
• Timestamp: Accurate data collection time records
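
A load figure only becomes meaningful when it is compared with the number of CPUs. As a rough sketch of how that comparison could be automated, the play below extracts the 1-minute value and compares it with the `ansible_processor_vcpus` fact. It assumes that `pcp uptime` ends with the familiar `load average: a, b, c` text shown in the sample report; the variable names `uptime_check` and `load_1min` are made up for this illustration.

```yaml
---
# Sketch: warn when the 1-minute load average is higher than the vCPU count.
# Assumption: `pcp uptime` prints "load average: a, b, c" as in the sample report.
- name: "Flag hosts whose load exceeds their CPU count"
  hosts: pcp_servers
  gather_facts: true   # provides ansible_processor_vcpus
  tasks:
    - name: "Collect load and uptime through PCP"
      ansible.builtin.command: pcp uptime
      register: uptime_check
      changed_when: false

    - name: "Extract the 1-minute load average"
      ansible.builtin.set_fact:
        load_1min: "{{ uptime_check.stdout | regex_findall('load average: ([0-9.]+)') | first | float }}"

    - name: "Warn when the load is higher than the number of vCPUs"
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }}: load {{ load_1min }} exceeds {{ ansible_processor_vcpus }} vCPUs"
      when: load_1min | float > ansible_processor_vcpus | int
```

Comparing against the vCPU count is a common rule of thumb rather than a hard limit, so treat the warning as a prompt to look closer.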

✅ Memory Usage Analysis

• Physical Memory: Total, used, available, cache, buffer, and other detailed metrics
• Swap Space: Swap usage and available space
• Memory Allocation: Shared memory, available memory, and other key information
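
The memory figures can also drive a simple pass/fail check. The sketch below assumes that `pcp free -m` mirrors the column layout of `free(1)` shown in the sample report (total in the second column of the `Mem:` row, available in the seventh). The threshold `min_available_pct` and the variable names are hypothetical.

```yaml
---
# Sketch: fail when available memory drops below a chosen share of total memory.
# Assumption: the Mem: row is "Mem: total used free shared buff/cache available".
- name: "Check available memory"
  hosts: pcp_servers
  gather_facts: false
  vars:
    min_available_pct: 10   # hypothetical threshold, in percent
  tasks:
    - name: "Collect memory usage in MB"
      ansible.builtin.command: pcp free -m
      register: free_check
      changed_when: false

    - name: "Split the Mem: row into its columns"
      ansible.builtin.set_fact:
        mem_fields: "{{ (free_check.stdout_lines | select('match', '^Mem:') | first).split() }}"

    - name: "Assert that enough memory is available"
      ansible.builtin.assert:
        that:
          - (mem_fields[6] | int) * 100 >= (mem_fields[1] | int) * min_available_pct
        fail_msg: "Only {{ mem_fields[6] }} MB of {{ mem_fields[1] }} MB is available"
```

With the sample values from the report (11632 MB available out of 32168 MB), the assertion passes.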

✅ NUMA Architecture Statistics Analysis

• Node Statistics: Memory access statistics for each NUMA node
• Local Access: numa_hit (memory successfully allocated on the intended node)
• Remote Access: numa_miss (memory allocated on a node other than the preferred one, which leads to slower cross-node access)
• Performance Optimization: Provides data support for performance tuning on multi-CPU servers

✅ Disk I/O Performance Analysis

• IOPS Metrics: Read and write counts per second (r/s, w/s)
• Throughput: Amount of data read and written (rkB/s, wkB/s)
• Wait Time: Average I/O wait time (await)
• Queue Length: Device queue depth (aqu-sz)
• Device Utilization: Disk busy level (%util)

✅ Process-Level CPU Analysis

• Top Processes: List of processes with the highest CPU usage
• User Mode/Kernel Mode: Distinguishing between user mode and kernel mode CPU usage
• Process Information: Detailed information such as PID, UID, process name, etc.
• Performance Bottlenecks: Quickly identify major CPU resource consumers

✅ Metric Description

• Metric Metadata: Data type, units, semantic description
• Help Information: Professional explanations of PCP metrics
• Metric Relationships: Understanding the relationships between metrics
• Expert Guidance: Provides professional guidance for performance analysis
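
The metric description step does not have to stop at a single metric. As an illustrative sketch, the play below looks up the descriptor (`-d`) and help text (`-T`) for every metric that appears in this article. The list name `metrics_to_describe` and the registered variable are assumptions made for the example.

```yaml
---
# Sketch: describe several PCP metrics in one pass.
- name: "Describe several PCP metrics"
  hosts: pcp_servers
  gather_facts: false
  vars:
    metrics_to_describe:   # example selection, taken from the metrics used in this article
      - kernel.all.load
      - kernel.all.cpu.user
      - disk.all.total
  tasks:
    - name: "Fetch descriptor (-d) and help text (-T) for each metric"
      ansible.builtin.command: "pminfo -d -T {{ item }}"
      loop: "{{ metrics_to_describe }}"
      register: pminfo_results
      changed_when: false

    - name: "Show each description"
      ansible.builtin.debug:
        msg: "{{ item.stdout_lines }}"
      loop: "{{ pminfo_results.results }}"
      loop_control:
        label: "{{ item.item }}"   # print only the metric name as the loop label
```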

✅ Historical Performance Trend Analysis

• Average Calculation: Statistical averages of historical data
• Trend Analysis: Time variation trends of performance metrics
• Capacity Planning: Provides data support for system expansion
• Problem Prevention: Prevent potential performance issues through trend analysis

✅ Enterprise-Level Features

• Unified Platform: All performance data is collected through PCP as a single platform
• Historical Archiving: Automatically saves historical data, supports trend analysis
• Standard Reports: Generates professional-level Markdown format reports
• Multi-Host Support: Batch analysis of performance across multiple servers
• Error Tolerance: Comprehensive error handling and default value mechanisms

💡 Tips for Use

🎯 Batch Performance Analysis

# Add multiple servers in the inventory
[pcp_servers]
server1 ansible_host=192.168.1.100
server2 ansible_host=192.168.1.101
server3 ansible_host=192.168.1.102

# Raise the fork count above Ansible's default of 5 so that more hosts are processed in parallel
ansible-playbook OsPcpAnalysis.yml -i inventory --forks 10
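
`--forks` only caps how many hosts Ansible works on at the same time. With the default linear strategy, every host still waits until all hosts have finished the current task. If one slow server should not hold up the rest, the collection play could opt into the `free` strategy, as in this sketch (shown with a single task for brevity; whether it helps depends on your environment):

```yaml
# Sketch: the collection play with the "free" strategy enabled
- name: "Play 2: Collect comprehensive performance data from the target host"
  hosts: pcp_servers
  become: true
  strategy: free   # each host advances through the tasks without waiting for the others
  tasks:
    - name: "2. Get system load and uptime"
      ansible.builtin.command: pcp uptime
      register: pcp_uptime_result
      changed_when: false
      ignore_errors: true
```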

🔧 Custom Analysis Scope

Modify the PCP commands in the Playbook as needed:

• Add sampling options to the `pcp iostat` task, for example `pcp iostat -t 1 -s 5`, to collect five samples at one-second intervals
• Add more PCP commands like `pcp netstat`, `pcp df`, etc.
• Adjust the number of lines displayed for `pcp pidstat`.
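
When the list of commands changes often, it may be easier to keep it in a variable than to copy a task for each command. The following is one possible sketch, not the structure of the original Playbook: the dictionary `pcp_commands` and the registered variable `pcp_results` are names invented for this example, and the three commands are taken from the Playbook above.

```yaml
---
# Sketch: drive the data collection from a configurable command list.
- name: "Collect PCP data from a configurable command list"
  hosts: pcp_servers
  become: true
  gather_facts: false
  vars:
    pcp_commands:          # section name -> command; extend as needed
      uptime: "pcp uptime"
      free: "pcp free -m"
      numastat: "pcp numastat"
  tasks:
    - name: "Run each configured command"
      ansible.builtin.command: "{{ item.value }}"
      loop: "{{ pcp_commands | dict2items }}"
      loop_control:
        label: "{{ item.key }}"
      register: pcp_results
      changed_when: false
      ignore_errors: true

    - name: "Print the output of each section"
      ansible.builtin.debug:
        msg: "{{ item.stdout_lines | default([]) }}"
      loop: "{{ pcp_results.results }}"
      loop_control:
        label: "{{ item.item.key }}"
```

Adding a new section then means adding one line to `pcp_commands`, at the cost of a report template that has to loop over `pcp_results.results` instead of reading one variable per section.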

🐛 Troubleshooting

If you encounter issues related to PCP:

1. Check PCP Services: `systemctl status pmcd pmlogger`
2. Validate PCP Commands: Manually execute the `pcp` command on the target server
3. Check Logs: `journalctl -u pmcd -u pmlogger`
4. Check Permissions: Ensure the executing user has sufficient permissions
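
The first check can be run against all servers at once. The sketch below uses Ansible's `service_facts` module and assumes systemd hosts, where services appear under names such as `pmcd.service`.

```yaml
---
# Sketch: verify on every host that pmcd and pmlogger are running.
- name: "Verify that the PCP services are running"
  hosts: pcp_servers
  become: true
  gather_facts: false
  tasks:
    - name: "Gather the state of all services"
      ansible.builtin.service_facts:

    - name: "Assert that pmcd and pmlogger are running"
      ansible.builtin.assert:
        that:
          - ansible_facts.services[item ~ '.service'] is defined
          - ansible_facts.services[item ~ '.service'].state == 'running'
        fail_msg: "{{ item }} is not running; check 'journalctl -u {{ item }}' on the host"
      loop:
        - pmcd
        - pmlogger
```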

🎯 Advanced Usage

Regular Inspections

# Combine with cron for regular performance inspections
# Automatically execute performance analysis every day at 9 AM
0 9 * * * cd /path/to/pcp-analysis && ansible-playbook -i inventory OsPcpAnalysis.yml
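
The same schedule can be managed by Ansible itself instead of editing the crontab by hand. This is a sketch under the same assumptions as the entry above: `/path/to/pcp-analysis` is a placeholder that you need to replace, and the entry is installed for the user who runs the play on the control node.

```yaml
---
# Sketch: install the daily 09:00 inspection as a managed cron entry.
- name: "Schedule the daily performance inspection on the control node"
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: "Install the 09:00 cron entry for the current user"
      ansible.builtin.cron:
        name: "Daily PCP performance inspection"   # identifies the entry on later runs
        minute: "0"
        hour: "9"
        job: "cd /path/to/pcp-analysis && ansible-playbook -i inventory OsPcpAnalysis.yml"
```

Because the entry is identified by its `name`, running the play again updates the existing line instead of adding a duplicate.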

Establishing Performance Baselines

# Establish performance baselines under normal system conditions
ansible-playbook -i inventory OsPcpAnalysis.yml
# Save the report as a baseline for future comparative analysis
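
Saving the baseline can be scripted as well. The sketch below copies the report generated today into a separate directory; `baseline_dir` is a hypothetical location, and the play assumes it is run from the directory that contains the report.

```yaml
---
# Sketch: keep today's report as a baseline for later comparison.
- name: "Keep today's report as a baseline"
  hosts: localhost
  connection: local
  gather_facts: true   # provides ansible_date_time
  vars:
    report_filename: "integrated_pcp_report_{{ ansible_date_time.date }}.md"
    baseline_dir: "./baselines"   # hypothetical directory
  tasks:
    - name: "Create the baseline directory"
      ansible.builtin.file:
        path: "{{ baseline_dir }}"
        state: directory
        mode: '0755'

    - name: "Copy the report, leaving an existing baseline untouched"
      ansible.builtin.copy:
        src: "./{{ report_filename }}"
        dest: "{{ baseline_dir }}/baseline_{{ ansible_date_time.date }}.md"
        remote_src: true   # resolve src on the managed host, which is localhost here
        force: false       # do not overwrite a baseline that already exists
        mode: '0644'
```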

Locating Performance Issues

# Quickly generate a current status report when performance issues arise
ansible-playbook -i inventory OsPcpAnalysis.yml --limit problematic-server

🎁 Summary

This automated performance analysis solution for RHEL systems achieves:

• 🔍 Comprehensive Coverage: From system overview to process level, from real-time status to historical trends, comprehensive performance analysis
• 🚀 One-Click Execution: Automates the complete process of PCP installation, configuration, data collection, and report generation
• 📊 Professional Reports: Generates structured Markdown reports containing expert-level analysis explanations
• 🔧 Enterprise-Level Architecture: Based on the PCP monitoring platform recommended by Red Hat, validated in production environments
• 📈 Historical Analysis: Supports performance trend analysis and capacity planning
• 🛡️ High Reliability: Comprehensive error handling and fault tolerance mechanisms
• ⏰ Standardization: Standard processes for enterprise-level performance management

🎯 Now the operations team can:

• Quickly Diagnose system performance issues without manually executing multiple commands
• Establish Baselines and analyze performance trends through historical data
• Batch Manage performance monitoring across multiple servers
• Improve Efficiency, reducing performance analysis time from hours to minutes

What are you waiting for? Download this enterprise-level performance analysis solution and elevate your system performance management capabilities to new heights!

🚀 Advanced Application Scenarios

Enterprise-Level Monitoring Deployment

• Monitoring Center: Establish a unified PCP monitoring center
• Alert Integration: Integrate with existing alert systems
• Visualization: Combine with Grafana for performance visualization
• SLA Monitoring: Establish a service level agreement monitoring system

Performance Capacity Planning

• Trend Analysis: Capacity planning based on historical data
• Predictive Models: Establish performance predictive models
• Expansion Recommendations: Provide data-driven expansion recommendations
• Cost Optimization: Optimize resource allocation and reduce costs

Fault Prevention System

• Baseline Comparison: Establish performance baselines to detect anomalies in a timely manner
• Trend Alerts: Preventive alerts based on trend analysis
• Automated Recovery: Achieve self-healing of faults with automation tools
• Knowledge Base: Establish a knowledge base for performance issues and handling processes

DevOps Integration

• CI/CD Integration: Include performance checks in the deployment process
• A/B Testing: Support performance testing for version comparisons
• Monitoring as Code: Include monitoring configurations in version control
• Automated Operations: Build a complete automated operations system

Tags: #Ansible #PCP #Performance Monitoring #System Analysis #RHEL #Enterprise-Level #Automated Operations #Performance Tuning #Monitoring Solutions #Red Hat Certified
