Background
This article is the fourth in the series on telecom AIOps intelligent agents. Building on the previous articles on Monitoring and Diagnostics agents, it focuses on the most critical link in the operational closed loop: Optimization and Repair. Based on the agent workflow orchestration capabilities and the RAG knowledge base, this article constructs a set of AI optimization agents whose core task is to receive diagnostic reports and automatically generate safe, efficient, and executable network optimization plans.
The scenarios in this series of articles are derived from the TMF Catalyst project initiated by major telecom companies such as Verizon in the USA, BT in the UK, and Etisalat in the UAE: Unleash the potential of GenAI-powered 5G network slicing.

As the commercial deployment of 5G deepens, network slicing, as a core technology, provides customized network services for vertical industries (such as smart ports, ultra-high-definition live broadcasting, and autonomous driving). However, ensuring end-to-end SLA for these slices is a significant challenge. Traditional NOCs (Network Operation Centers) rely on domain-specific, static threshold-based monitoring systems, leading to alarm storms, slow fault localization, and inefficient cross-team communication, making it difficult to meet dynamic and stringent SLA requirements.
The first article in this series, (I) Top-Level Design, analyzed a challenging scenario in depth: dynamic quality of service (QoS) assurance for eMBB slices during large sporting events. It proposed a solution closely aligned with the concept of “Autonomous Networks” and described a new AI+OSS paradigm in which a large language model (LLM) acts as the “cognitive core” and works with the following four specialized AI agents. The aim is to move from passive “fault repair” to proactive, autonomous “experience assurance,” paving the way for large-scale commercial deployment of 5G slicing.

- 1. Monitoring Agent: Monitors network slice performance and SLA metrics around the clock, serving as the team’s “sentinel.” Covered in (II) Monitoring Agent: Practical Application of AI Multi-Agent Collaborative Architecture in 5G Slicing Optimization.
- 2. Diagnostics Agent: Queries and correlates data across systems and domains when an alert arrives in order to identify the root cause, serving as the team’s “detective.” Covered in (III) Diagnostics Agent: Practical Application of AI Multi-Agent Collaborative Architecture in 5G Slicing Optimization.
- 3. Optimization Agent: Executes specific network adjustments (such as resource scheduling and configuration changes) based on diagnostic conclusions and LLM-approved plans, serving as the team’s “engineer.” [The agent implemented in this article]
- 4. Reporting Agent: Reports the entire event handling process and its results to human experts in natural language, serving as the team’s “communicator.”

On the road to achieving highly autonomous networks, if monitoring and diagnostics are about “identifying problems” and “analyzing problems,” then optimization is about “solving problems.” The value of AI large models in this phase lies in their ability to simulate the decision-making process of experienced network architects, weighing the pros and cons of various solutions, and generating detailed, risk-controlled execution plans in conjunction with standard operating procedures (SOP).

In traditional operations, the process from diagnostic report to the final implementation of repair plans often involves multiple time-consuming steps such as plan writing, multi-party reviews, and risk assessments. The AI Optimization Agent automates this process, reducing the plan generation time from several hours to a few minutes. More importantly, it ensures that each plan adheres to best practices and safety protocols through a programmatic approach, significantly reducing the risk of secondary failures caused by human operational errors.

The core systems involved in this case include: AIOps Data Platform, Business Process Orchestrator, Network Slice Management Function (NSMF)/RAN Controller, Resource Management System (RM), and RAG Knowledge Base (RAGKB).

1. Scenario Description
This case picks up from the output of the AI Diagnostics Agent. The diagnostics agent has completed its mission and generated a clear root cause analysis (RCA) report, `RCA-NSI-eMBB-Stadium-01-dfc2a198`, indicating that the SLA degradation of the eMBB network slice for the large sporting event is caused by Physical Cell Identity (PCI) conflicts on the RAN side.

Now, this report is automatically forwarded to the AI Optimization Agent. Its task is no longer “why,” but “how.” The agent needs to design the best repair plan based on this exact diagnostic conclusion. This is not just a simple command execution, but a complete decision-making process that includes plan selection, risk assessment, resource verification, step planning, and rollback plans.

Optimization Target: Diagnosed PCI Conflict Event
1. Optimization Trigger
- Input: The RCA report ID from the `AI Diagnostics Agent`, such as `RCA-NSI-eMBB-Stadium-01-dfc2a198`.
2. Optimization Goals
- Generate the Best Solution: Select the strategy with the lowest risk, highest efficiency, and smallest impact from the possible repair strategies (e.g., trigger SON, manual re-planning, adjust power).
- Create an Executable Plan: Generate a detailed “Network Optimization Action Plan” in a standardized Method of Procedure (MOP) format that includes all necessary checks and commands.
- Ensure Safety and Compliance: Every step of the plan must comply with internal operational safety regulations and SOPs.
3. Key Decision-Making Inputs
When formulating the plan, the AI Optimization Agent needs to query and analyze the following information:
| Information Category | Key Inquiry Parameters | Query System | Decision Value |
|---|---|---|---|
| Diagnostic Conclusion | Root Cause, Domain, Confidence | AIOps Platform (RCA Report) | Starting point for decision-making, clarifying the problem to be solved. |
| Network Policy | Is SON function enabled, change window policy | Business Orchestration System (Orchestrator) | Determine if the automated solution (e.g., triggering SON) is feasible. |
| Resource Availability | List of unused PCIs in the area | Resource Management (RM) | Provide available resources for manual re-planning. |
| Standard Operating Procedures (SOP) | Standard process for PCI changes, risk description | RAG Knowledge Base (RAGKB) | Ensure that the generated plan steps comply with best practices and safety requirements. |
| Network Status | Current alarms and traffic of affected cells | AIOps Platform/Fault Management System (FM) | Used for pre-checks of the plan, ensuring changes are executed in a stable network state. |
2. On-Site Situation
In the operator’s NOC, once an RCA report is generated, it typically enters a “plan formulation and review” queue:
- 1. The RAN optimization engineer receives the work order and reads the RCA report.
- 2. The engineer needs to manually log into the resource management system to check which backup PCIs are available in the conflict area.
- 3. Then, the engineer needs to manually consult the internal knowledge base to find the “PCI Change Operation Guide V2.5.docx.”
- 4. Based on the guide template, the engineer needs to manually write a change plan (MOP), including pre-checks, execution commands, post-checks, rollback steps, etc.
- 5. The plan is submitted to the change management committee for review, which may require several hours of meeting communication.
This case aims to fully automate steps 1-4 using the AI Optimization Agent and elevate the quality and standardization of the plan to the level of “no review” or “fast-track review.”
3. System Architecture

The decision-making and execution of the AI Optimization Agent rely on deep collaboration with multiple core OSS/BSS systems.
1. AIOps Data Platform
As the input source for decision-making, it provides the RCA report that triggers this process, as well as real-time network status data needed during plan formulation.
2. Business Process Orchestrator
As the guardian of business rules, it defines the “red lines” for operational actions. For example, it informs the agent: “Currently in a major activity assurance period, no configuration changes are allowed for URLLC slices,” or “The PCI automatic optimization function of SON is currently enabled in this area.” This provides critical strategic constraints for the agent’s plan selection.
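The “red line” check described above could be sketched as a small guard function that the agent consults before it considers any strategy. This is only a minimal sketch: the `PolicySnapshot` fields and the `is_change_allowed` helper are hypothetical names chosen for illustration, not the Orchestrator’s actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class PolicySnapshot:
    """Hypothetical view of the rules returned by the Orchestrator."""
    assurance_period_active: bool = False
    frozen_slice_types: set = field(default_factory=set)  # e.g. {"URLLC"}
    son_pci_optimization_enabled: bool = False


def is_change_allowed(policy: PolicySnapshot, slice_type: str) -> tuple:
    """Return (allowed, reason) for a configuration change on a slice type."""
    if policy.assurance_period_active and slice_type in policy.frozen_slice_types:
        return False, f"{slice_type} slices are frozen during the assurance period"
    return True, "no red line applies"


# Illustrative policy: URLLC is frozen, SON PCI optimization is enabled.
policy = PolicySnapshot(
    assurance_period_active=True,
    frozen_slice_types={"URLLC"},
    son_pci_optimization_enabled=True,
)
print(is_change_allowed(policy, "URLLC"))
print(is_change_allowed(policy, "eMBB"))
```

Returning a reason alongside the boolean keeps the decision explainable, which matters when the result is later quoted in the plan.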
3. Resource Management System (RM)
This is the agent’s resource repository. When manual PCI re-planning is needed, the agent must query RM to obtain a unique, available PCI resource in that geographical area.
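As a rough illustration of what “a unique, available PCI” means in practice, the sketch below picks a replacement from the list RM returns. The function name and the neighbor list are assumptions for illustration; the only hard fact used is that 5G NR defines PCI values 0 to 1007. Real PCI planning also considers reuse distance and modulo-based constraints, which this sketch ignores.

```python
def pick_replacement_pci(available_pcis, neighbor_pcis, current_pci):
    """Pick the lowest free PCI that differs from the current one and all neighbors.

    Hypothetical helper: `available_pcis` stands for the RM answer,
    `neighbor_pcis` for PCIs already used by neighboring cells.
    """
    blocked = set(neighbor_pcis) | {current_pci}
    candidates = sorted(
        p for p in available_pcis
        if 0 <= p <= 1007 and p not in blocked  # valid 5G NR PCI range
    )
    return candidates[0] if candidates else None


# The conflicting cell uses PCI 42; one neighbor uses 57.
print(pick_replacement_pci([42, 57, 118, 301], [42, 57], 42))
```

Returning `None` when nothing is free lets the caller fall back to another strategy instead of failing.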
4. Network Slice Management Function (NSMF) / RAN Controller
This is the final executor of the plan. Although in this case’s workflow, the agent only generates the plan and does not execute it directly, the commands or API call requests generated by it will ultimately be consumed and executed by these systems. Therefore, the format of the plan must be fully compatible with the interface specifications of these systems.
5. RAG Knowledge Base (RAGKB)
As a best practice advisor, RAGKB stores all standard operating procedures (SOP). When the agent determines to execute the action of “manually modifying PCI,” it will query RAGKB for the “PCI Change SOP” and use the retrieved standard steps, safety notes, verification methods, etc., as core materials for generating the final action plan, ensuring the professionalism and compliance of the plan.
4. Technical Plan

The Dify workflow of the AI Optimization Agent simulates the complete thought process of a network architect from receiving a problem to outputting a detailed solution.
Step 1: Obtain and Parse Diagnostic Report
- Node Type: HTTP & Code
- Node Name: `GET_RCA_REPORT` & `PARSE_RCA_REPORT`
- Function: The workflow is triggered by the `rca_report_id` passed in from the `Start` node. First, the HTTP node retrieves the complete RCA report JSON from the AIOps platform. Then, the Code node parses this JSON to extract core diagnostic information such as `root_cause` (“PCI Collision”), `domain` (“RAN”), `affected_elements` ([“Cell-101”, “Cell-外部-205”]), and `confidence` (0.95).
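The parsing half of this step might look like the following sketch. It assumes the HTTP body is a flat JSON object with the lower-case keys named above; the real payload shape is not shown in this article, and the `parse_rca_report` name is illustrative.

```python
import json


def parse_rca_report(body: str) -> dict:
    """Extract the fields the downstream nodes need from the raw RCA JSON."""
    report = json.loads(body)
    confidence = float(report.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return {
        "root_cause": report.get("root_cause", ""),
        "domain": report.get("domain", ""),
        "affected_elements": list(report.get("affected_elements", [])),
        "confidence": confidence,
    }


# Sample body built from the values quoted in this step (assumed layout).
sample = json.dumps({
    "root_cause": "PCI Collision",
    "domain": "RAN",
    "affected_elements": ["Cell-101", "Cell-外部-205"],
    "confidence": 0.95,
})
parsed = parse_rca_report(sample)
print(parsed["root_cause"], parsed["affected_elements"])
```

Rejecting an out-of-range confidence early avoids feeding a malformed report into the later scoring step.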
Step 2: Generate Preliminary Solution Options
- Node Type: LLM
- Node Name: `GENERATE_SOLUTION_OPTIONS`
- Function: This is a “strategy divergence” node. It passes the diagnosed `root_cause` to the LLM. Guided by a “network planning expert” role prompt, the LLM proposes all plausible solutions based on its general knowledge. For example, for PCI conflicts, it may suggest:
- 1. Trigger SON automatic optimization.
- 2. Manually reassign PCI for conflicting cells.
- 3. Adjust the transmission power of conflicting cells to reduce interference range.
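Because the LLM answers in free text, a small normalization step before scoring can help. The sketch below assumes the model returns one option per line, possibly with list markers; `parse_solution_options` is a hypothetical helper, and any introductory sentence the model adds would still be treated as an option.

```python
import re


def parse_solution_options(llm_text: str) -> list:
    """Turn a numbered or bulleted LLM answer into a clean list of options."""
    options = []
    for line in llm_text.splitlines():
        # Strip one leading list marker such as "1.", "2)", "-", "*" or "•"
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned:
            options.append(cleaned)
    return options


answer = (
    "1. Trigger SON automatic optimization.\n"
    "\n"
    "2) Manually reassign PCI for conflicting cells.\n"
    "- Adjust the transmission power of conflicting cells."
)
for option in parse_solution_options(answer):
    print(option)
```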
Step 3: Evaluate and Select the Best Solution
- Node Type: Code & IF/ELSE
- Node Name: `EVALUATE_AND_SELECT_OPTION` & `CHECK_BEST_OPTION`
- Function: This is the key decision node for “strategy convergence.” The Code node receives the solution options generated by the LLM and calls tools to enrich the decision basis:
  - Query the `Orchestrator`: “Is the SON function enabled in this area?”
  - Query the `Resource Manager`: “Are there available PCIs?”
  - Based on this information and a built-in risk assessment model (e.g., triggering SON is low risk, manual modification is medium risk, adjusting power may affect coverage), score each option.
  - Finally, the IF/ELSE node selects the optimal option based on the scores (e.g., if SON is available, prioritize SON).
Step 4: Retrieve Standard Operating Procedures (SOP)
- Node Type: Knowledge Retrieval
- Node Name: `RETRIEVE_REMEDIATION_SOP`
- Function: By this point the workflow has determined the best action (e.g., “trigger SON for PCI re-planning”). This node runs a semantic search against the RAGKB knowledge base and retrieves the internal standard operating procedure (SOP) document fragments that best match this action.
Step 5: Generate Detailed Optimization Action Plan (MOP)
- Node Type: LLM
- Node Name: `GENERATE_OPTIMIZATION_PLAN`
- Function: This is the final value output of the workflow. It integrates all information: the selected best solution, the SOP retrieved from RAGKB, the list of affected network elements, and the original RCA report. Guided by a “top-tier operations engineer” role prompt, the LLM generates a professional, complete network optimization action plan (MOP) in a standardized format.
Step 6: Output Final Report
- Node Type: End
- Function: Display the complete MOP report generated by the `GENERATE_OPTIMIZATION_PLAN` node as the final result of this optimization process.
5. Sample 5G Network Slicing Optimization and Repair Action Plan (MOP)
1. Action Plan Information
- Plan Number (MOP ID): `MOP-PCI-OPT-Cell-外部-205-0f3b4c1a`
- Related RCA Report: `RCA-NSI-eMBB-Stadium-01-dfc2a198`
- Priority Level: Urgent
- Plan Creator: AI Trainer – AI Optimization Agent
2. Optimization Goals and Summary
Goal: Resolve the SLA degradation of the eMBB slice `NSI-eMBB-Stadium-01` caused by the PCI conflict between cell `Cell-外部-205` (PCI: 42) and `Cell-101` (PCI: 42).
Summary: This plan triggers the self-organizing network (SON) function to optimize PCI automatically, reassigning a non-conflicting PCI to `Cell-外部-205`. The solution is automated and low-risk, and is expected to restore service within 15 minutes.
3. Pre-checks
- 1. Confirm SON Status:
  - Check Item: Query the Orchestrator to confirm that the SON PCI optimization function is enabled in the `CLUSTER-SOUTH-03` area.
  - Expected Result: `{"son_pci_optimization_status": "enabled"}` [Check Passed]
- 2. Confirm Alarm Status:
  - Check Item: Query the FM system to confirm that `Cell-外部-205` and its neighboring cells have no `Critical` or `Major` level hardware alarms.
  - Expected Result: `{"active_critical_alarms": 0}` [Check Passed]
4. Safety Precautions
- This operation is executed automatically through the SON system, so no manual login to network elements is needed, which reduces the risk of manual operations.
- During the operation, the `AI Monitoring Agent` will monitor the SINR, handover success rate, and PDU session establishment success rate of the affected cell cluster at one-minute intervals.
- If key KPIs do not improve within five minutes after execution, or if new `Critical` alarms occur, the rollback plan will be triggered automatically.
5. Execution Steps
- 1. [Step 1] Trigger SON PCI Optimization Task (Automated)
  - Target System: SON Platform
  - API Endpoint: `POST /api/son/v1/actions/trigger`
  - Request Body (Payload): `{ "action": "pci_optimization", "scope": { "cell_ids": ["Cell-外部-205"] }, "priority": "high", "execution_mode": "auto" }`
- 2. [Step 2] Monitor Task Execution Status (Automated)
  - Target System: SON Platform
  - API Endpoint: `GET /api/son/v1/tasks/{task_id}`
  - Expected Status: The task should change from `RUNNING` to `COMPLETED` within five minutes.
6. Post-checks
- 1. [Step 3] Verify PCI Change (Automated)
  - Target System: Resource Management (RM)
  - Check Item: Query the configuration of `Cell-外部-205` to confirm that its PCI is no longer 42.
- 2. [Step 4] Verify KPI Recovery (Automated)
  - Target System: AIOps Data Platform
  - Check Item: Query the SINR and handover success rate of `Cell-101` to confirm that the metrics have returned to baseline levels within ten minutes.
7. Rollback Plan
- Trigger Condition: Post-check step 4 fails, or a `Critical` alarm occurs during execution.
- Operation:
  - 1. Send a `POST /api/son/v1/actions/revert` command to the SON system to revert the PCI change.
  - 2. Automatically create a P1 (highest priority) fault ticket and notify the NOC expert team for immediate intervention.
Prepared by: AI Trainer
Preparation Time: 2025-03-13 20:36 UTC
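The automatic rollback rule in the sample plan (a new `Critical` alarm, or no KPI improvement within five minutes) can be expressed as a small decision function. This is a sketch under stated assumptions: `KpiSample` and `should_roll_back` are hypothetical names, and “improvement” is defined here as higher SINR with a handover success rate that has not dropped, which the MOP itself does not specify.

```python
from dataclasses import dataclass


@dataclass
class KpiSample:
    minute: int              # minutes elapsed since the change was executed
    sinr_db: float
    handover_success: float  # ratio in [0, 1]
    critical_alarms: int


def should_roll_back(baseline: KpiSample, samples: list, window_min: int = 5) -> bool:
    """Roll back on any new Critical alarm, or when the observation window
    has elapsed without any sample improving on the baseline."""
    if any(s.critical_alarms > baseline.critical_alarms for s in samples):
        return True
    in_window = [s for s in samples if s.minute <= window_min]
    improved = any(
        s.sinr_db > baseline.sinr_db
        and s.handover_success >= baseline.handover_success
        for s in in_window
    )
    window_elapsed = any(s.minute >= window_min for s in samples)
    return window_elapsed and not improved


# Illustrative values only: one-minute samples after the change.
baseline = KpiSample(0, 3.0, 0.90, 0)
recovering = [KpiSample(m, 3.0 + m, 0.90 + 0.01 * m, 0) for m in range(1, 6)]
flat = [KpiSample(m, 3.0, 0.90, 0) for m in range(1, 6)]
print(should_roll_back(baseline, recovering), should_roll_back(baseline, flat))
```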
6. System Extension Features
- Full Closed-Loop Execution: Add an HTTP node to the Dify workflow that executes the automated steps in the MOP directly by calling the API of the NSMF or RAN controller, achieving a fully automated closed loop from “monitoring” to “repair.”
- Change Management Integration: Add a node that calls the change management system (e.g., ServiceNow) API before execution to create a Change Request automatically and wait for approval before continuing.
- Dynamic SOP Generation: For new types of faults without fixed SOPs, the LLM can dynamically generate a new, safe temporary operating procedure based on its general knowledge and the foundational technical documents in RAGKB.
- Cost and Benefit Analysis: Add a quantitative analysis of the cost (e.g., manpower, instantaneous impact on business) and benefits (e.g., SLA recovery level, long-term stability) of each option to the `EVALUATE_AND_SELECT_OPTION` node, making decisions more data-driven.
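One simple way to approach the cost and benefit extension is a weighted net-benefit score. The sketch below is purely illustrative: the `net_benefit` helper, the weights, and every number in the example are made up to show the mechanics, and all inputs are assumed to be normalized to the range 0 to 1.

```python
def net_benefit(option: dict, weights: dict) -> float:
    """Weighted benefit minus weighted cost; inputs normalized to [0, 1]."""
    benefit = (weights["sla_recovery"] * option["sla_recovery"]
               + weights["stability"] * option["stability"])
    cost = (weights["manpower"] * option["manpower"]
            + weights["service_impact"] * option["service_impact"])
    return benefit - cost


# Illustrative weights and estimates, not measured values.
weights = {"sla_recovery": 0.4, "stability": 0.2, "manpower": 0.1, "service_impact": 0.3}
options = {
    "Trigger SON": {"sla_recovery": 0.9, "stability": 0.8, "manpower": 0.1, "service_impact": 0.2},
    "Manual re-planning": {"sla_recovery": 0.9, "stability": 0.9, "manpower": 0.6, "service_impact": 0.3},
    "Adjust power": {"sla_recovery": 0.5, "stability": 0.4, "manpower": 0.3, "service_impact": 0.5},
}
best = max(options, key=lambda name: net_benefit(options[name], weights))
print(best)
```

Keeping the weights in a separate dictionary would let operators tune the trade-off without touching the scoring logic.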
7. Data Structures Involved in the Case
AIOps Data Platform – RCA Report Data Structure
Description: This is the core input of this workflow, defining the structured report output by the diagnostics agent.
| Field Name | Display Name | Data Type | Comments |
|---|---|---|---|
| `RcaID` | RCA Report ID | `String` | Primary key, e.g., “RCA-NSI-…-dfc2a198”. |
| `AlertID` | Associated Alarm ID | `String` | |
| `RootCause` | Root Cause | `String` | Standardized root cause, such as “PCI_COLLISION”. |
| `Domain` | Domain | `String` (Enum) | `RAN`, `Transport`, `Core`. |
| `ConfidenceScore` | Confidence | `Float` | A float between 0 and 1. |
| `AffectedElements` | Affected Network Elements | `Array[String]` | List of affected key network element IDs. |
| `EvidenceChain` | Evidence Chain | `Array[Object]` | Each object contains evidence items, sources, and details. |
CREATE TABLE `tabAIOpsRcaReports` (
  `name` varchar(140) NOT NULL,
  `creation` datetime(6) DEFAULT NULL,
  `modified` datetime(6) DEFAULT NULL,
  `RcaID` varchar(140) DEFAULT NULL,
  `AlertID` varchar(140) DEFAULT NULL,
  `RootCause` varchar(140) DEFAULT NULL,
  `Domain` varchar(140) DEFAULT NULL,
  `ConfidenceScore` decimal(5, 4) DEFAULT NULL,
  `ReportData` json DEFAULT NULL, -- Contains AffectedElements and EvidenceChain
  PRIMARY KEY (`name`),
  UNIQUE KEY `rca_id_unique` (`RcaID`)
) ENGINE=InnoDB;
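To make the split between scalar columns and the `ReportData` JSON column concrete, here is a minimal sketch that maps a report object onto a row for this table. The `to_row` helper is hypothetical, and reusing `RcaID` as the `name` primary key is an assumption, since the article does not say how `name` is populated.

```python
import json

VALID_DOMAINS = {"RAN", "Transport", "Core"}


def to_row(report: dict) -> dict:
    """Split an RCA report into scalar columns plus the ReportData JSON column."""
    score = round(float(report["ConfidenceScore"]), 4)  # fits decimal(5, 4)
    if not 0.0 <= score <= 1.0:
        raise ValueError("ConfidenceScore must be between 0 and 1")
    if report["Domain"] not in VALID_DOMAINS:
        raise ValueError(f"unknown Domain: {report['Domain']}")
    return {
        "name": report["RcaID"],  # assumption: RcaID doubles as the row key
        "RcaID": report["RcaID"],
        "AlertID": report.get("AlertID"),
        "RootCause": report["RootCause"],
        "Domain": report["Domain"],
        "ConfidenceScore": score,
        "ReportData": json.dumps({
            "AffectedElements": report.get("AffectedElements", []),
            "EvidenceChain": report.get("EvidenceChain", []),
        }, ensure_ascii=False),
    }


row = to_row({
    "RcaID": "RCA-NSI-eMBB-Stadium-01-dfc2a198",
    "RootCause": "PCI_COLLISION",
    "Domain": "RAN",
    "ConfidenceScore": 0.95,
    "AffectedElements": ["Cell-101", "Cell-外部-205"],
})
print(row["ReportData"])
```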
Resource Management (RM) – Available PCI Resource Data Structure
Description: Defines the API return format for querying available PCI resources.
| Field Name | Display Name | Data Type | Comments |
|---|---|---|---|
| `RegionID` | Region ID | `String` | ID of the geographical or network area queried. |
| `AvailablePCIs` | Available PCI List | `Array[Integer]` | List of currently unallocated PCIs in that area. |
| `LastUpdated` | Last Updated Time | `Timestamp` | |
(For other data structures such as slice KPI, service definitions, alarm data, etc., please refer to previous agent documents, which will not be repeated here.)
8. Case Source Code
Dify Workflow YAML Source Code
# dify-workflow-telecom-monitor.yml
# Follow this public account (AI Trainer) and send a private message to obtain the YAML source code of this Dify workflow.
Workflow Code Node Code (`Code-EVALUATE_AND_SELECT_OPTION.py`)
import json


def evaluate_and_select(llm_options_str: str, son_status_str: str, pci_availability_str: str) -> dict:
    """
    Evaluates potential solutions and selects the best one.
    """
    try:
        # Assume the LLM returns a simple, newline-separated list; skip blank lines
        options = [line.strip() for line in llm_options_str.splitlines() if line.strip()]
        son_enabled = json.loads(son_status_str).get("son_pci_optimization_status") == "enabled"
        # Assumes the RM response follows the AvailablePCIs structure described in section 7
        pci_available = bool(json.loads(pci_availability_str).get("AvailablePCIs"))
        scores = {}
        for option in options:
            lowered = option.lower()  # match English keywords case-insensitively
            score = 0
            if "SON" in option:
                score = 100 if son_enabled else 0  # Highest score if SON is available
            elif "手动" in option or "manual" in lowered:  # "手动" means "manual"
                score = 70 if pci_available else 0  # Second-best choice; needs a free PCI
            elif "功率" in option or "power" in lowered:  # "功率" means "power"
                score = 30  # Adjusting power is the last resort because it affects coverage
            scores[option] = score
        # max() raises ValueError on an empty option list; handled below
        best_option = max(scores, key=scores.get)
        return {
            "best_option": best_option,
            "decision_reason": f"SON is {'enabled' if son_enabled else 'disabled'}. Scores: {scores}. Selected '{best_option}'."
        }
    except Exception as e:
        # Surface any parsing or scoring failure to the IF/ELSE node
        return {"best_option": "ERROR", "decision_reason": str(e)}


def main(llm_options: str, son_status: str, pci_availability: str) -> dict:
    # Entry point called by the Dify Code node
    return evaluate_and_select(llm_options, son_status, pci_availability)
Workflow Prompt (`Prompt-LLM-GENERATE_OPTIMIZATION_PLAN.md`)
# Role and Goal
You are a top-tier Network Operations Automation Engineer. Your mission is to create a formal, safe, and executable Method of Procedure (MOP) based on a selected optimization strategy and a standard operating procedure (SOP).
# Core Task
Synthesize the provided inputs to generate a complete and professional "Network Optimization Plan". The plan must be ready for execution by either an automated system or a human engineer.
# Critical Output Constraint
Your output MUST be a single, well-formatted Markdown document following the precise structure provided in the example.
---
### Input Data for MOP Generation
#### 1. Selected Optimization Strategy
- **Strategy**: {{#EVALUATE_AND_SELECT_OPTION.best_option#}}
- **Reasoning**: {{#EVALUATE_AND_SELECT_OPTION.decision_reason#}}
#### 2. Standard Operating Procedure (from RAGKB)
- **SOP Details**:
{{#RETRIEVE_REMEDIATION_SOP.result#}}
#### 3. Context from RCA Report
- **RCA Report**:
{{#GET_RCA_REPORT.body#}}
---
### Your Task: Generate the Network Optimization Plan (MOP)
Using your expertise, create the formal MOP. The plan must include:
1. **Header**: MOP ID, Related RCA, Priority.
2. **Objective**: What the plan aims to achieve.
3. **Pre-checks**: Steps to verify before execution.
4. **Safety Precautions**: Mandatory safety notes.
5. **Execution Steps**: Detailed, numbered steps, including specific API calls or commands.
6. **Post-checks**: Steps to verify success after execution.
7. **Rollback Plan**: Clear instructions on how to revert the change if needed.
Generate the report now.
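Since the prompt requires a fixed set of sections, the generated MOP can be checked before it is shown or executed. The following is a minimal sketch of such a check; `missing_sections` and the marker strings are assumptions based on the section names listed in the prompt, and a plain substring match is only a coarse safeguard.

```python
REQUIRED_MARKERS = (
    "MOP ID",
    "Pre-checks",
    "Safety Precautions",
    "Execution Steps",
    "Post-checks",
    "Rollback Plan",
)


def missing_sections(mop_markdown: str) -> list:
    """Return the required section markers that do not appear in the MOP text."""
    text = mop_markdown.lower()
    return [marker for marker in REQUIRED_MARKERS if marker.lower() not in text]


# Illustrative, deliberately incomplete LLM output.
draft = """
MOP ID: MOP-PCI-OPT-Cell-外部-205-0f3b4c1a
## Pre-checks
## Execution Steps
## Post-checks
"""
print(missing_sections(draft))
```

A non-empty result could be used to re-prompt the LLM or to route the plan to manual review.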