Innovative Practices in Building System Stability Assurance

Source: Financial Electronicization

Article by / ICBC Fintech Research Institute

In the fintech era, financial products and service models change quickly, and traditional monolithic architectures can no longer keep up with business demands. The banking industry has therefore adopted cloud computing and distributed technologies widely to accelerate its digital transformation and support the rapid development of financial services. As these technologies are applied in depth, application architectures grow more complex and transaction volumes keep rising. How to keep business running safely and stably while accelerating digital transformation is an important question for the construction of financial IT systems.

In carrying out its IT architecture transformation plan, ICBC has drawn on its own circumstances and on years of digital transformation and large-scale production practice to explore how a financial enterprise can build a system stability assurance system.

Overview of the System Stability Assurance System

System stability is the property that allows a system to return to its original equilibrium after an external disturbance. Building stability assurance capability is a systematic undertaking that must be designed and implemented across several dimensions of enterprise systems, including architecture, capacity, operation and maintenance, and security.

Since 2014, ICBC has researched distributed and cloud computing technologies, independently developed distributed technology platforms and cloud computing platforms, and built an open-platform core banking system on top of them, effectively supporting the rapid development of financial services. As the architecture transformation deepens, ICBC’s requirements for system stability assurance have risen. Building on its existing stability solutions, ICBC uses technologies such as cloud-native observability, chaos engineering, intelligent production verification, and intelligent inspection to create a system stability assurance system, further strengthening business continuity and promoting high-quality development in the industry.

ICBC’s System Stability Assurance System Construction

As the IT architecture transformation has deepened, ICBC has applied distributed technology on the open platform at scale and integrated it thoroughly. More than 130,000 servers are currently connected in production and more than 30,000 services are deployed, with an average of 15 billion service calls per day and a peak of 200,000 transactions per second. New technologies such as cloud computing and distributed systems have made the infrastructure more complex, and unpredictable user behaviors and events interact with one another. This places higher demands on the reliability of system and application architectures and poses challenges to business stability.

Against this background, ICBC has built a system stability assurance system (as shown in Figure 1) on top of existing stability solutions such as dual-active deployment, self-healing, and graceful startup/shutdown, in order to improve the stability of its information systems methodically. First, for observability, it has built multi-dimensional capabilities around the three pillars of logs, metrics, and traces, speeding up fault identification, diagnosis, and handling. Second, it conducts chaos engineering drills to improve system high availability and ensure that stable financial services are provided to customers. Third, it has built an intelligent production verification platform that automates and intelligently controls the entire production verification process for application systems, ensuring smooth production changes. Fourth, it has established an intelligent inspection center in which inspection robots automate inspection work, promptly discovering potential risks in application operations and improving application reliability.

Figure 1  Stability Assurance System

1. Building Enterprise-Level Observability

To build its observability capabilities, ICBC draws on advanced industry experience to turn monitoring and operations capabilities into shared infrastructure with enterprise-level support. First, a standardized monitoring metadata system unifies monitoring semantics and integrates the three observability data types: metrics, traces, and logs. Second, bytecode enhancement technology breaks down barriers between the organization’s platforms; it has been adapted to a range of frameworks and components, which lowers the cost of onboarding applications while meeting their monitoring needs. Third, monitoring views can be personalized for different dimensions and scenarios, providing observability from the global, campus, unit, application, group, service, single-machine, and gray-release perspectives. Fourth, business-level end-to-end monitoring is built by propagating business tags along the call chain, collecting the monitoring information for each business link and its metrics, and deriving a business link topology through aggregation. Linked to key business indicators, this topology yields observability views organized around business scenarios and accelerates the shift of the monitoring and operations system from an application perspective to a business perspective.

Fifth, the experience of operation and maintenance experts is combined with AIOps capabilities. Flexible, customizable alarm services help discover faults quickly. Diagnostic rules for databases, node health, and critical resources can be customized visually so that an application’s key paths can be checked quickly, and high-density snapshot services preserve the complete scene for later problem review. An emergency expert database matches emergency strategies to diagnostic results and links to the disaster recovery and cloud platforms to execute emergency actions and verify the outcome. Fault emergency plans, execution status, and post-recovery diagnostic results can be viewed in real time and reviewed afterward.
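
The business-tag aggregation described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not ICBC’s implementation: the `Span` record, its field names, and the `build_topology` function are all hypothetical, and real spans would come from the tracing pipeline rather than a hard-coded list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One service call record; all field names are hypothetical."""
    business_tag: str   # business scenario label propagated along the call chain
    caller: str
    callee: str
    latency_ms: float
    ok: bool


def build_topology(spans):
    """Aggregate spans into per-business-tag call edges with basic metrics."""
    edges = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0.0})
    for s in spans:
        e = edges[(s.business_tag, s.caller, s.callee)]
        e["calls"] += 1
        e["errors"] += 0 if s.ok else 1
        e["total_ms"] += s.latency_ms
    return {
        key: {
            "calls": e["calls"],
            "error_rate": e["errors"] / e["calls"],
            "avg_ms": e["total_ms"] / e["calls"],
        }
        for key, e in edges.items()
    }


# Illustrative data only; the tag and service names are made up.
spans = [
    Span("quick_payment", "gateway", "payment", 12.0, True),
    Span("quick_payment", "gateway", "payment", 18.0, False),
    Span("quick_payment", "payment", "ledger", 5.0, True),
]
topology = build_topology(spans)
```

Each key of `topology` is an edge in the business link graph, so a business-perspective view can be drawn by filtering on the first element of the key and attaching the error rate and average latency to each edge.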

By building an enterprise-level observability system, ICBC has achieved second-level real-time monitoring of core businesses such as quick payment, personal settlement, and mobile banking. Custom monitoring alarms have helped applications discover more than 8,000 production and operational risks in time, and the system has assisted applications in completing more than 50,000 automatic diagnostics.

2. Conducting Chaos Engineering Drills

ICBC introduced chaos engineering in 2019 and built a fault drill platform that automates drill steps such as distributing injection media, starting stress tests, injecting faults, and restoring the environment. The platform covers more than 100 types of fault drill capabilities across systems, applications, and containers, and supports a routine fault drill mechanism that hardens system robustness and improves the high availability of application architectures.

ICBC’s chaos engineering drill platform has three main purposes. First, it shields users from differences in application architecture and underlying deployment architecture: the platform wraps the many kinds of underlying facilities behind a unified interface, so users only need to specify the fault to inject. Second, on top of the fault injection tools, it provides core capabilities such as fault orchestration, task scheduling, drill scenario configuration, and automatic generation of drill reports, which make it an enterprise-level platform. Third, it automatically matches the high availability expert database against the application architecture to generate multiple types of faults with one click, which requires the platform to be general, convenient, and to some degree intelligent.

Based on the above purposes, the overall technical framework of ICBC’s chaos drill platform includes infrastructure, underlying capabilities, task scheduling, system integration, and upper-layer business (as shown in Figure 2).

Figure 2  ICBC Chaos Engineering Platform Framework

(1) The infrastructure types include various target resources such as physical machines, virtual machines, and containers.

(2) The underlying capabilities refer to the injection medium, integrating over 100 types of system-level and application-level atomic capabilities (as shown in Figure 3).

Figure 3  ICBC Chaos Platform Fault Injection Capability

(3) Task scheduling is responsible for batch issuing and scheduling tasks orchestrated by users in the frontend.

(4) System integration represents the chaos engineering fault drill platform, integrating multiple core functions such as drill orchestration, monitoring, and expert database (as shown in Figure 4).

Figure 4  ICBC Chaos Engineering High Availability Expert Database

(5) Upper-layer business refers to the ways business systems use the platform in specific application scenarios, such as red-blue offense and defense exercises, routine drills, and application ratings.
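
The layering above (target resources, atomic fault capabilities, and orchestration that always restores the environment) can be sketched roughly as follows. Every name here is hypothetical, and the faults are only simulated in memory; a real platform would call actual injection tooling on the targets.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    """A drill target; 'kind' mirrors the infrastructure types in item (1)."""
    name: str
    kind: str  # "physical", "vm", or "container"
    active_faults: set = field(default_factory=set)


class Fault:
    """Atomic fault capability; subclasses would hide per-target differences."""
    name = "fault"

    def inject(self, target):
        target.active_faults.add(self.name)      # simulated injection

    def recover(self, target):
        target.active_faults.discard(self.name)  # simulated environment recovery


class CpuBurn(Fault):
    name = "cpu_burn"


class NetworkDelay(Fault):
    name = "network_delay"


def run_drill(faults, targets, check):
    """Inject each fault on all targets, run the check, and always recover."""
    report = []
    for fault in faults:
        try:
            for t in targets:
                fault.inject(t)
            passed = check(targets)
        finally:
            for t in targets:
                fault.recover(t)
        report.append({"fault": fault.name, "passed": passed})
    return report


targets = [Target("app-1", "container"), Target("db-1", "vm")]
# Hypothetical steady-state check: pretend the service tolerates CPU pressure
# but not network delay.
report = run_drill(
    [CpuBurn(), NetworkDelay()],
    targets,
    check=lambda ts: all("network_delay" not in t.active_faults for t in ts),
)
```

Putting recovery in a `finally` block reflects the environment-recovery step of a drill: the targets are restored even when the steady-state check itself raises an error.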

Chaos engineering drills have effectively improved ICBC’s ability to keep its systems running stably. To date, the chaos drill platform has been rolled out to 336 business systems within the organization, with nearly 20,000 drill cases in place, and has helped identify 971 deep-seated high availability issues.

3. Building an Intelligent Production Verification Platform

Drawing on advanced experience from the internet finance industry, ICBC has built an intelligent production verification platform that automates and intelligently controls the entire production verification process for application systems, ensuring smooth production changes. The platform uses Jenkins Pipeline as its orchestration engine and Ansible for server management, builds on the cloud-native components of the PaaS cloud platform (Kubernetes, Docker, and Elasticsearch), and supports verification from deployment through operation on both cloud and on-premises nodes. To meet application-specific verification needs, it also interfaces with the organization’s major technical platforms, covering verification scenarios such as distributed services, big data, distributed batch processing, and the PaaS cloud platform, and verifies core functions during changes (as shown in Figure 5).

Figure 5  Intelligent Production Verification Platform
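
As a rough illustration of automated verification during a change, the sketch below runs an ordered list of checks and stops at the first failure. It is not the platform’s actual pipeline and does not use Jenkins or Ansible; the step names, context fields, and thresholds are assumptions made for the example.

```python
def run_verification(steps, context):
    """Run named checks in order; stop at the first failure."""
    results = []
    for name, check in steps:
        ok = bool(check(context))
        results.append((name, ok))
        if not ok:
            break  # later steps are skipped once a check fails
    passed = len(results) == len(steps) and all(ok for _, ok in results)
    return {"passed": passed, "results": results}


# Hypothetical post-change snapshot of an application.
context = {
    "replicas_ready": 3,
    "replicas_expected": 3,
    "health": "UP",
    "error_rate": 0.02,
}

# Hypothetical verification steps, from deployment checks to core function checks.
steps = [
    ("deployment", lambda c: c["replicas_ready"] == c["replicas_expected"]),
    ("health", lambda c: c["health"] == "UP"),
    ("error_rate", lambda c: c["error_rate"] < 0.01),
    ("core_function", lambda c: True),
]

outcome = run_verification(steps, context)
```

With this snapshot the run stops at the `error_rate` step, so `outcome["passed"]` is false and the change would be flagged for follow-up instead of being treated as verified.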

4. Normalized Intelligent Inspection in Production

Traditional production inspections leave room for improvement in their level of automation, in the depth and breadth with which inspection data is used, and in the online tracking and feedback of abnormal inspection items. ICBC has therefore built an intelligent inspection center that inspects core indicators for application containers, distributed services, distributed transactions, and distributed batch processing. Inspection robots automate the inspection work, promptly discovering potential risks in application operations and improving application reliability.

The overall architecture of the inspection center includes a data service layer, orchestration scheduling layer, and presentation layer. The data service layer is responsible for collecting and storing inspection data, encapsulating it into computing services for inspection items, and providing it for use by the orchestration scheduling layer; the orchestration scheduling layer is responsible for orchestrating inspection rules, inspection content, and scheduling inspection tasks; the presentation layer is responsible for displaying inspection results to track and manage abnormal issues discovered during inspections, forming a closed loop for issues.
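
The three layers can be sketched as three small functions. This is a minimal illustration, not the inspection center’s code: the function names, metric fields, rule names, and the 80% CPU threshold are all assumptions, and the data is hard-coded where a real data service layer would query stored inspection data.

```python
def collect_metrics():
    """Data service layer: return inspection data (hard-coded for the sketch)."""
    return {
        "app-a": {"cpu_pct": 91.0, "health_ok": True},
        "app-b": {"cpu_pct": 35.0, "health_ok": False},
        "app-c": {"cpu_pct": 40.0, "health_ok": True},
    }


# Hypothetical inspection rules: (rule name, predicate that is true when abnormal).
RULES = [
    ("high_cpu", lambda m: m["cpu_pct"] > 80.0),
    ("health_check_failed", lambda m: not m["health_ok"]),
]


def run_inspection(metrics, rules):
    """Orchestration scheduling layer: apply every rule to every application."""
    findings = []
    for app in sorted(metrics):
        for rule_name, is_abnormal in rules:
            if is_abnormal(metrics[app]):
                # "open" marks an issue that still needs tracking to closure.
                findings.append({"app": app, "rule": rule_name, "status": "open"})
    return findings


def render(findings):
    """Presentation layer: format findings for display and follow-up."""
    return [f"{f['app']}: {f['rule']} [{f['status']}]" for f in findings]


findings = run_inspection(collect_metrics(), RULES)
lines = render(findings)
```

The `status` field stands in for the closed loop mentioned above: a tracking system would move each finding from open to resolved once the abnormal item has been handled.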

To date, the inspection center has helped applications discover more than 100,000 abnormalities, such as high container CPU usage, failed health checks, and service runtime errors in production, effectively improving application reliability by identifying hidden risks in advance.

Conclusion and Outlook

By building the system stability assurance system, ICBC has effectively reduced both the probability of production failures and the scope of their impact, strengthening business continuity while minimizing business impact and asset losses. System stability is a fundamental requirement of product capability. ICBC will comprehensively summarize its experience in building the stability assurance system, continue to explore areas such as production traffic recording and playback and full-link stress testing, and engage in exchanges and cooperation with peers to keep systems safe and stable and to accelerate the pace of digital transformation.

(Column Editor: Zhang Lixia)


New Media Center: Director / Kuang Yuan Editor / Fu Tiantian Zhang Jun Tai Siqi
