How SoCs Balance Ultra-High Computing Power and Ultra-High Safety in Heterogeneous Computing Architectures (Case Studies of Orin and Horizon Journey 5)

Contact me for automotive-grade chip business consultation👆🏻

In the field of smart driving chips, the coexistence of ultra-high computing power (TOPS) and ultra-high functional safety (ASIL-B/D) is one of the core challenges in high-performance SoC design. Automotive-grade AI chips represented by NVIDIA Orin and Horizon Journey 5 pursue both performance and safety through the architectural concept of “heterogeneous computing + safety isolation”.

This article will delve into the technical implementation paths from several dimensions.

The Relationship Between Safety Levels and Autonomous Driving

ASIL (Automotive Safety Integrity Level) is the functional safety level defined by the ISO 26262 standard, ranging from A (lowest) to D (highest).

L2/L2+ assisted driving typically requires ASIL-B;

High-level autonomous driving systems (such as automatic lane change, urban NOA) require ASIL-D level safety assurance.

Therefore, chips must not only be “fast” but also “stable”—able to operate reliably or safely degrade under extreme conditions.

Core Contradiction: Why is it difficult to balance ultra-high computing power and ultra-high safety?

1. Complexity Paradox: Higher computing power usually means a larger die, more transistors, and a more complex design. Each transistor and each connection is a potential failure point, and a complex system is harder to analyze and verify comprehensively for safety.

2. Conflict Between Generality and Determinism: To pursue high performance, general computing units (such as CPUs and GPUs) often adopt aggressive designs like out-of-order execution, deep pipelines, and complex cache hierarchies to enhance throughput. However, these optimizations introduce behavioral uncertainty and timing unpredictability, which contradict the functional safety requirements of “deterministic behavior” and “predictable response times”.

3. Power Consumption and Thermal Management: High performance leads to high power consumption and heat generation, which can affect the long-term reliability and lifespan of components and increase the risk of failure.

4. Coupling of Software and Hardware: A complex software stack running on complex hardware makes failure propagation paths difficult to trace, which complicates safety analysis.

NVIDIA Orin

NVIDIA Orin is a representative of the fusion of high performance and safety islands.

Architecture Overview

Main Computing Unit: 12-core ARM Cortex-A78AE (Hercules) CPU + Ampere GPU + DL/TL accelerators, providing 254 TOPS of AI computing power.

Safety Mechanisms: Deep integration of the Functional Safety Island (FSI) and lockstep cores

Independent Physical Isolation: Orin physically/logically isolates the high-performance computing domain (GPU/NPU) from the functional safety domain (Safety Island). FSI includes 4 pairs of Cortex-R52 dual-core lockstep (DCLS) cores (a total of 8 physical cores), with independent voltage rails, oscillators, PLLs, and dedicated SRAM.

This design keeps the safety island isolated from the main CPU (12-core Cortex-A78AE), the GPU (Ampere architecture), and other high-performance modules. The safety island monitors the status of the main system in real time, performs fault detection, and executes the safety response.

Lockstep Mechanism: The two R52 cores of each pair execute the same instructions synchronously and their results are compared. Any discrepancy triggers a safety action (such as rebooting or degrading).
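The compare-and-react idea behind lockstep can be illustrated in software. The following is only a minimal sketch: real DCLS compares core outputs in hardware on every cycle, and the function names (lockstep_execute, healthy_core, faulty_core) are hypothetical, not part of any NVIDIA interface.

```python
from typing import Callable, Optional


def lockstep_execute(primary: Callable[[int], int],
                     shadow: Callable[[int], int],
                     operand: int) -> Optional[int]:
    """Run the same operation on two 'cores' and compare the results.

    Returns the result when both agree, or None to signal that a
    safety action (for example a reset or degraded mode) is needed.
    """
    result_a = primary(operand)
    result_b = shadow(operand)
    if result_a != result_b:
        return None  # mismatch: a hardware comparator would raise a fault here
    return result_a


def healthy_core(x: int) -> int:
    return x * 2 + 1


def faulty_core(x: int) -> int:
    # Simulates a single bit flip in the result of an otherwise identical core.
    return healthy_core(x) ^ 0b100


agreed = lockstep_execute(healthy_core, healthy_core, 20)
mismatch = lockstep_execute(healthy_core, faulty_core, 20)
```

Lockstep detects a divergence but cannot tell which core is wrong, which is why the reaction is a reset or degradation rather than a correction.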

ECC Memory Protection: All critical SRAM and caches support error correction codes (ECC).
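To show what an error correction code does, here is a sketch of the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. It is an assumption for illustration only: the article does not say which code Orin uses, and production SRAM ECC typically protects much wider words and also detects double-bit errors.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code: List[int]) -> Tuple[List[int], int]:
    """Correct at most one flipped bit; return (data bits, syndrome)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
corrupted = list(stored)
corrupted[5] ^= 1  # simulate a soft error at position 6
recovered, position = hamming74_decode(corrupted)
```

The syndrome doubles as a diagnostic: a nonzero value is exactly the kind of corrected-error event that a safety monitor can count.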

Hardware Watchdog + Fault Injection Testing Interface: Supports automakers in safety validation.

Decoupling Safety and Performance

The safety domain does not participate in AI inference. It is responsible only for monitoring, diagnostics, and controlling outputs, so safety logic does not slow down the performance path.

Dynamic Fault Monitoring: The Hardware Safety Manager (HSM) collects error signals from the various SoC modules in real time (such as memory ECC errors and clock anomalies) and signals an external safety MCU through the SOC_ERROR GPIO so that it can respond.
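The collect-and-escalate pattern described here can be sketched as follows. This is a hypothetical model, not NVIDIA's implementation: the Severity levels, the ErrorCollector class, and the policy of asserting the output only for uncorrected errors are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Severity(Enum):
    CORRECTED = 1    # for example, a single-bit ECC error that was fixed
    UNCORRECTED = 2  # for example, a clock anomaly or uncorrectable ECC error


@dataclass
class ErrorCollector:
    """Aggregates error reports and drives one external error line."""
    log: List[Tuple[str, Severity]] = field(default_factory=list)
    soc_error_asserted: bool = False  # stands in for the SOC_ERROR pin

    def report(self, source: str, severity: Severity) -> None:
        self.log.append((source, severity))
        # Assumed policy: only uncorrected errors are escalated off-chip.
        if severity is Severity.UNCORRECTED:
            self.soc_error_asserted = True


collector = ErrorCollector()
collector.report("sram_ecc", Severity.CORRECTED)
after_corrected = collector.soc_error_asserted
collector.report("pll_clock", Severity.UNCORRECTED)
after_uncorrected = collector.soc_error_asserted
```

Funnelling every error source into a single line keeps the interface to the external MCU simple; the MCU can then read the detailed log over a separate channel.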

FSI also integrates CHSM (Cryptographic Hardware Security Module), supporting SecOC (Secure Onboard Communication) and hardware-level key management, meeting the ISO/SAE 21434 cybersecurity standard.

Computing Power Allocation Strategy: FSI provides approximately 10K DMIPS of ASIL-D computing power for running an AUTOSAR Classic operating system and handling critical tasks such as brake control and sensor data validation, while the main CPU and GPU focus on AI inference (254 TOPS@INT8) and multi-sensor fusion. This division of labor frees up the main computing resources and also keeps the safety functions real-time.

Operating System and Task Scheduling

Microkernel Architecture: Orin’s DRIVE OS uses an embedded virtualization platform to achieve functional isolation. The safety island runs the AUTOSAR CP (Classic Platform) real-time operating system and the main CPU runs AI tasks on AUTOSAR AP (Adaptive Platform), so that safety tasks have higher priority than compute-intensive tasks.

Self-Healing Mechanism: FSI runs periodic self-checks and tracks hardware status with error counters. When an unrecoverable fault is detected, it triggers a system reset or switches to a backup safe state, for example switching to redundant sensors when sensor transmission is interrupted.
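A periodic self-check with error counters might look like the sketch below. The threshold of three consecutive failures, the state names, and the SelfCheckMonitor class are illustrative assumptions; the article gives no actual thresholds or state machine.

```python
from collections import Counter


class SelfCheckMonitor:
    """Tracks consecutive self-check failures per component."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.error_counts: Counter = Counter()
        self.state = "NORMAL"

    def record_self_check(self, component: str, passed: bool) -> str:
        if passed:
            self.error_counts[component] = 0  # transient fault cleared
        else:
            self.error_counts[component] += 1
            if self.error_counts[component] >= self.threshold:
                # Treated as unrecoverable: latch the safe state.
                self.state = "SAFE_STATE"
        return self.state


monitor = SelfCheckMonitor(threshold=3)
monitor.record_self_check("sensor_link", False)
monitor.record_self_check("sensor_link", False)
after_recovery = monitor.record_self_check("sensor_link", True)
for _ in range(3):
    final_state = monitor.record_self_check("sensor_link", False)
```

Resetting the counter on a passing check distinguishes transient glitches from persistent faults, so a single disturbance does not force the system into its fallback.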

Horizon Journey 5 (J5)

High-Efficiency Safety Design Under Software-Hardware Collaboration

J5 has 8 ARM A55 cores and 2 BPU inference units (computing power 128 TOPS), and integrates high-speed digital signal interfaces such as LPDDR4/4X, PCIe Gen3, RGMII, CSI-2 MIPI RX/TX.

Architecture Highlights

BPU (Bayesian Architecture): A dedicated AI acceleration unit, 128 TOPS, with an energy efficiency ratio of up to 4.27 TOPS/W.
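Read together, the two figures imply a power budget. The arithmetic below assumes the efficiency figure was measured at peak throughput, which the text does not state, so the result is only a rough estimate.

```python
# Figures quoted in the article for the Journey 5 BPU.
peak_tops = 128.0
tops_per_watt = 4.27

# Implied power at peak throughput, under the assumption stated above.
implied_power_w = peak_tops / tops_per_watt
print(f"Implied power at peak: {implied_power_w:.1f} W")
```

That works out to roughly 30 W.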

CPU Configuration: 8-core Cortex-A55 (general computing) + dual-core Cortex-R5F Lockstep (safety core).

Safety Island Design:

Integrates an independent safety subsystem built on a dual-core lockstep MCU: the safety island adopts a Cortex-MStar dual-core lockstep architecture, integrating 64KB ITCM, 32KB DTCM, and a hardware encryption engine, and supports runtime built-in self-test (BIST) and watchdog timers.

Although the chip itself is ASIL-B certified, it can meet ASIL-D requirements through system-level redundancy (such as dual Journey 5 chips).
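Building an ASIL-D function from lower-rated elements corresponds to what ISO 26262-9 calls ASIL decomposition. The sketch below encodes the decomposition schemes listed in the standard; the helper name is_valid_decomposition is hypothetical, and the check ignores the standard's further requirement that the redundant elements be sufficiently independent.

```python
# Decomposition schemes from ISO 26262-9; each pair is unordered.
ASIL_DECOMPOSITIONS = {
    "D": [("C", "A"), ("B", "B"), ("D", "QM")],
    "C": [("B", "A"), ("C", "QM")],
    "B": [("A", "A"), ("B", "QM")],
    "A": [("A", "QM")],
}


def is_valid_decomposition(target: str, first: str, second: str) -> bool:
    """Check whether two element ratings form a listed scheme for the target."""
    pair = tuple(sorted((first, second)))
    return any(tuple(sorted(scheme)) == pair
               for scheme in ASIL_DECOMPOSITIONS[target])


dual_b = is_valid_decomposition("D", "B", "B")
b_plus_a = is_valid_decomposition("D", "B", "A")
```

Two ASIL-B channels are a listed scheme for an ASIL-D requirement, whereas B plus A is not. Two identical chips still have to address common-cause failures before the decomposition argument holds.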

Functional Partitioning and Resource Isolation: The safety island independently manages real-time control tasks (such as steering signal monitoring), while the main CPU (8-core Cortex-A55) and BPU (128 TOPS computing power) handle visual perception and path planning. Memory protection units (MPU) and firewall mechanisms prevent failures in high-computing modules from spreading to the safety island.
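The role of an MPU or bus firewall can be sketched as an address-range permission check. The address map, the initiator names, and the default-deny policy below are hypothetical and chosen only to illustrate the mechanism; they do not describe Journey 5's real memory map.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Region:
    start: int
    end: int  # exclusive
    allowed: FrozenSet[str]  # bus initiators permitted to access the region


def access_allowed(regions: List[Region], initiator: str, address: int) -> bool:
    """Return True only if a region covers the address and lists the initiator."""
    for region in regions:
        if region.start <= address < region.end:
            return initiator in region.allowed
    return False  # default deny for unmapped addresses


# Hypothetical map: private safety-island memory plus shared DDR.
regions = [
    Region(0x0000_0000, 0x0001_0000, frozenset({"safety_island"})),
    Region(0x8000_0000, 0xC000_0000, frozenset({"safety_island", "main_cpu", "bpu"})),
]

island_private = access_allowed(regions, "safety_island", 0x0000_0100)
bpu_into_island = access_allowed(regions, "bpu", 0x0000_0100)
bpu_into_ddr = access_allowed(regions, "bpu", 0x8000_1000)
```

Because the check depends only on who issues the access and where it lands, a runaway task on the BPU or main CPU cannot overwrite safety-island memory regardless of what the software does.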

Enhanced Cybersecurity: Journey 5 integrates a hardware accelerator for China’s national commercial cryptographic (SM-series) algorithms and an ARM TrustZone trusted execution environment, which divides the chip into secure and non-secure areas to prevent unauthorized access. It also supports integrity checks for secure boot and OTA upgrades.
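An integrity check for a boot or OTA image reduces, at its simplest, to comparing a digest of the image with a trusted reference. The sketch below uses SHA-256 from the Python standard library as a stand-in, since the article does not specify the algorithm; a real secure boot chain would also verify a signature over the digest against a key anchored in hardware. The function names and the firmware bytes are hypothetical.

```python
import hashlib
import hmac


def image_digest(image: bytes) -> bytes:
    """Compute the digest of a firmware image (SHA-256 as a stand-in)."""
    return hashlib.sha256(image).digest()


def verify_image(image: bytes, trusted_digest: bytes) -> bool:
    """Compare digests in constant time to avoid leaking where they differ."""
    return hmac.compare_digest(image_digest(image), trusted_digest)


firmware = b"hypothetical firmware payload v1"
trusted = image_digest(firmware)  # in practice provisioned in a signed manifest
tampered = firmware + b"\x00"

genuine_ok = verify_image(firmware, trusted)
tampered_ok = verify_image(tampered, trusted)
```

An update that fails the check is rejected before it is ever executed, which keeps a corrupted or malicious image out of the boot path.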

Software-Hardware Integrated Optimization

Horizon emphasizes “algorithm-compiler-chip” collaborative design:

Compilers can automatically insert redundancy checks and critical path protections;

Runtime systems (RTOS) support priority scheduling for safety tasks;

In the HSD (Horizon SuperDrive) full-stack solution, the perception-planning-control link is included in the safety monitoring closed loop.

Journey 5’s dual backup strategy: The safety island works in coordination with an external MCU, exchanging signals via SPI. If the safety island does not respond within the timeout, the external MCU directly takes over vehicle control, ensuring ASIL-D level failure protection.
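The timeout-based takeover can be sketched from the external MCU's point of view. The 50 ms timeout, the HeartbeatSupervisor class, and the latched takeover behavior are assumptions for illustration; the article gives neither the actual timeout nor the SPI protocol.

```python
class HeartbeatSupervisor:
    """External-MCU view: take over if the safety island goes silent."""

    def __init__(self, timeout_ms: int) -> None:
        self.timeout_ms = timeout_ms
        self.last_heartbeat_ms = 0
        self.takeover = False

    def on_heartbeat(self, now_ms: int) -> None:
        # Called whenever a valid response arrives (over SPI in the article).
        self.last_heartbeat_ms = now_ms

    def poll(self, now_ms: int) -> bool:
        if now_ms - self.last_heartbeat_ms > self.timeout_ms:
            self.takeover = True  # latched: a late heartbeat does not undo it
        return self.takeover


supervisor = HeartbeatSupervisor(timeout_ms=50)
supervisor.on_heartbeat(now_ms=0)
alive = supervisor.poll(now_ms=40)
supervisor.on_heartbeat(now_ms=40)
still_alive = supervisor.poll(now_ms=80)
timed_out = supervisor.poll(now_ms=200)
```

Latching the takeover avoids control bouncing back and forth between a flaky SoC and the MCU; recovery would then go through an explicit, verified restart.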

How to Balance Computing Power, Energy Efficiency, and Safety?

Achieving high computing power, high energy efficiency, and high safety involves the following technical dimensions; in addition, safety certification and redundancy design are also important measures to ensure safety.

Layered Implementation of Functional Safety Certification

Orin’s Full Process Compliance: The chip itself is certified to ASIL-B for random hardware faults, and the system level has undergone an ASIL-D concept assessment. For example, in the Cruise Robotaxi, Orin is combined with an external safety MCU to form a “main computing + safety monitoring” ASIL-D redundant architecture.

Journey 5’s System-Level Redundancy: A single ASIL-B chip combined with a dual-chip redundant design (such as the GAC Starling architecture) achieves ASIL-D level safety through cross-checking and voting mechanisms. For example, in automatic parking scenarios, two Journey 5 chips process the ultrasonic sensor data simultaneously, and emergency braking is triggered when their results are inconsistent.
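With only two channels, cross-checking amounts to a comparison: the channels cannot outvote each other, so any disagreement leads to the safe reaction. The sketch below assumes each chip reports an obstacle distance in meters; the 0.10 m tolerance and the function name cross_check are hypothetical.

```python
def cross_check(distance_a_m: float, distance_b_m: float,
                tolerance_m: float = 0.10) -> str:
    """Compare two independently computed obstacle distances."""
    if abs(distance_a_m - distance_b_m) > tolerance_m:
        # The two channels disagree and neither can be trusted.
        return "EMERGENCY_BRAKE"
    return "CONTINUE"


consistent = cross_check(1.50, 1.52)
inconsistent = cross_check(1.50, 0.80)
```

A tolerance is needed because two chips sampling the same sensors never produce bit-identical results; choosing it trades false alarms against missed faults.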

Dynamic Balance of Computing Power and Safety

Orin’s Energy Efficiency Optimization: FSI adopts a 16nm process, with a power consumption of only 2-3W, while the main CPU and GPU adjust their operating point to the workload through dynamic voltage and frequency scaling (DVFS). For example, during urban autonomous driving, the GPU computing power is released to 254 TOPS, while during highway cruising, it is reduced to 150 TOPS to lower heat generation.
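The benefit of DVFS follows from the standard approximation that dynamic power scales with the square of the supply voltage times the clock frequency. The sketch below applies it to the 254 to 150 TOPS example, assuming throughput is proportional to clock frequency; the 0.9 voltage scale is purely illustrative and not an Orin figure.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float) -> float:
    """Dynamic power relative to nominal, using P ~ C * V^2 * f."""
    return voltage_scale ** 2 * frequency_scale


# Assumption: throughput scales linearly with clock frequency.
frequency_scale = 150 / 254

same_voltage = relative_dynamic_power(1.0, frequency_scale)
lower_voltage = relative_dynamic_power(0.9, frequency_scale)
print(f"frequency only: {same_voltage:.2f}x, with 10% lower voltage: {lower_voltage:.2f}x")
```

Lowering the frequency alone saves power only in proportion to the lost throughput; the larger gain comes from the lower voltage that the reduced frequency permits.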

Journey 5’s Near-Memory Computing: BPU adopts a Bayesian architecture, deploying AI models in on-chip SRAM to reduce external DDR access latency. The safety island performs cryptographic operations through hardware accelerators (such as TRNG), avoiding the consumption of main CPU resources.

Challenges and Trends

Cost vs Safety: ASIL-D certification requires a significant amount of verification work (investment of tens of millions of dollars), which is difficult for small and medium-sized manufacturers to bear.

Computing Power Redundancy vs Safety Redundancy: Future L4 systems may require “dual Orin” or “J5 + backup MCU” architectures, bringing power consumption and cost pressures.

Exploration of New Architectures: For example, the BPU of Horizon Journey 6 adopts the “Nash architecture”, which embeds lightweight safety check units in the BPU in an attempt to integrate real-time verification into the AI computing flow and reduce reliance on external safety cores.

Horizon plans to introduce RISC-V security extensions (such as CHERI) in the next generation of chips to lower functional safety development costs through an open-source ecosystem. For example, the safety island could use a RISC-V dual-core lockstep core to achieve lower power consumption for real-time monitoring.

Deep Collaboration Between AI and Safety: NVIDIA is developing real-time threat detection algorithms based on FSI, utilizing the Neon SIMD units of the Cortex-R52 cores to detect anomalies in sensor data, reducing the burden on the main CPU. For example, in the BEVFusion fusion framework, the safety island can filter out interference point clouds from LiDAR in advance.

Predictive Safety Mechanisms: Machine learning-based fault prediction models will be integrated into the safety island. Trained on historical data, they identify potential hardware degradation such as flash wear and trigger redundancy switching in advance. For example, Orin’s HSM can predict the soft error probability of memory units and dynamically adjust the ECC check frequency.

Conclusion

NVIDIA Orin and Horizon Journey 5 achieve ASIL-B/D safety goals at 254 TOPS and 128 TOPS of computing power, respectively, through hardware isolation, software layering, and dynamic redundancy. They represent two paths:

Orin: Meets the dual demands of high-end automakers for scalability and safety with “high computing power platform + strong safety island”;

Journey 5: Achieves a balance of cost-effectiveness and compliance with “high-efficiency AI + lightweight safety core + software collaboration”.

Both demonstrate that ultra-high computing power and ultra-high safety are not opposing forces but goals that can evolve together through architectural innovation. As L3/L4 deployment accelerates, the next generation of chips (such as Orin-X successors, Journey 6P) will further integrate the capabilities of “computation-safety-communication”, pushing smart driving into a new stage of being “both intelligent and reliable”.

—end—
