Top Ten Edge AI Chips

Today, artificial intelligence is permeating almost every edge and embedded market, giving devices more powerful performance and richer functionality, from predictive maintenance in industrial machines to voice activation in household appliances, and supporting more complex computer vision applications and autonomous machines.

Generative artificial intelligence (GenAI) is also coming to edge devices, enabling them to understand and create natural language and so provide a more natural user experience. However, this demands significant computational resources from small devices, which calls for specialized AI chips that can accelerate these workloads without blowing the power budget.

Here are our selected top ten edge AI chips, all currently available on the market. They range from chips capable of running GenAI on edge devices to products designed specifically for vision and ultra-low-power applications.

User Interface

Large language models (LLMs) and GenAI can add natural, intuitive interfaces to systems that have enough compute, memory, and processing capability to respond within the required latency.

Hailo Technologies’ second-generation AI accelerator, the Hailo-10, is designed specifically for GenAI and LLMs at the edge. It is based on the same architecture as the company’s earlier vision-focused Hailo-8, which relies on Hailo’s software to analyze the compute and memory requirements of each layer of a neural network, allocate sufficient resources, and map them as close together as possible to minimize data-transfer distances. The Hailo-10 adds a dedicated memory interface that allows the use of external memory, which is crucial for fast LLM inference.
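
A toy sketch of that mapping idea, purely hypothetical and not Hailo’s actual software, might allocate compute tiles to layers in proportion to their workload and place consecutive layers on adjacent tiles so activations travel the shortest path:

```python
# Toy illustration of per-layer resource mapping (hypothetical, not Hailo's tool).
# Each layer gets compute tiles proportional to its MAC load, and consecutive
# layers land on adjacent tiles to minimize activation-transfer distance.

layers = [  # (name, relative MAC load) -- invented example values
    ("conv1", 2.0), ("conv2", 4.0), ("conv3", 3.0), ("fc", 1.0),
]
TOTAL_TILES = 16

total_load = sum(load for _, load in layers)
cursor = 0
for name, load in layers:
    tiles = max(1, round(TOTAL_TILES * load / total_load))
    tiles = min(tiles, TOTAL_TILES - cursor)  # don't overrun the tile array
    print(f"{name}: tiles {cursor}..{cursor + tiles - 1}")
    cursor += tiles
```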

The Hailo-10 supports 4-, 8-, and 16-bit integer precision, achieving up to 40TOPS in INT4 mode. That is similar to the Hailo-8, but the new memory-access capability makes it better suited to GenAI. Hailo has also improved the efficiency of some common Transformer operators and enhanced support for multimodal applications.

The Hailo-10 can run Llama2-7B at up to 10 tokens per second while drawing less than 5W, or, at the same power, run Stable Diffusion 2.1 and generate an image in under 5 seconds. Although a 7-billion-parameter LLM may seem relatively small today, it is sufficient for user interfaces that only require domain-specific knowledge.
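
A quick back-of-the-envelope check of what those figures imply for energy per generated token:

```python
# Energy per token implied by the figures quoted above.
power_w = 5.0        # upper bound quoted for Hailo-10 running Llama2-7B
tokens_per_s = 10.0  # throughput quoted above

joules_per_token = power_w / tokens_per_s
print(f"~{joules_per_token:.2f} J/token upper bound")  # ~0.50 J/token
```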

Hailo’s Hailo-10 AI accelerator (Source: Hailo Technologies)

AI Assistants

Kinara has launched its second-generation AI accelerator, the Ara-2, aimed at edge servers, laptops, and gaming consoles. The Ara-2 can accelerate models of up to 30 billion INT4 parameters within a 6W power envelope.

Kinara has demonstrated the Ara-2 generating tens of tokens per second running Llama2-7B and completing 20 iterations of Stable Diffusion 1.4 in under 10 seconds. The Ara-2 is optimized for GenAI workloads, including image and text generation on edge servers and edge devices.

The chip is larger than Kinara’s vision-focused first-generation product, but it is also more computationally efficient, with 5 to 8 times the performance. The new core uses a very long instruction word (VLIW) design optimized for AI workloads, which helps avoid load/store bottlenecks. (VLIW is common in AI accelerators because it supports instruction-level parallelism, which suits AI workloads.)

It also adds support for common Transformer activation functions (such as softmax and ReLU), along with INT4 and MSFP16 capabilities. A proprietary compiler handles the data flow.

Edge devices can use local data to add valuable context for AI agents and assistants, giving them context-specific information that helps generate more accurate results; on a laptop, for example, that context could come from the user’s local data. In the gaming-console space, Kinara is pushing to run small LLMs locally to support more realistic, interactive non-player characters.

Kinara’s Ara-2 AI accelerator (Source: Kinara)

Quantization Technology

The core technology of South Korean AI chip startup DeepX is its quantization method, which converts trained models into efficient low-precision versions to speed up inference. Quantization typically costs some accuracy, but DeepX’s method can actually make quantized vision networks more accurate than the original full-precision versions, because it helps the model reduce overfitting, the common problem of a model failing to generalize because it has memorized its training data.
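
DeepX has not disclosed the details of its method, but the baseline that any such scheme builds on, generic symmetric post-training INT8 quantization, works roughly like this minimal sketch:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 post-training quantization (textbook
    scheme for illustration, not DeepX's proprietary method)."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()
print(f"scale={s:.5f}, max abs rounding error={err:.5f}")  # bounded by scale/2
```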

DeepX offers two chips. The DX-V1 is a system-on-chip (SoC) with a 5TOPS neural processing unit (NPU), four RISC-V CPUs, and a 12MP image signal processor (ISP). It is a small SoC for edge devices, priced under 10 dollars and consuming only 1 to 2W; DeepX has demonstrated the V1 running YOLOv7 at 30fps for real-time processing.

The DX-M1 is a more powerful accelerator based on the same NPU architecture, designed to work alongside a separate host CPU. It delivers 25TOPS within a 5W power envelope, suiting industrial PCs and similar applications such as camera systems, drones, and robotics.

The DX-H1 is a quad-M1 card for edge servers and industrial gateways. The current generation supports Transformer encoders but not decoders; the next generation will support Transformers fully.

DeepX’s DX-M1 accelerator (Source: DeepX)

Multi-Camera Data Streams

Axelera AI’s Metis chip features a four-core digital in-memory-computing matrix-vector multiplication accelerator that achieves a peak of 214TOPS (mixed precision/INT8 weights) at an efficiency of 14.7TOPS/W. The Metis AI processing unit’s typical power consumption is 10W.

Metis’s efficiency comes from tightly interleaving memory and compute, along with a small RISC-V CPU in each AI core that manages data flow over memory-mapped I/O and supports hardware acceleration of various activation functions. The four cores can be configured to run different models on different cores, or to cascade models, allowing a large model to be distributed across multiple cores.

Metis is available as an M.2 card with 1GB of DRAM or as a PCIe card. Even these single-chip cards can handle multi-stream inference: Axelera has demonstrated YOLOv5 object detection running on 24 camera streams at a total frame rate of 400fps. Running multiple streams on a single chip helps avoid software complexity.
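
A quick sanity check on those demo figures shows what each camera actually gets:

```python
# Aggregate vs. per-stream frame rate from the demo figures above.
total_fps = 400
streams = 24
print(f"~{total_fps / streams:.1f} fps per stream")  # ~16.7 fps per camera
```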

Coming soon is a four-chip PCIe card rated at 856TOPS, which can aggregate even more camera streams, along with a single-board computer pairing a single Metis chip with a host CPU.

While Metis is primarily used for computer vision applications, it can also run Transformers.

Axelera AI’s Metis chip (Source: Axelera AI)

Complete SoC

As consumer electronics, industrial devices, robotics, and vehicles gradually shift toward large multimodal models (LMMs) and GenAI, SiMa Technologies has built a second-generation chip to meet the demand. The Modalix SoC is optimized for Transformer architectures, including vision and multimodal models at BF16 precision, while still able to run convolutional neural networks (CNNs) and other AI workloads. It features hardware acceleration for piecewise-polynomial activation functions and other nonlinear functions commonly used in LLMs and LMMs.

Modalix is a complete SoC family: alongside the accelerator it includes eight Arm Cortex-A-class CPU cores, so it is designed to run entire applications rather than just the accelerated portion. The CPU cores run the application, make decisions, and serve as a fallback for any operations the accelerator does not support. Modalix will ship in 25, 50, 100, and 200TOPS (INT8) versions, with the 50TOPS version first to market; it can run Llama2-7B at over 10 tokens/s while consuming 8 to 10W. The SoC also integrates an on-chip ISP and a digital signal processor (DSP).

SiMa.ai’s toolchain can automatically quantize different layers for optimal accuracy.

SiMa.ai’s Modalix SoC (Source: SiMa Technologies)

Real-Time Vision

Blaize’s graph streaming processor architecture is designed for graph workloads, including AI and common image-signal-processing functions. The hardware combines stream processing with multithreading: activation data is held in small on-chip buffers and passed directly to the next node. Reducing data transfer between the processor and external memory significantly lowers energy consumption.

As a result, the chip can run YOLOv3 object detection in real time across five camera streams (each inference takes less than 20ms, so all five streams run simultaneously at 10fps). That makes real-time vision practical in industrial and smart-city applications, but Blaize’s architecture also suits automotive driver-assistance systems, retail shelf cameras, and other vision applications.
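
The real-time budget behind those numbers is easy to verify:

```python
# Sanity check on the real-time budget described above.
streams, fps_per_stream = 5, 10
inferences_per_s = streams * fps_per_stream   # 50 inferences/s in total
budget_ms = 1000.0 / inferences_per_s         # 20.0 ms allowed per inference
print(f"{inferences_per_s} inf/s -> {budget_ms:.0f} ms budget each")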

Blaize’s 1600 SoC has 16 cores delivering a total of 16TOPS of INT8 performance at 7W. It comes in several small card formats: as a single-chip accelerator (with up to 4GB of LPDDR4) or, for edge servers and gateways, as a four-chip PCIe card.

AI Accelerators

For small CNN-based vision models, MemryX’s MX3 AI accelerator delivers 5TFLOPS (mixed precision) at only 2W. Like similar solutions, it uses a dataflow, at-memory-compute architecture; each processing engine includes a matrix-multiplication accelerator and a small unit for activations and other operations. Data flows from one engine to the next without leaving the chip for external memory, and memory is the only connection between processing engines (there is no on-chip network). Weights can be INT4, INT8, or INT16, with activations kept in BF16 to maintain overall accuracy.
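
That precision mix, low-precision integer weights with BF16 activations, can be emulated in a few lines; this is an illustrative sketch, not MemryX’s actual kernel:

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Round float32 to bfloat16 precision by dropping the low 16 mantissa
    bits (values stay in float32 storage; emulation only)."""
    i = x.astype(np.float32).view(np.uint32)
    return ((i + 0x8000) & 0xFFFF0000).view(np.float32)

# INT8 weights with a per-tensor scale, BF16 activations -- the mix
# described above (illustrative example values).
w_fp = np.random.randn(128, 64).astype(np.float32)
scale = np.abs(w_fp).max() / 127.0
w_q = np.round(w_fp / scale).astype(np.int8)

x = to_bf16(np.random.randn(64).astype(np.float32))
y = to_bf16((w_q.astype(np.float32) @ x) * scale)  # accumulate, rescale, round
print(y[:4])
```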

For larger models, MemryX offers an M.2 module with four chips (20TFLOPS mixed precision). Models can be distributed across the four devices, for a total power consumption of 8W. The company’s software stack can compile models automatically with one click; MemryX has tested a large number of models from online repositories such as HuggingFace, achieving chip utilization of 50% to 80% without further optimization. MX3 applications include real-time vision and AI.

Always-On AI

For applications requiring ultra-low power, such as always-on keyword detection in battery-powered devices, Syntiant’s NDP250 neural decision processor is an ideal choice. The NDP250 is the third generation of the Syntiant architecture, delivering 30GOPS of INT8 performance at 10 to 100mW.

Typical use cases for Syntiant devices include audio or visual wake words and sensor processing, where a microcontroller (MCU) or the rest of the system is woken only when something of interest is detected; most of the system can stay off until then to save power. The NDP250 has a larger accelerator than previous Syntiant devices, so it can take on somewhat bigger tasks such as automatic speech recognition and text-to-speech, saving energy and improving latency compared with, say, waking a more powerful processor to run an LLM. The NDP250 also supports attention layers, and therefore tiny Transformer networks (under 6 million INT8 parameters).
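
The power saving of this wake-on-detection pattern is easy to estimate: the system’s average draw is the always-on detector plus the main processor weighted by how often it is actually awake. The numbers below are illustrative assumptions, not Syntiant specifications:

```python
# Average-power estimate for an always-on detector gating a bigger processor.
# All numbers are illustrative assumptions, not vendor specifications.
detector_mw = 50.0    # always-on NDP-class detector (10-100mW range above)
main_cpu_mw = 2000.0  # host processor when awake
duty_cycle = 0.01     # host awake 1% of the time

avg_mw = detector_mw + duty_cycle * main_cpu_mw
print(f"average draw ~{avg_mw:.0f} mW vs {main_cpu_mw:.0f} mW always-on")
```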

The chip integrates Syntiant’s accelerator, a HiFi3 DSP for audio feature extraction and signal processing, and an Arm Cortex-M0 core, allowing the chip to run in some applications without a main processor.

The company also supplies AI models via its acquisition of Pilot.ai, and it recently acquired the consumer MEMS microphone business from Bosch.

Application Processors

In the application-processor field, NXP Semiconductors’ i.MX 95 series uses the company’s proprietary Neutron NPU for on-chip AI acceleration. It is a powerful application processor aimed at the automotive, industrial, and IoT markets, with up to six Arm Cortex-A55 CPUs plus an Arm Mali GPU for 3D graphics, an ISP, and the NPU. Typical applications include factory machine vision and, in vehicles, voice alerts, dashboards, and camera systems.

The Neutron NPU is a scaled-up version of the IP used in NXP’s earlier MCX-N MCUs, extending to 2TOPS (INT8). It can run CNNs, RNNs, TCNs, and Transformers. NXP says tests on MobileNet, MobileNet-SSD, and YOLO show the i.MX 95’s Neutron NPU running inference 100 to 300 times faster than the on-chip Cortex-A55 cores alone. The i.MX 95 is supported by NXP’s eIQ software-development environment, which includes tools for dataset management, model selection, and deployment; many third-party tools (such as quantizers) are also available as part of the eIQ flow.

NXP’s i.MX 95 application processor series (Source: NXP)

AI MCU

STMicroelectronics (ST) has launched the STM32N6, its first MCU with a dedicated AI accelerator, providing 600GOPS (INT8) of acceleration, far exceeding other MCU makers’ products (including the 5GOPS of NXP’s MCX-N and roughly 250GOPS of Infineon’s PSoC Edge parts). The MCU can handle applications such as human detection: ST’s demonstration showed a custom YOLO model detecting at up to 314fps, backed by an on-chip imaging pipeline. It is equally good at running smaller models such as anomaly detection.

ST’s self-developed NeuralART accelerator achieves an energy efficiency of up to 3TOPS/W. The STM32N6 also features an Arm Cortex-M55 CPU running at up to 800MHz, the highest frequency of any STM32 device to date, with support for Arm’s Helium vector extensions, and it has the largest RAM in STM32 history, up to 4MB. ST has also integrated high-speed memory interfaces, an ISP, MIPI interfaces, and built-in graphics support.
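
Taken together, the peak rate and the quoted peak efficiency imply the accelerator itself needs only around 0.2W at full throughput:

```python
# Implied accelerator power from the peak-rate and efficiency figures above.
gops = 600.0
gops_per_w = 3000.0  # 3 TOPS/W quoted peak efficiency
print(f"~{gops / gops_per_w:.2f} W at peak")  # ~0.20 W
```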

This MCU will target edge AI applications in the automotive, industrial, and consumer-electronics markets, which are the main markets for the STM32 series today, and it is supported by ST’s mature toolchain: NanoEdge AI Studio is a no-code tool for time-series data using ST’s models, while STM32Cube.AI is used to optimize models and performance.

ST’s STM32N6 MCU (Source: STMicroelectronics)

Author: Sally Ward-Foxton (Editor: Franklin)
