How MCU Companies Can Customize Without High Costs in the AI+MCU Era

In 2018, researchers at MIT (Massachusetts Institute of Technology) published a paper titled “The Decline of Computers as a General Purpose Technology: Why Deep Learning and the End of Moore’s Law are Fragmenting Computing”. The paper predicted that as Moore’s Law slows and the cost of cutting-edge semiconductor manufacturing processes rises, general-purpose computing would struggle to keep meeting demand, while specialized computing would thrive.

This paper now appears to be highly prescient. For instance, in the current data center field, accelerators have begun to encroach on the market for general-purpose processors, and accelerated computing is blossoming in more areas.

However, the paper also explored an important issue: general-purpose processors like CPUs rely on a broad market and high shipment volumes to dilute costs, while specialized processors, aimed at specific application scenarios, will have far lower shipment volumes. Even if specialized processors offer higher performance and better efficiency for specific applications, cost remains a significant barrier.

It also proposed several variables describing when choosing a specialized processor pays off in industries with economies of scale, including how much faster the specialized processor is and what shipment volume it reaches. Interested readers can refer to the summary of this paper published by Electronic Engineering Times a couple of years ago.
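The trade-off can be illustrated with a deliberately simplified break-even calculation. This C sketch is not the paper's model: the function names, the single speedup factor, and the way cost is compared per unit of performance are all assumptions made for illustration.

```c
/* Simplified per-unit cost: the one-off design cost (NRE) spread over all
 * shipped units, plus the recurring manufacturing cost of each unit. */
double unit_cost(double nre, double volume, double per_unit_cost)
{
    return nre / volume + per_unit_cost;
}

/* Shipment volume above which a specialized chip delivers performance more
 * cheaply than a general-purpose one (illustrative model only).
 *   speedup      = specialized performance / general-purpose performance
 *   gp_unit_cost = full per-unit cost of the general-purpose chip, with its
 *                  NRE already diluted by a large volume
 * Returns a negative value if no volume can ever break even. */
double break_even_volume(double spec_nre, double spec_per_unit,
                         double gp_unit_cost, double speedup)
{
    /* Break-even when (spec_nre / v + spec_per_unit) / speedup == gp_unit_cost */
    double margin = gp_unit_cost * speedup - spec_per_unit;
    if (margin <= 0.0)
        return -1.0;
    return spec_nre / margin;
}
```

In this toy model, a larger speedup or a lower NRE both reduce the volume a specialized chip needs before it becomes worthwhile, which is the lever the rest of this article is concerned with.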

On the other hand, Moore’s Law has not completely ended, and the layered structure of the computing stack means there is still significant room for optimization at its upper levels. This is why leading international MCU companies are focusing on “customization”: although MCUs as a whole remain general-purpose, these giants are investing heavily in solutions aimed at specific application markets such as automotive, industrial, and medical. This aligns with today’s “application-oriented”, specialized approach to chip design.


At the recent MCU Ecosystem Development Conference, several speakers mentioned that the MCU market is “intensely competitive” with serious product homogenization. Wang Dajun, CEO of ChipEasy, stated in an interview: “Those large foreign companies allocate resources based on end-user needs: the efficiency of chip design and computing can support customer applications.” “If Chinese MCU companies lack customization capabilities and only produce general-purpose chips, which are general cores + accelerators, they will inevitably engage in price wars. Because competitors can do the same thing, there is no differentiation.”

In the discussion of domain-specific architecture (DSA) chips, microarchitecture customization, and application-oriented design, one potential solution emerges that may point the way forward.

The Dilemma of Chinese MCU Companies: Cost and Homogenization

Statistics indicate that the Chinese MCU market accounts for 25% of the global market, with a CAGR of 7% from 2019 to 2026, which exceeds the global market average. This seems to present a bright outlook.

In reality, however, the main participants in the Chinese MCU market are focused on mid-to-low-end products. While international giants have begun producing Cortex-M85-based parts, there are still very few domestic MCUs based on even the Cortex-M7. In addition, the lack of differentiation in the mid-to-low-end market has led to intense price wars.

Wang Dajun mentioned data from a third-party platform regarding the “R&D expense ratio” of MCUs: that is, MCU R&D expenses ÷ market revenue, as shown in the following image. The R&D expense ratio of domestic MCU manufacturers is about 10% higher than that of leading international companies. A higher R&D expense ratio means lower product profits.

“The technical accumulation of domestic MCUs is insufficient, and the ecological environment has not been fully established, which is understandable,” Wang Dajun commented. “However, if we cannot make progress in the high-performance MCU market, the development outlook will be unclear.” “Enhancing value is essential to avoid homogenization; achieving differentiation can help escape price wars and improve profits.” “With profits, our MCU companies can invest more money into specific applications and ecosystem development to better support customers.”

“International MCU companies spend considerable time and resources assisting customers with application issues when facing mainstream AI application deployments, providing end-to-end development solutions, including customized AI model deployment services.” “Domestic MCU companies must invest in this area to compete; it is essential for our customers to invest as well.”

As an upstream EDA company in chip design, ChipEasy can help customers reduce chip design cycles and costs, “allowing customers to invest more time and money into application optimization and ecosystem building. Ultimately, this will help escape the price war dilemma, support applications, and form real competitiveness.”

Market Opportunities Provided by AI MCU

When the MIT researchers were writing the aforementioned paper, they clearly recognized the rise of AI, but they likely did not anticipate how fast and how broadly it would develop. This trend profoundly affects application development models and the need for chips designed for specific application scenarios.

Although AIoT was heavily promoted in earlier years, two newer terms have emerged in the microcontroller field over the past two years: TinyML and AI MCU. In the past few months, not only have businesses been discussing these terms, but the media has also been promoting them.

“From a computational power perspective, TinyML may reach hundreds of GOPS, as the essence of AI MCU is still MCU, needing to consider the limitations present in mainstream MCU application scenarios, such as power consumption and cost,” Wang Dajun said when discussing TinyML and AI MCU. “I believe that compared to the past AIoT, the definitions of TinyML/AI MCU will be more restrictive.”

“For example, static face recognition, simple object recognition, voice recognition, and even simple gesture recognition can now be accomplished using AI MCU.” “AI everywhere is an irreversible trend. These are just a few scenarios that are frequently discussed. Once such technology becomes more accessible, the diverse ingenuity of developers will explode in applications.”

“The wave of AI and big data presents a great opportunity for us. In the future, logic chips in data centers, terminals, and edge devices will need to be infused with inference capabilities. Logic chips may need to be redesigned in the coming years to keep up with the endless stream of new algorithms.” In Wang Dajun’s view, TinyML brings further expansion of the MCU market capacity, which is an important market opportunity for participants in the upstream and downstream of the domestic MCU market.

“The manufacturing process of high-performance MCUs is transitioning from 40nm to 28nm/22nm, and even potentially to 14nm in the future. Due to geopolitical influences, the capacity of domestic foundries for mature processes will significantly increase, and costs will decrease, which is also an opportunity for us.”

However, the question ultimately comes back to how to create “differentiation” and how to make specialized chips cost-effective. At last year’s Import Expo, Renesas showcased an AI MCU based on Arm Helium technology that can perform human recognition without a separate accelerator; earlier this year, Infineon brought a microNPU to its MCUs, enabling gesture recognition and fruit classification at high frame rates and low latency. International giants still hold an advantage in the high-end MCU market.

Enhancing TinyML Efficiency with DSA Specialized Architecture

“Our MCU customers are now very focused on AI, and many hope to incorporate AI in their next tape-out,” Wang Dajun said regarding the market’s enthusiasm for AI MCU. “However, they are still uncertain about whether they need 64 GOPS, 128 GOPS, or more computational power—currently, they are in discussions with their customers, as this must be determined by application scenarios.”
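The sizing question in this quote comes down to simple arithmetic once the application is known. The following C sketch shows one way to estimate it. The function names, the convention of counting one MAC as two operations, the utilization factor, and the intermediate configuration steps in the usage note are assumptions, not ChipEasy's method.

```c
#include <stdint.h>

/* Throughput needed to run a model in real time, in GOPS.
 * One multiply-accumulate (MAC) is counted as two operations (assumption). */
double required_gops(uint64_t macs_per_inference, double inferences_per_second)
{
    return 2.0 * (double)macs_per_inference * inferences_per_second / 1e9;
}

/* Pick the smallest peak-GOPS option that still covers the requirement when
 * only a fraction `utilization` (0..1] of the peak is sustained.
 * `options` must be sorted in ascending order.
 * Returns 0 if no option is large enough. */
unsigned pick_config(const unsigned *options, int n,
                     double needed_gops, double utilization)
{
    for (int i = 0; i < n; i++) {
        if (options[i] * utilization >= needed_gops)
            return options[i];
    }
    return 0;
}
```

For example, a hypothetical model with 500 million MACs per inference running at 30 inferences per second needs 30 GOPS sustained. With an assumed 60% utilization and hypothetical options of 8, 16, 32, 64, and 128 GOPS, `pick_config` would select 64.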

This is why the E32N instruction set option of the ChipEasy E32 DSP IP offers a selectable 8 GOPS to 128 GOPS TPU, a built-in Tensor accelerator, to meet the different needs of TinyML scenarios. Readers who have followed previous reports on ChipEasy’s chip design tools will know that the E32 DSP is a high-performance processor core offered by ChipEasy.

The E32B base product, built on a VLIW/SIMD architecture, has the following four characteristics:

Among these, the last point, ISA extension, is particularly noteworthy, because it is what lets MCU chip design customers build their own differentiated cores. This is the product differentiation that Wang Dajun repeatedly says domestic MCU companies should pursue. Specifically, the E32 ISA supports not only basic, floating-point, and mathematical operation instructions but also custom instructions.

“Based on specific requirements of algorithms and applications, such as improving Load/Store unit application efficiency, calculating the next address increment (Load/Store with post increment), and adding bit reverse instructions to enhance FFT performance, as well as SIMD instructions for TinyML… ultimately, higher computing power and efficiency will be integrated into the processor core.”
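To make two of the examples in this quote concrete, the C sketch below shows what a bit-reverse instruction and a load with post-increment compute when written in plain software. The function names are hypothetical and this is not the E32 ISA definition. It only illustrates the operations that such custom instructions would collapse into single steps.

```c
#include <stdint.h>

/* Reverse the lowest `bits` bits of x (bits in 1..32). A radix-2 FFT uses
 * this to reorder samples. In software it costs a loop per index; a custom
 * instruction could do it in one operation. */
uint32_t bit_reverse(uint32_t x, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Load with post-increment: read the value and advance the pointer,
 * which a dedicated load/store instruction would do in a single step. */
static inline int16_t load_post_inc(const int16_t **p)
{
    int16_t v = **p;
    (*p)++;
    return v;
}

/* In-place bit-reversal permutation for an FFT of n = 2^log2n points. */
void fft_reorder(int16_t *data, unsigned log2n)
{
    uint32_t n = 1u << log2n;
    for (uint32_t i = 0; i < n; i++) {
        uint32_t j = bit_reverse(i, log2n);
        if (j > i) {              /* swap each pair only once */
            int16_t tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```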

ChipEasy itself offers four instruction set options. In addition to the standard E32B, there are E32F and E32D, which add scalar single-precision or double-precision floating-point support, and the key option that Wang Dajun particularly mentioned: E32N. “E32N better supports Tensor’s INT8 SIMD acceleration instructions.” “Specifically designed for AI MCU or TinyML.”

“E32N is a dual-core structure, which includes both E32F and a TPU.” The TPU as a Tensor accelerator achieves higher TinyML performance. As mentioned earlier, downstream MCU design customers can also add custom extension instructions based on this, “developing truly innovative and differentiated core processors that belong to them.”
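As a rough picture of the kind of work INT8 SIMD instructions accelerate, here is a plain C sketch of an INT8 dot product with a 32-bit accumulator, the core operation of quantized TinyML layers. The function name and the four-lane grouping are assumptions for illustration; the source does not describe E32N's actual lane width or instruction encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two INT8 vectors, accumulated in 32 bits so that the
 * products (each at most 128 * 128) do not overflow for practical lengths. */
int32_t dot_int8(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;

    /* Main loop: each group of four MACs is the work a hypothetical
     * 4-lane INT8 SIMD instruction would perform in one step. */
    for (; i + 4 <= n; i += 4) {
        acc += (int32_t)a[i]     * b[i];
        acc += (int32_t)a[i + 1] * b[i + 1];
        acc += (int32_t)a[i + 2] * b[i + 2];
        acc += (int32_t)a[i + 3] * b[i + 3];
    }

    /* Tail: remaining elements when n is not a multiple of four. */
    for (; i < n; i++)
        acc += (int32_t)a[i] * b[i];

    return acc;
}
```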

To demonstrate the efficiency advantages of the E32 DSP, Wang Dajun presented data on digital signal processing workloads, including GEMM (general matrix multiplication), and on several specific TinyML workloads. The metric shown was MAC utilization (“MAC Utility”), the share of total clock cycles spent on multiply-accumulate (MAC) operations. The higher this value, the more the processing unit is kept busy crunching data, which indicates higher processing efficiency.
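The metric itself is a simple ratio. The C sketch below computes it and shows how the ideal MAC cycle count for a GEMM could be derived. The function names and the `macs_per_cycle` parameter are assumptions, and no measured E32 figures are implied.

```c
#include <stdint.h>

/* MAC utilization: fraction of all clock cycles spent on MAC operations. */
double mac_utilization(uint64_t mac_cycles, uint64_t total_cycles)
{
    if (total_cycles == 0)
        return 0.0;
    return (double)mac_cycles / (double)total_cycles;
}

/* Ideal number of MAC cycles for an (m x k) by (k x n) matrix product on a
 * core that completes `macs_per_cycle` MACs per cycle (must be nonzero),
 * rounded up. Everything beyond this count (loads, stores, loop overhead,
 * stalls) lowers the utilization figure. */
uint64_t gemm_ideal_mac_cycles(uint64_t m, uint64_t n, uint64_t k,
                               uint64_t macs_per_cycle)
{
    uint64_t macs = m * n * k;
    return (macs + macs_per_cycle - 1) / macs_per_cycle;
}
```

As a usage example with made-up numbers, a 64 x 64 x 64 GEMM on a core doing 16 MACs per cycle needs 16,384 ideal MAC cycles; if the whole kernel takes 32,768 cycles, the utilization is 0.5.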

Compared to competing products, the E32 DSP shows significant advantages in computational efficiency. In FFT and FIR signal processing, as well as in image classification, anomaly detection, and similar workloads, the E32 DSP demonstrates absolute leadership in performance and efficiency over other 32-bit-width competitors.

In addition to ChipEasy’s own efforts in microarchitecture design, compilers, etc., Wang Dajun stated that DSA architecture is key to leading computational efficiency and performance. “ChipEasy has always aimed at DSA processors, with ‘data processing’ as our positioning.”

“It does not act as a general-purpose CPU coprocessor or accelerator, but as a data processor, tightly coupled with the CPU pipeline.” “After customization by customers, the final custom processor, while losing some generality, reduces power consumption and area while significantly improving performance and efficiency in specific data processing fields.”

In this way, the differentiation issue mentioned at the beginning is truly resolved.

So how can AI MCU design costs be reduced?

The design costs of processors mainly include microarchitecture, RTL, and verification costs. Wang Dajun frankly stated in an interview that, with current design solutions and methodologies, “Custom chips aimed at specific application scenarios require higher NRE and longer cycles.” This is the problem that ChipEasy’s FARMStudio EDA tool, which can automatically generate processor cores in minutes, is meant to address. We have previously written several articles introducing FARMStudio, so we won’t elaborate further here. In simple terms, it is a tool that lets users input a base core and super instructions (SIMD/VLIW custom instructions), select a preset template, and generate the DSA hardware and software toolchain with one click.

The final generated hardware includes RTL, synthesis scripts, test suites, FPGA development testing environments, RTL verification environments, etc.; software includes compilers, ISS, performance simulators, debuggers, application libraries, etc.

We previously described this as a magical process. This is especially true of the “super instructions”, one of the three inputs: after hardware and software architects analyze the application, they design instructions for algorithm hotspots and frequently used C functions, describe each instruction’s behavior as a C function, and feed these into the tool. FARMStudio includes a hardware compiler that deploys the custom instruction set directly into the processor’s pipeline, optimizing functionality, resource sharing, and more.
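To give a sense of what describing an instruction as a C function might look like, here is a hedged sketch. It is not FARMStudio's actual input format, which the source does not show: the function names, the choice of a saturating multiply-accumulate, and the FIR hotspot are hypothetical examples of an instruction described purely by its behavior.

```c
#include <stdint.h>

/* Hypothetical custom instruction described by its behavior:
 * acc + x * y, clamped to the 32-bit signed range instead of wrapping. */
int32_t mac_sat(int32_t acc, int16_t x, int16_t y)
{
    int64_t wide = (int64_t)acc + (int64_t)x * (int64_t)y;
    if (wide > INT32_MAX)
        return INT32_MAX;
    if (wide < INT32_MIN)
        return INT32_MIN;
    return (int32_t)wide;
}

/* Example hotspot that would use it: the inner loop of a FIR filter.
 * With the function above mapped to a single instruction, each tap
 * would cost one operation instead of a multiply, an add, and two
 * range checks. */
int32_t fir_tap_sum(const int16_t *samples, const int16_t *coeffs, int taps)
{
    int32_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc = mac_sat(acc, samples[i], coeffs[i]);
    return acc;
}
```

The appeal of this style is that the same C function can serve as the functional specification, the reference model for verification, and the fallback implementation on cores without the instruction.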

“This is a highly flexible and user-friendly design method, where C language describes the acceleration instructions needed for specific applications, and the tool automatically generates the microarchitecture and RTL of the processor without needing to write Verilog. This is the true meaning of a custom processor, and this design methodology represents a significant advancement for the industry,” Wang Dajun added. “We have a tool (Core Tools) that is provided to end customers. Ultimately, the application is designed by system vendors, and this tool can help them quickly develop and debug application-layer software based on the processor.”

Additionally, FARMStudio V2.0 introduces the FTOS multi-level development and verification platform, which allows different levels of simulation and verification to be completed within the same design environment, achieving “cross-domain integration and collaborative development” and addressing verification issues. ChipEasy previously provided data indicating that this significantly shortens iteration cycles compared to traditional design processes, with total costs potentially reduced by a factor of more than ten.

Ultimately, the design costs of MCUs are reduced, achieving excellent customization and application-oriented differentiation.

“We see that customers’ requirements for AI capabilities have become very clear; this may even be their basic requirement,” Wang Dajun concluded. “Is it possible to place an AI network in the processor? Given the clear CNN operators, how many cycles does the core need to compute, and what is the performance like? These are already the questions our customers are asking.”
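A first-order answer to the "how many cycles" question can be estimated before any silicon exists. The C sketch below counts the MACs in a convolution layer and converts them to cycles and latency. The struct, function names, and the `macs_per_cycle` and `utilization` parameters are assumptions for illustration. A real estimate would come from the vendor's performance simulator.

```c
#include <stdint.h>

/* Shape of one 2D convolution layer (output size already computed). */
typedef struct {
    uint32_t out_h, out_w, out_c;   /* output height, width, channels */
    uint32_t in_c;                  /* input channels */
    uint32_t k_h, k_w;              /* kernel height, width */
} conv2d_shape;

/* Each output element needs in_c * k_h * k_w multiply-accumulates. */
uint64_t conv2d_macs(const conv2d_shape *s)
{
    return (uint64_t)s->out_h * s->out_w * s->out_c *
           s->in_c * s->k_h * s->k_w;
}

/* Estimated cycles given peak MACs per cycle and the fraction of cycles
 * that actually issue MACs (0..1]. Returns 0 for invalid inputs. */
uint64_t estimate_cycles(uint64_t macs, uint32_t macs_per_cycle,
                         double utilization)
{
    if (macs_per_cycle == 0 || utilization <= 0.0)
        return 0;
    double ideal = (double)macs / (double)macs_per_cycle;
    return (uint64_t)(ideal / utilization + 0.5);   /* round to nearest */
}

/* Convert a cycle count to milliseconds at a given clock frequency. */
double latency_ms(uint64_t cycles, double clock_hz)
{
    return (double)cycles / clock_hz * 1e3;
}
```

For instance, a 3 x 3 convolution producing a 32 x 32 x 16 output from 3 input channels takes 442,368 MACs. On a hypothetical core doing 16 MACs per cycle at 50% utilization, that is roughly 55,296 cycles.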

It is evident that in the AI MCU era, AI technology is surging forward and bringing abundant opportunities for the industry and for market participants like ChipEasy. As we have observed this year, MCU companies are widely discussing AI MCUs and the shift towards data-driven thinking in end application development, and this shift has progressed significantly in just six months.

“This perfectly aligns with the philosophy of ChipEasy’s data processor; thus, the AI wave is a significant boon for us.” Perhaps in the broader context of the shift towards specialized computing and application-oriented design, tools like FARMStudio for customizable processors and configurable custom processor IP like E32 DSP are precisely what chip design companies need.
