Original by Machine Heart
On one side lies an objective technology gap; on the other, an opportunity too large to ignore.
On April 21, Nvidia announced the A30 and A10 GPUs, which combine the Ampere architecture, the latest process technology, and full software-hardware stack support to give tech companies new options for AI inference and training. The company expects the new chips to appear in many companies' cloud servers this summer.
For anyone who follows machine learning, the new generation of GPUs released every year or two is among the most closely watched developments, and Nvidia's flagship chips are the benchmark against which other chip startups measure their performance.
However, for researchers seeking the computing power best suited to artificial intelligence, GPUs are often seen as less than ideal because they are "too general-purpose." Yet Nvidia GPUs still dominate the mainstream market. After GPUs powered the explosion of deep learning, will the AI chip field see new changes?
Just as people keep rethinking how AI algorithms should be written, thinking about how chips should be built has never stopped. The next big direction in the chip field may lie in domain-specific architectures (DSAs).
Legendary computer architects and 2017 Turing Award winners John Hennessy and David Patterson argued in "A New Golden Age for Computer Architecture," published in Communications of the ACM in 2019, that once Moore's Law no longer applies, a more hardware-centric design approach will show its strength: domain-specific architectures (DSAs), programmable processors that are Turing-complete but customized for a specific class of applications.

John L. Hennessy and David A. Patterson, co-authors of the book “Computer Architecture: A Quantitative Approach.”
By definition, DSAs differ from application-specific integrated circuits (ASICs), which serve a single function and run code that is difficult to change. DSAs are often called accelerators because they speed up parts of an application relative to executing the whole thing on a general-purpose CPU, and they achieve better performance because their design is closer to the actual needs of the application. Examples include graphics processing units (GPUs), neural network processors for deep learning, and processors for software-defined networks (SDNs). In their target domains, DSAs are more efficient and consume less energy.
Typically, a DSA processor built for AI inference cannot handle general-purpose high-performance computing, ray tracing, and similar tasks, but unlike an ASIC it is not limited to a few fixed algorithms. Within AI, a DSA chip can be quite versatile, supporting NLP, computer vision, and speech workloads, and covering the major machine learning frameworks through compilers such as TVM.
If the technical vision of the architecture masters is the sufficient condition for DSAs to establish themselves, then tech companies' demand for AI computing power is the necessary one.
It remains very difficult to build a GPU that matches Nvidia's performance by any route. But in the new, data-center-centric era of the Internet, the scale of leading Chinese Internet companies has brought the industry unprecedented scenarios for deploying AI. If those deployment needs can be identified precisely and efficient AI accelerators built for them, the result could not only greatly increase the value of machine learning but also open up new markets.
In that case, pinning down the application direction becomes the key to whether a DSA can succeed. Today, the businesses for which tech companies need deep learning inference include recommendation systems, content moderation, AI education, AI customer service, and text and image translation. Around these businesses, Internet companies have generated enormous demand for computing power.
For a semiconductor company, a chip that can handle these tasks must be designed around customers' application scenarios and underlying needs, implemented efficiently, and competitive on delivery cost, maintenance and service, iteration speed, software friendliness, and even sales strategy.
Beyond architecture, another opportunity lies in instruction sets. The rise of RISC-V is also reshaping the chip field: its modularity and extensibility are a natural match for the flexible, efficient designs DSAs require.
RISC-V is a free and open reduced instruction set architecture suited to building microprocessors and microcontrollers. It was first proposed in 2010 by Professor Krste Asanovic of the University of California, Berkeley, together with Andrew Waterman and Yunsup Lee, with the support of computer architecture master David Patterson. The architecture is free for anyone to develop and use, including in commercial chip implementations.
This January, foreign media reported that top chip designer Jim Keller joined the startup Tenstorrent as CTO and board member.

Tenstorrent designs heterogeneous AI SoCs for high-performance training and inference. The company developed its own Tensix processor cores optimized for machine learning; to run conventional workloads, its SoCs also use SiFive's new Intelligence X280 core, a 64-bit RISC-V core with the 512-bit-wide RISC-V vector extension (RVV).
Coincidentally, the American chip design company Pixilica has worked with the RV64X team to propose a new graphics instruction set that merges the CPU and GPU ISAs for 3D graphics and media processing, along with an open-source reference implementation for FPGAs. Roddy Urquhart, senior marketing director at the European tool developer Codasip, says this is one of the strengths of the RISC-V ecosystem: "If you want to create a domain-specific processor, one of the key tasks is choosing an instruction set architecture (ISA) that meets your software requirements."
"Some companies choose to create an instruction set from scratch, but then you may pay the price of porting software. The open RISC-V ISA offers a good starting point and a software ecosystem," Urquhart said. The RISC-V ISA is modular by design: processor designers can adopt any of the standard extensions and also create their own custom instructions while remaining fully RISC-V compliant.
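This extensibility is concrete at the encoding level: the RISC-V base spec reserves major opcodes (custom-0 through custom-3) for vendor-defined instructions, so a designer can slot new operations into the standard 32-bit R-type format without colliding with ratified extensions. A minimal sketch of encoding such an instruction word; the "dot-product accumulate" instruction name and its funct fields are hypothetical, while the field layout and the custom-0 opcode come from the RISC-V unprivileged spec:

```python
def encode_r_type(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack a 32-bit RISC-V R-type instruction word.

    Layout (MSB to LSB): funct7[31:25] rs2[24:20] rs1[19:15]
                         funct3[14:12] rd[11:7]   opcode[6:0]
    """
    assert 0 <= opcode < 2**7 and 0 <= funct7 < 2**7
    assert 0 <= rd < 32 and 0 <= rs1 < 32 and 0 <= rs2 < 32
    assert 0 <= funct3 < 8
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | \
           (funct3 << 12) | (rd << 7) | opcode

# Major opcode reserved by the spec for vendor-defined instructions
CUSTOM_0 = 0b0001011

# Hypothetical custom instruction: a0 += dot(a1, a2)
word = encode_r_type(CUSTOM_0, rd=10, funct3=0, rs1=11, rs2=12, funct7=0)
print(f"{word:#010x}")  # prints 0x00c5850b
```

Because the custom opcodes are guaranteed never to be claimed by standard extensions, a core implementing this instruction still runs unmodified RISC-V binaries, which is exactly the compatibility property Urquhart describes.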
"Having chosen a starting point for a domain-specific processor, you then need to work out which special instructions will meet your computational requirements. That takes careful analysis of the software that will run on the processor core. Profiling tools can identify the computing hotspots, and once these are understood, designers can create custom instructions to address them."
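The first step of that workflow, finding the hotspots before deciding what to harden into silicon, can be sketched with an ordinary software profiler. Here the kernel is a hypothetical stand-in for real application code; actual DSA flows would profile with the target toolchain's tools rather than Python:

```python
import cProfile
import io
import pstats

def candidate_kernel(n):
    # Stand-in for a compute hotspot in the application being analyzed
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
candidate_kernel(200_000)
profiler.disable()

# Rank functions by cumulative time; the top entries are the
# candidates worth considering for custom instructions
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```

Once a function like `candidate_kernel` dominates the profile, the designer can ask whether its inner loop maps onto one or a few custom instructions, which is the analysis-then-specialize loop Urquhart outlines.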
Although Arm-based processors appear in almost every smartphone and billions of other electronic devices, more and more people are turning their attention to RISC-V. Prominent Linux developer Arnd Bergmann believes that by 2030 three architectures, Arm, RISC-V, and x86, will hold most of the market. For DSAs, however, RISC-V clearly has significant advantages.
Is building their own chips the most reasonable path for tech companies? Some have already produced chips deeply integrated with their own business and bound to their software systems for AI training and inference, including Amazon's Inferentia and Trainium and Google's TPU. But this computing power is specialized to each company's own systems, with a narrow scope.
The recent moves of several large companies show how thinking has shifted: Baidu has spun off its AI chip business as an independent company, while Tencent, ByteDance, and others have chosen to invest in startups, hoping to cultivate new systems aimed at the broader market.
Since deep learning took off around 2010, we have watched chips such as Cambricon's and Ascend emerge and marveled at the technical prowess of Google and Amazon, yet amid the endless demand for computing power, the era of domestic AI chips truly taking off still seems not to have arrived.
Recently, however, developments in instruction sets, architectures, and the deployment of AI applications have changed the situation. As ByteDance and others invest in AI chip startups that have quickly achieved successful tape-outs, dedicated inference chips are delivering good results, and a new direction for DSA chips in the tech field is taking shape.
Across a chip product's lifecycle, the startup that best understands the scenario, defines the most suitable solution, and ships it fastest can gain a meaningful lead. And if this new approach yields sufficiently efficient computing power, developers at tech companies will be able to build more AI applications on top of it.
By current estimates, the domestic market will demand 200,000 to 300,000 AI inference cards per year. For domestic entrepreneurs, this may be an unprecedented opportunity, and strong engineering teams will emerge from the new competition.
© THE END
For reprints, please contact this public account for authorization
Submissions or inquiries: [email protected]