In the era of artificial intelligence, the ultimate measure of success for an AI product is how much it improves the efficiency of our lives. As AI technology moves from the cloud to the edge, the engineering problems that need optimization become increasingly complex, and effective evaluation before chip design becomes ever more important to the ultimate success of the product. However, because AI chips are deployed in such complex application scenarios, the industry needs accurate, professional evaluation tools.

One of the challenges in the AI industry: AI chips cannot keep up with the speed of algorithms

As early as 2019, a report from Stanford University pointed out that AI's demand for computing power is growing faster than chips are improving: "Before 2012, AI development tracked Moore's Law closely, with computing power doubling every two years; after 2012, AI's computing power has doubled every 3.4 months."

When general-purpose processors could no longer keep up with the demands of AI applications, dedicated processors for AI computation, commonly referred to as "AI chips," were born. Since AI algorithms surpassed human scores in visual recognition in 2015, the industry's attention to AI chips has increased significantly, driving the development of related IP technologies, accelerating next-generation processors and memories, and delivering higher-bandwidth interfaces to keep pace with AI algorithms. Figure 1 shows the visible drop in typical AI error rates after the introduction in 2012 of backpropagation-trained modern neural networks running on NVIDIA's compute-heavy GPU engines.

▲ Figure 1: After the introduction of modern neural networks in 2012, AI classification error rates fell rapidly, dropping below the human error rate starting in 2015

As AI algorithms grow ever more complex, they can no longer be executed as-is on SoCs designed for consumer products. Techniques such as pruning and quantization must be used to compress the models and reduce the memory and computation the system requires, but this compression can hurt accuracy. The engineering challenge is therefore: how can compression techniques be applied without sacrificing the accuracy the AI application requires?

Beyond the growing complexity of AI algorithms, the amount of data to be processed per inference has also surged as input data grows. Figure 2 shows the memory and computation required for an optimized vision algorithm. The algorithm was designed to occupy a relatively small memory footprint of 6MB (the memory requirement of SSD-MobileNet-V1). In this specific example, as pixel count and color depth increase, the memory requirement grows from 5MB to over 400MB for the latest image-capture resolutions.

The latest Samsung smartphone CMOS image sensors already support up to 108MP. In theory, such cameras could require about 40 TOPS of performance at 30fps and over 1.3GB of memory. Today's ISP technology and domain-specific AI algorithms cannot yet meet these requirements, and 40 TOPS is not yet achievable in a mobile phone. Nevertheless, the example highlights the complexity and challenges of edge devices, which are in turn driving the development of sensor-interface IP.
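To make the scale of the data problem concrete, the short sketch below estimates uncompressed frame size and raw sensor bandwidth at 30 fps for a few resolutions. The resolutions and bit depths are assumed example values chosen for illustration (they are not figures from this article), and real pipelines downscale and compress the data long before the neural network sees it.

```python
# Rough arithmetic showing how sensor resolution drives raw data volume.
# The resolutions and bit depths are assumed example values, not figures
# from the article; real pipelines downscale and compress long before
# the neural network sees the data.

def frame_megabytes(width: int, height: int, bits_per_pixel: int) -> float:
    """Size of one uncompressed frame in megabytes."""
    return width * height * bits_per_pixel / 8 / 1e6

def sensor_gbps(width: int, height: int, bits_per_pixel: int, fps: int) -> float:
    """Raw sensor-to-SoC data rate in gigabits per second (no compression)."""
    return width * height * bits_per_pixel * fps / 1e9

examples = {
    "300x300 detector input, 8-bit RGB": (300, 300, 24),
    "12MP capture, 10-bit raw":          (4000, 3000, 10),
    "108MP capture, 10-bit raw":         (12000, 9000, 10),
}

for name, (w, h, bpp) in examples.items():
    print(f"{name:36s} frame ~ {frame_megabytes(w, h, bpp):7.1f} MB, "
          f"~ {sensor_gbps(w, h, bpp, 30):6.1f} Gbps at 30 fps")
```

Even moderate capture resolutions turn into multi-gigabit raw sensor streams at 30 fps, which is precisely the pressure that sensor-interface IP is evolving to absorb.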
On the sensor-interface side, MIPI CSI-2 includes features dedicated to this problem, and MIPI C/D-PHY continues to increase bandwidth so it can carry the data produced by the latest CMOS image sensors with their hundreds of millions of pixels.

▲ Figure 2: Memory required by SSD-MobileNet-V1 as the number of input pixels increases

The current solution is to compress both the AI algorithms and the images, which makes chip optimization extremely complex, especially for SoCs with limited memory, limited processing capacity, and small power budgets.

Another challenge in the AI industry: evaluating AI chips

AI chip manufacturers typically run benchmark tests on their chips, and today's SoCs are measured with a variety of metrics. First, TOPS (trillions of operations per second) is the headline performance metric, but it only gives a clear picture of a chip's capability when you also know the types and quality of operations the chip can handle. Inferences per second is another major metric, but interpreting it requires knowing the operating frequency and other parameters (a simple illustration of how these figures relate appears at the end of this section). The industry has therefore developed additional benchmarks to help evaluate AI chips.

MLPerf/MLCommons and ai-benchmark.com are tools for standardized benchmarking of AI chips. MLCommons primarily defines measurement rules for chip accuracy, speed, and efficiency, which are essential for understanding how well a chip handles different AI algorithms. As noted earlier, without an accuracy target it is impossible to make sound trade-offs between chip throughput and the level of compression applied. MLCommons also provides common datasets and best practices.

The Computer Vision Lab in Zurich, Switzerland, also benchmarks mobile processors and publishes its results together with chip requirements and other information for reuse; the suite includes 78 tests and more than 180 performance metrics.

Stanford University's DAWNBench, which supports the work of MLCommons, goes beyond an AI performance score and also measures the total time a processor needs to complete AI training and inference. This addresses a key objective in chip design engineering: reducing total cost of ownership. AI processing time determines how long chips must be owned, or cloud AI capacity rented, for a given workload at the edge or in the cloud, which makes these results more useful for an organization's overall AI chip strategy.

Another popular benchmarking approach is to use common open-source graphs and models, but these also have drawbacks. First, ResNet-50 is benchmarked on a 256×256 dataset, which is not necessarily the resolution used in the final application. Second, the model is older and has fewer layers than many newer models. Third, the model can be hand-optimized by processor IP vendors, which does not reflect how the system will perform on other models. Beyond ResNet-50, many open-source models are available that reflect the latest progress in the field and provide good performance indicators.

Finally, custom graphs and models built for a specific application are becoming increasingly common. Ideally, this is the best way to benchmark an AI chip and to optimize it sensibly for lower power consumption and higher performance.
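Here is the simple illustration promised above: a headline TOPS figure does not by itself predict inferences per second, because throughput also depends on the model's operation count and on the utilization the hardware and compiler actually achieve. All numbers below are assumed examples, not benchmark results.

```python
# Why a peak TOPS figure alone does not predict inferences per second:
# throughput also depends on the model's operation count and on the
# utilization the hardware and compiler actually achieve. All numbers
# below are assumed examples, not benchmark results.

def inferences_per_second(peak_tops: float, model_gops: float, utilization: float) -> float:
    """Estimate throughput from peak compute, ops per inference, and achieved utilization."""
    effective_ops_per_second = peak_tops * 1e12 * utilization
    return effective_ops_per_second / (model_gops * 1e9)

# Hypothetical accelerator with 10 peak TOPS running a model that costs
# roughly 8 GOPs per inference (in the ballpark of ResNet-50).
for utilization in (0.9, 0.5, 0.2):
    ips = inferences_per_second(peak_tops=10.0, model_gops=8.0, utilization=utilization)
    print(f"utilization {utilization:.0%}: ~{ips:,.0f} inferences/s")
```

Published TOPS numbers therefore need to be read alongside the model, numerical precision, and utilization assumptions behind them.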
SoC developers also have different goals: some target high-performance markets, others lower-performance ones, and some build general-purpose AI devices while others build application-specific (ASIC) devices. For SoCs whose target AI models are not yet known, a good mix of custom models and openly available models gives a solid indication of performance and power consumption, and this combination is the most common approach in today's market. Once an SoC reaches the market, however, the newer benchmark standards described above become relevant for head-to-head comparisons.

Evaluation before edge AI chip design is particularly important

More and more computation is happening at the edge. Given the complexity of edge optimization, today's AI solutions must co-design software and silicon. To do this, teams need the right benchmarking techniques as well as tool support, so designers can accurately explore different optimization options for systems, SoCs, or semiconductor IP, investigating process nodes, memory, processors, interfaces, and more.

Synopsys provides effective, domain-specific tools to simulate, prototype, and benchmark IP, SoCs, and broader systems.

First, Synopsys' HAPS® prototyping solution is typically used to demonstrate the capabilities and trade-offs of different processor configurations. It can also reveal under what circumstances something other than the processor becomes the bandwidth bottleneck in an AI system: what is the optimal bandwidth for sensor input (via MIPI) or memory access (via LPDDR) across different workloads?

Second, Synopsys' ZeBu® emulation system can be used for power analysis. ZeBu Empower completes power verification cycles in hours using real software workloads for AI, 5G, data center, and mobile SoC applications, an approach that has proven superior to static analysis and/or simulation of AI workloads.

Users can also explore system-level SoC design with Synopsys' Platform Architect. Originally used for memory, processing-performance, and power exploration, Platform Architect is increasingly used to understand the system-level performance and power of AI. Using pre-built models of LPDDR, ARC processors for AI, memory, and more, sensitivity analysis can be run to determine the optimal design parameters (a simplified illustration of that kind of sweep appears at the end of this article).

Synopsys has a team of experienced professionals developing AI processing solutions, from ASIP Designer to ARC processors. Its silicon-proven foundation IP portfolio, including memory compilers, is already widely used in AI SoCs. Its interface IP for AI applications spans sensor inputs (I3C and MIPI), chip-to-chip connectivity (CXL, PCIe, and die-to-die solutions), and networking (Ethernet).

Co-design of software and silicon has become a reality, and choosing the right tools and expertise is crucial. Synopsys is drawing on its expertise, services, and mature IP to give customers the best path to optimized AI chips in an ever-changing environment.
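To close, here is a generic, roofline-style sketch of the kind of sensitivity analysis mentioned above: hold the compute budget fixed, sweep the memory bandwidth, and see where a workload flips from memory-bound to compute-bound. It is a deliberately simplified, hypothetical model for illustration, not a Platform Architect or HAPS flow, and every constant in it is an assumption.

```python
# A generic roofline-style sensitivity sweep: hold the compute budget
# fixed, vary memory bandwidth, and see where the workload stops being
# memory-bound. A simplified hypothetical model for illustration, not a
# Platform Architect or HAPS flow; every constant here is an assumption.

def attainable_tops(peak_tops: float, bandwidth_gb_s: float, ops_per_byte: float) -> float:
    """Achievable throughput is capped by either compute or memory traffic."""
    bandwidth_limit_tops = bandwidth_gb_s * 1e9 * ops_per_byte / 1e12
    return min(peak_tops, bandwidth_limit_tops)

PEAK_TOPS = 10.0      # assumed accelerator compute budget
OPS_PER_BYTE = 100.0  # assumed arithmetic intensity of the AI workload

for bandwidth in (8, 16, 32, 64, 128):  # LPDDR-class bandwidths in GB/s
    perf = attainable_tops(PEAK_TOPS, bandwidth, OPS_PER_BYTE)
    limiter = "memory-bound" if perf < PEAK_TOPS else "compute-bound"
    print(f"{bandwidth:4d} GB/s -> {perf:5.1f} TOPS ({limiter})")
```

Even a toy sweep like this shows why interface and memory choices deserve the same scrutiny as the AI processor itself when evaluating an edge AI SoC.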