Groq’s Stunning Emergence
Time flies. Many of you may remember that in January and February of last year, an AI company with a rather peculiar name, Groq, made headlines with its self-developed LPU (Language Processing Unit) chip system, showcasing extremely low-latency, high-performance inference on the Llama 2 70B model (with actual performance exceeding GPU clusters by dozens of times in Tokens/s). It also launched a large language model service at an incredibly low price per million tokens. Groq’s founder, Jonathan Ross, boldly claimed he would “reduce computing costs to zero”, which, while stunning, also raised some doubts. For instance, Jia Yangqing (who seems to be everywhere -_-) calculated that Groq’s inference system would need hundreds of LPU chips to serve the 70B model, suggesting that the inference cost would far exceed the service price. A flurry of (Chinese) AI/semiconductor media jumped on these doubts, further amplifying Groq’s visibility. I prefer not to ride the hype train, so although I had researched this case in depth, I had no intention of judging any product’s value without empirical evidence. Only recently did I take on several tasks involving chip definition and architecture design for optimizing large model inference, and reflecting on Groq’s journey from that perspective may yield some insights.
The Temptation of Speed: Groq’s Past
Unlike many, I have known about Groq since its inception. A senior executive from my former employer, an FPGA company, whom I know fairly well, jumped to this then relatively unknown AI startup. Of course, Groq’s technical talent comes primarily from the team behind Google’s proprietary AI chip, the TPU. From the snippets of information Groq has publicly shared about its more exotic technology, I can identify traces of ASIC/FPGA logic design thinking (rather than CPU/GPU computational architecture design).
After OpenAI, NVIDIA, and others ignited the commercial value of AI, a logical investment thesis emerged: once a large language model completes pre-training and becomes sufficiently intelligent, the growth of training workloads should level off, while the inference workloads generated by AI users will keep growing, potentially by an order of magnitude. This implies a larger market for computing infrastructure dedicated to inference workloads: optimized for inference tasks, with higher performance to support concurrency and lower operating energy consumption and cost.
In pursuit of extreme inference performance, Groq’s LPU (I am speculating here, based on limited public information) has chosen an architecture completely different from GPUs and other NPUs: it places a large amount of static memory (SRAM) next to parallel tensor computing units, so that the data needed for each vector computation sits close to the units that consume it. This is essentially a fundamental architectural feature of FPGAs (though the basic units of FPGAs are logic units and multiply-accumulate units rather than tensor units) and the source of their efficiency. Because memory access is partitioned down to the granularity of individual tensor units, the effective access width becomes extremely wide, which yields extremely high aggregate bandwidth between compute and memory. However, no efficiency comes for free. With the inflated parameter counts of today’s large language models, this architecture requires that the data for each computation be delivered to the adjacent storage units in time for every compute clock cycle. The model’s mathematical and timing structure must be compiled into long sequences of computation instructions, the parameters and data must be placed at specific physical storage locations, and everything must be organized into a seamless pipeline. This is an extremely complex task, and the result cannot be reused across different models. Therefore, I boldly speculate that Groq’s compiler requires lengthy, model-specific manual arrangement and repeated optimization for each model, akin to writing an assembly program of over a hundred million lines for every new model, rather than working from a simple, abstract description such as CUDA code or an ONNX file.
What About Costs? Questions from the “Traditional” and Possible Answers.
As mentioned earlier, doubts arose the moment Groq burst onto the scene. The technical doubts stem from the fact that each LPU has “only” 230MB of SRAM, whereas models with tens to hundreds of billions of parameters need tens to hundreds of GB just to hold their weights, even at INT8/FP8 precision. Under normal GPU/TPU operating logic, all of those parameters must reside in memory that the compute units can access directly, so dozens to hundreds of GB of such memory would be necessary. Thus, as Jia Yangqing’s back-of-the-envelope calculation suggests, even a 70B model would require hundreds of LPUs to operate effectively. This leads to the conclusion that the inference cost of Groq’s model service is far higher than its selling price.
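The arithmetic behind this objection can be sketched in a few lines. This is only an illustration of the capacity argument, not a reconstruction of anyone’s actual calculation: it takes the 230MB and 70B figures from the text, assumes decimal megabytes, and counts weights only (no KV cache, activations, or instructions). The function name lpus_to_hold_weights is made up for this example.

```python
import math

# Figures taken from the text; decimal megabytes are an assumption.
SRAM_PER_LPU_BYTES = 230 * 10**6   # "only" 230MB of SRAM per LPU
PARAMS_70B = 70 * 10**9            # parameter count of a 70B model


def lpus_to_hold_weights(params: int, bytes_per_param: int,
                         sram_bytes: int = SRAM_PER_LPU_BYTES) -> int:
    """Minimum number of chips whose combined SRAM can hold all weights.

    Hypothetical helper: counts weights only and ignores KV cache,
    activations, and instruction storage.
    """
    return math.ceil(params * bytes_per_param / sram_bytes)


int8_chips = lpus_to_hold_weights(PARAMS_70B, 1)   # INT8/FP8: 1 byte per weight
fp16_chips = lpus_to_hold_weights(PARAMS_70B, 2)   # FP16: 2 bytes per weight

print(f"INT8/FP8: at least {int8_chips} LPUs")
print(f"FP16:     at least {fp16_chips} LPUs")
```

Under these assumptions the count lands in the hundreds of chips, which is the scale the critics cite. Note that the estimate presumes every weight must stay resident in SRAM at all times.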
However, this is a software-centric way of thinking.
From a hardware perspective, the thinking is different: high efficiency starts with keeping as many computing resources as possible busy in every clock cycle. With this mindset, it becomes clear that whether your on-chip computing resources number in the thousands or tens of thousands, only a similar magnitude of data participates in calculations during any one cycle (not all of the model parameters and intermediate results, which run to dozens or hundreds of GB). If you can guarantee that the required data is in front of the computing resources by the next clock cycle, then MB-level compute caches are enough. Therefore, I believe Groq’s LPU architecture can indeed achieve efficient computation with little “memory”, provided another part of the system handles data arrangement and transmission, refreshing the 230MB of SRAM in time so that the computing units receive the data and parameters they need in each clock cycle. Essentially, this can be achieved through extensive duplication in data arrangement: if a piece of data D is used by resource A in cycle 1 and by resource B in cycle 2, the fastest approach is to keep two copies, one in SRAM_a next to resource A and one in SRAM_b next to resource B. Thus, I again boldly speculate that the computation programs Groq’s compiler generates for large models are far larger, in length and instruction count, than the models themselves, but that they are stored in the host’s memory or even on disk and streamed into the LPU’s SRAM page by page in ping-pong fashion, which would allow the LPU’s computational efficiency to approach its theoretical limit. In this light, the unique architecture of the LPU could indeed achieve efficient inference at a lower cost.
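To make the ping-pong idea concrete, here is a minimal software sketch of double buffering. It is purely illustrative and says nothing about how Groq’s hardware or compiler actually works: the function run_ping_pong, the page layout, and the compute callback are all hypothetical, and the load that real hardware would overlap with computation is performed sequentially here.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def run_ping_pong(pages: Sequence[List[float]],
                  compute: Callable[[List[float]], float]
                  ) -> Tuple[List[float], int]:
    """Stream host-resident pages through two small 'on-chip' buffers.

    Returns the per-page results and the peak number of values that were
    resident in the two buffers at any one time.
    """
    buffers: List[Optional[List[float]]] = [None, None]
    results: List[float] = []
    peak_resident = 0
    if not pages:
        return results, peak_resident

    buffers[0] = list(pages[0])              # prefetch the first page
    for i in range(len(pages)):
        active = i % 2                       # buffer being computed on
        standby = 1 - active                 # buffer being refilled
        # In hardware this refill would overlap the compute step below;
        # this sketch simply performs it first.
        buffers[standby] = list(pages[i + 1]) if i + 1 < len(pages) else None
        resident = sum(len(b) for b in buffers if b is not None)
        peak_resident = max(peak_resident, resident)
        results.append(compute(buffers[active]))
    return results, peak_resident


# Hypothetical usage: three 2-value pages, with "compute" reduced to a sum.
host_pages = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
page_sums, peak = run_ping_pong(host_pages, sum)
print(page_sums, peak)
```

The point of the sketch is the footprint: every page gets processed, yet no more than two pages are ever resident at once, which is why a small SRAM could in principle serve a much larger program as long as the refill keeps pace with the compute.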
(Survival disclaimer: Note that this speculation has not been confirmed by Groq; it is merely my personal conjecture)
Supporting Evidence for the Conjecture: Groq’s Business Model
Groq’s go-to-market strategy is also unique. Although it has designed and produced the LPU, a chip built specifically for high-performance, efficient large model inference, it repeatedly states that “it does not sell chips (except to third parties for research/scientific purposes)”. Furthermore, Groq maintains that its advantage comes not from a single card but from large-scale system design, particularly from “the stack from chip to system”, which fits neatly with my earlier conjectures about Groq’s architecture and performance optimization.
If the aforementioned conjectures about the LPU architecture and optimization strategy are true, then the hardest part of system optimization lies in the model compilation and instruction/data arrangement phase. No customer can carry out this compilation and arrangement work outside Groq, and every time a new large language model is introduced, the same massive effort must be repeated from scratch. Groq can only absorb this hidden cost internally, having its “in-house experts” resolve it in advance. Fortunately, there are only a few large language models that users are most willing to pay for; once the adaptation and optimization work for these is done, Groq could outperform all competitors, potentially attracting thousands or millions of customers.
Thus, Groq has transformed from an AI chip company (which I understand was founded with the intention of being an AI chip company) to an AI online model service company that provides access services without selling chips, because no buyer can effectively utilize them. Its astonishing technological advantage likely relies on both the extreme performance of the chip and the massive multi-copy data memory/storage/transmission of the (main control) system.
Conclusion: Success Does Not Necessarily Lie with Groq, but Perhaps Success Inevitably Includes Groq
From a business perspective, I find it difficult to affirm or deny Groq’s chances of commercial success. As an analogy, imagine that during a gold rush Groq invented an efficient but hard-to-learn mining technique, yet owns no gold mine of its own, while less efficient but easier methods are available to most workers. Do you think it can succeed?
However, from a technological innovation perspective, I believe Groq’s LPU architecture undoubtedly has the potential to be a disruptive innovation. In the previous article of this series, 【Fried Chips】 Semiconductor Observation Series: Intel’s “Falling Behind Code”, I referenced Clayton Christensen’s theory of sustaining and disruptive innovation. This empirical theory suggests that disruptive innovations often underperform existing solutions on some established measures when they first appear. In the case of Groq’s LPU architecture, this shows up as compilation complexity, a shortcoming that is difficult to improve. However, the advantages of disruptive innovation, particularly in cost, can often change the world, and we can see such “star quality” in Groq’s LPU as well.
When defining and designing a chip for large model inference, if you are an industry leader, the logical move is to raise clock frequency and memory bandwidth on top of the current GPU and training NPU blueprints, cut the features that support pre-training workloads but are no longer needed for pure inference, and collect the benefits of Moore’s Law at newer process nodes. However, if you are leading product definition at a startup, the opportunity to disrupt the market lies in radical architectural designs like Groq’s LPU. Even if Groq has not yet succeeded, it at least has one foot on the threshold of innovative success.
— End of Analysis
【Fried Chips】 Semiconductor Observation Series, based on market realities, focusing on theoretical research, grounded in personal viewpoints. I welcome corrections from all experts and look forward to your attention to my other articles on the personal public account *Chips & Fine Wines*.