Imagine you are editing a video on your smartphone and need a fitting sound effect, or you want a custom sound for a ringtone, alarm, or social media post. You no longer need to search online or buy audio clips: enter a description such as “gentle waves at sunset,” and within seconds your device generates a matching sound, even without an internet connection. Thanks to a new collaboration between Arm and Stability AI, generating audio entirely on-device has become a reality.
Arm and Stability AI Collaborate to Accelerate Text-to-Audio Response Times
Stability AI develops AI models for image, video, 3D, and audio. Arm KleidiAI provides optimized, performance-critical routines (microkernels) built specifically for Arm CPUs. Integrating KleidiAI with the XNNPACK library and the ExecuTorch framework, together with Stability AI’s own optimizations, has significantly improved the performance of Stability AI’s open text-to-audio model, “Stable Audio Open.”
The result is a reduction in text-to-audio generation time from several minutes to just seconds, a 30-fold increase in response speed. The Stable Audio Open model runs entirely on Arm CPU-based smartphones and does not require an internet connection, a pioneering achievement for text-to-audio AI.
Stability AI utilizes the automatic acceleration capabilities of KleidiAI to enhance model response times, thereby improving on-device AI performance without compromising quality. The performance boost provided by KleidiAI allows users of the Stable Audio Open model to save time and costs without additional development effort. Arm and Stability AI will continue to collaborate to achieve further performance leaps, delivering an even better AI user experience.
These performance improvements show that targeted hardware and software integration can make previously infeasible AI applications viable on mobile devices, opening up opportunities for future innovation. Arm technology powers 99% of smartphones globally, meaning billions of smartphone users can now access advanced AI audio capabilities.
Collaborating to Tackle Complex AI Challenges
The Stable Audio Open model is highly efficient, but running it directly on a smartphone’s CPU is still a challenge. In initial attempts, generating a single audio sample took over four minutes, which is unacceptable for end users.
By collaborating with Arm, Stability AI distilled the model, reducing its parameters to a scale suitable for mobile devices. With the distilled model and the acceleration provided by KleidiAI through its integration with XNNPACK and ExecuTorch, audio segments can now be generated in seconds on mobile Arm CPUs.
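As a rough sanity check, the two figures quoted in this article (a baseline of over four minutes and a 30-fold speedup) can be combined to see what "seconds" implies. The sketch below is only illustrative arithmetic: it assumes the baseline is exactly 240 seconds, which is a lower bound on "over four minutes", and the function name is hypothetical rather than part of any Arm or Stability AI tooling.

```python
def implied_latency_seconds(baseline_seconds: float, speedup: float) -> float:
    """Return the generation time implied by a baseline time and a speedup factor."""
    if baseline_seconds <= 0 or speedup <= 0:
        raise ValueError("baseline_seconds and speedup must be positive")
    return baseline_seconds / speedup


# Assumption: "over four minutes" is approximated as exactly 240 s (a lower bound).
BASELINE_SECONDS = 4 * 60
# The 30-fold figure is the speedup quoted in the article.
SPEEDUP = 30

optimized = implied_latency_seconds(BASELINE_SECONDS, SPEEDUP)
print(f"Implied generation time: {optimized:.0f} s")  # prints "Implied generation time: 8 s"
```

Because the stated baseline is "over" four minutes, the implied figure of about 8 seconds should be read as an approximate floor that is consistent with the "seconds" claim, not as a measured result.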
Stability AI CEO Prem Akkaraju stated, “As more professional creatives and businesses adopt generative AI to enhance their workflows, it is crucial that our models and workflows are accessible everywhere for builders and creators. We are excited to partner with Arm on this. From servers to smartphones, the Arm platform is widely adopted across the ecosystem, and Arm’s commitment to accelerating AI models across various mainstream frameworks by integrating Arm Kleidi into the software stack makes them our top choice.”
The Rise of Text-to-Audio AI
Since 2022, Stability AI has been at the forefront of generative AI development, having made waves with its industry-leading image model, Stable Diffusion. Building on the success of Stable Diffusion, the company subsequently launched Stable Audio, one of the first fully licensed audio models designed to generate high-quality music and sound effects from text prompts. These AI models rank among the top on major platforms like Hugging Face, boasting millions of users and forming an active tech community.
Advanced Audio AI Experience for Everyone
This achievement is just the beginning of the collaboration between Arm and Stability AI, which have planned further performance optimizations aimed at giving users an even better experience. By working together, the two companies are laying the groundwork for on-device AI in audio, image, video, and 3D, reshaping how everyone creates content and interacts with digital media. By distilling advanced models and deploying optimized software on commonly used hardware, they are paving the way for a future where everyone can enjoy advanced AI applications, models, and experiences directly on the devices in their pockets.