Implementing Generative AI with NVIDIA Jetson

Recently, NVIDIA launched the Jetson Generative AI Lab, enabling developers to explore the limitless possibilities of generative AI in the real world using NVIDIA Jetson edge devices. Unlike other embedded platforms, Jetson can run large language models (LLMs), vision transformers (ViTs), and Stable Diffusion locally, including the Llama-2-70B model, which runs interactively on Jetson AGX Orin.

Figure 1. Inference performance of leading generative AI models on Jetson AGX Orin

To quickly test the latest models and applications on Jetson, use the tutorials and resources provided by the Jetson Generative AI Lab. You can then focus on uncovering the untapped potential of generative AI in the physical world.

This article will explore the exciting generative AI applications that can be run and experienced on Jetson devices, all of which are also detailed in the lab’s tutorials.

Edge Generative AI

In the rapidly evolving field of AI, the following generative models have garnered significant attention:

  • LLMs capable of engaging in human-like dialogue.

  • Visual language models (VLMs) that enable LLMs to perceive and understand the real world through cameras.

  • Diffusion models that can transform simple text instructions into stunning images.

These significant advancements in AI have sparked the imagination of many. However, if you delve deeper into the infrastructure supporting these cutting-edge model inferences, you will find they are often “tethered” to the cloud, relying on the processing power of their data centers. This cloud-centric approach has largely stunted the development of edge applications that require high bandwidth and low latency data processing.

Video 1. NVIDIA Jetson Orin Brings Powerful Generative AI Models to the Edge

Running LLMs and other generative models locally is a growing trend in the developer community. Thriving online communities, such as r/LocalLlama on Reddit, give enthusiasts a platform to discuss the latest advancements in generative AI technologies and their practical applications. Numerous technical articles on platforms like Medium delve into the complexities of running open-source LLMs in local setups, and some mention using NVIDIA Jetson.

The Jetson Generative AI Lab is a hub for discovering the latest generative AI models and applications, as well as learning how to run them on Jetson devices. As the field rapidly evolves, new LLMs are emerging almost daily, and the development of quantization libraries has reshaped benchmarks overnight. NVIDIA recognizes the importance of providing the latest information and effective tools. Therefore, we offer easy-to-learn tutorials and pre-built containers.

What makes all this possible is jetson-containers, a carefully designed and maintained open-source project for building containers for Jetson devices. The project uses GitHub Actions to build 100 containers with CI/CD. These containers let you quickly test the latest AI models, libraries, and applications on Jetson without the hassle of configuring the underlying tools and libraries.

With the Jetson Generative AI Lab and jetson-containers, you can focus on exploring the limitless possibilities of generative AI in the real world using Jetson.

Demonstration

Here are some of the exciting generative AI applications from the Jetson Generative AI Lab that you can run on NVIDIA Jetson devices.

stable-diffusion-webui

Figure 2. Stable Diffusion Interface

A1111’s stable-diffusion-webui provides a user-friendly interface for Stability AI’s Stable Diffusion. You can use it to perform many tasks, including:

  • Text-to-image conversion: Generate images based on text prompts.

  • Image-to-image conversion: Generate images based on input images and corresponding text prompts.

  • Image inpainting: Fill in missing or obscured parts of input images.

  • Image outpainting: Extend the existing boundaries of input images.

The web application automatically downloads the Stable Diffusion v1.5 model on first launch, so you can start generating images immediately. If you have a Jetson Orin device, you can follow the tutorial and launch it with just the following commands:

git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
./run.sh $(./autotag stable-diffusion-webui)

For more information on running stable-diffusion-webui, see the Jetson Generative AI Lab tutorial. The Jetson AGX Orin can also run the newer Stable Diffusion XL (SDXL) model, and the theme image at the beginning of this article was generated using that model.
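
If you prefer scripting to the web UI, the same class of model can also be driven directly from Python. The following is a minimal sketch using Hugging Face diffusers, not part of stable-diffusion-webui; the prompt and output path are placeholders:

import torch
from diffusers import StableDiffusionPipeline

# Load the same SD v1.5 weights the web UI fetches; fp16 keeps memory modest.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Text-to-image: one prompt in, one PIL image out.
image = pipe(
    "a robot reading a book in a cozy library, warm light, detailed",
    num_inference_steps=25,
).images[0]
image.save("robot_library.png")  # placeholder output path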

text-generation-webui

Figure 3. Interacting with Llama-2-13B on Jetson AGX Orin

Oobabooga’s text-generation-webui is another popular Gradio-based web interface for running LLMs locally. While the official repository provides one-click installers for various platforms, jetson-containers offers an even simpler way to run it on Jetson.

Through this interface, you can easily download models from the Hugging Face model hub. As a rule of thumb, the Jetson Orin Nano can generally accommodate 7-billion-parameter models with 4-bit quantization (at 4 bits per weight, a 7B model needs roughly 3.5 GB for its weights), the Jetson Orin NX 16GB can run 13-billion-parameter models, and the Jetson AGX Orin 64GB can run an astonishing 70-billion-parameter models.

Llama-2 is currently the focus of much study. This open-source large language model from Meta is freely available for research and commercial use. Models based on Llama-2 were also trained using techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), and some are even claimed to outperform GPT-4 on certain benchmarks.
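
To make the 4-bit quantization concrete, here is a minimal sketch of loading a Llama-2 7B chat model with Hugging Face transformers and bitsandbytes. This is a generic example rather than what text-generation-webui does internally (the web UI has its own loaders), and the model is gated, requiring Meta’s license approval on Hugging Face:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated model; requires license approval
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights cut the ~13 GB fp16 footprint to roughly 3.5 GB.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

inputs = tokenizer("What makes edge AI useful?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))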

text-generation-webui not only ships with extensions but also makes it easy to develop your own. As the llamaspeak example below shows, the interface can be used to integrate your own applications, and it supports multimodal VLMs such as Llava for image-based chat.

Figure 4. Response to Image Queries by Quantized Llava-13B VLM

For more information on running text-generation-webui, see the Jetson Generative AI Lab tutorial: https://www.jetson-ai-lab.com/tutorial_text-generation.html

llamaspeak

Figure 5. Llamaspeak Voice Conversation with LLM Using Riva ASR/TTS

Llamaspeak is an interactive chat application that allows voice conversations with locally running LLMs through real-time NVIDIA Riva ASR/TTS. Llamaspeak has now become part of jetson-containers.

To achieve smooth and seamless voice conversations, it is essential to minimize the time to the LLM’s first output token. llamaspeak not only minimizes this latency but also handles conversation interruptions, so you can start speaking while llamaspeak is still playing the TTS for the generated response. The Riva server, the LLM, and the chat server all run as container microservices.

Figure 6. Streaming ASR/LLM/TTS pipeline with real-time conversation control flow to the web client
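
To make the pipeline in Figure 6 concrete, here is a toy sketch of the core streaming idea: flush the LLM’s token stream to TTS at sentence boundaries, so speech begins long before the full response has been generated. The fake_llm and fake_tts functions are placeholders for illustration only, not llamaspeak’s or Riva’s APIs:

import re

def fake_llm(prompt):
    # Placeholder: yields tokens one at a time, like a streaming LLM server.
    for token in "The forecast is sunny. Highs near 25 C. Enjoy!".split():
        yield token + " "

def fake_tts(text):
    # Placeholder for a TTS request; a real pipeline would stream audio out.
    print(f"[TTS] {text.strip()}")

def speak_streaming(prompt):
    buffer = ""
    for token in fake_llm(prompt):
        buffer += token
        # Flush at sentence boundaries so speech starts while generation continues.
        while (match := re.search(r"[.!?]\s", buffer)):
            fake_tts(buffer[:match.end()])
            buffer = buffer[match.end():]
    if buffer.strip():
        fake_tts(buffer)

speak_streaming("What's the weather?")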

Llamaspeak features a responsive interface that can stream low-latency audio from a browser microphone or a microphone connected to the Jetson device. For more information on running it yourself, see the jetson-containers documentation: https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/llamaspeak

NanoOWL

Figure 7. NanoOWL Performing Real-time Object Detection

Open World Localization with Vision Transformers (OWL-ViT) is an open-vocabulary detection method developed by Google Research. The model lets you perform object detection simply by providing text prompts describing the target objects.

For instance, to detect people and cars, provide a text prompt describing the categories:

prompt = "a person, a car"

This detection approach is highly useful because it enables rapid development of new applications without training new models. To unlock edge applications, our team developed NanoOWL, a project that optimizes this model with NVIDIA TensorRT to achieve real-time performance on the NVIDIA Jetson Orin platform (image encoding at roughly 95 FPS on Jetson AGX Orin). This means you can run OWL-ViT at frame rates far exceeding those of ordinary cameras.
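
To illustrate the prompt-driven workflow, here is a minimal sketch using the baseline OWL-ViT implementation from Hugging Face transformers. NanoOWL accelerates this same model with TensorRT and provides its own API; the image path below is a placeholder:

import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg")  # placeholder image path
texts = [["a person", "a car"]]   # the text prompt defines the classes

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes, scores, and labels in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for box, score, label in zip(detections["boxes"], detections["scores"], detections["labels"]):
    print(texts[0][int(label)], f"{score:.2f}", box.tolist())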

The project also includes a new tree prediction pipeline that combines the accelerated OWL-ViT model with CLIP, enabling zero-shot detection and classification at any level. For example, to detect faces and classify each as happy or sad, use the following prompt:

prompt = "[a face (happy, sad)]"

To first detect faces and then detect facial features within each detected region, use the following prompt:

prompt = "[a face [an eye, a nose, a mouth]]"

Combining both:

prompt = "[a face (happy, sad)[an eye, a nose, a mouth]]"

The possible combinations are endless. The model may be more accurate for some objects and classes than others, but since development is so quick, you can rapidly try different combinations and determine their applicability. We look forward to seeing the amazing applications you develop!

Segment Anything Model

Figure 8. Jupyter Notebook for Segment Anything Model (SAM)

Meta has released the Segment Anything Model (SAM), an advanced image segmentation model capable of accurately identifying and segmenting objects in images, regardless of their complexity or context.

The official repository includes a Jupyter notebook for conveniently exploring the model’s capabilities, and jetson-containers provides a convenient container with JupyterLab built in.
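
For reference, here is a minimal sketch of SAM’s point-prompt interface from Meta’s segment-anything package. The checkpoint must be downloaded from the official repository; the image path and click coordinates are placeholders:

import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H checkpoint (downloaded separately from the official repo).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One foreground click (label 1) at pixel (500, 375); SAM returns candidate
# masks along with confidence scores.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)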

NanoSAM

Figure 9. Real-time Tracking and Segmentation of a Computer Mouse with NanoSAM

Segment Anything (SAM) is a magical model that can turn points into segmentation masks. Unfortunately, it does not run in real time, which limits its use in edge applications.

To overcome this limitation, we recently released NanoSAM, a new project that distills the SAM image encoder into a lightweight model. We also optimized this model with NVIDIA TensorRT to achieve real-time performance on the NVIDIA Jetson Orin platform. You can now easily turn existing bounding-box or keypoint detectors into instance segmentation models without any additional training.
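
Usage mirrors SAM’s point-to-mask interface. The sketch below is paraphrased from the NanoSAM README; the Predictor import, engine file names, and exact predict signature are assumptions worth verifying against the repository:

import numpy as np
import PIL.Image
from nanosam.utils.predictor import Predictor

# TensorRT engines built during NanoSAM setup (paths assumed from the README).
predictor = Predictor(
    image_encoder="data/resnet18_image_encoder.engine",
    mask_decoder="data/mobile_sam_mask_decoder.engine",
)

image = PIL.Image.open("dog.jpg")  # placeholder image path
predictor.set_image(image)

# One foreground point (label 1); the distilled encoder runs in real time.
mask, _, _ = predictor.predict(np.array([[320, 240]]), np.array([1]))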

Track Anything Model

As the team’s paper (https://arxiv.org/abs/2304.11968) describes, the Track Anything Model (TAM) is “the combination of Segment Anything with video.” In its Gradio-based open-source interface, you can click on a frame of the input video to specify anything to be tracked and segmented. TAM even has the additional ability to remove tracked objects through image inpainting.

Figure 10. Track Anything Interface

NanoDB

Video 2. Hello AI World – Real-time Multimodal VectorDB on NVIDIA Jetson

NanoDB is a local multimodal vector database. In addition to efficiently indexing and searching data at the edge, vector databases like this are often used in conjunction with LLMs for retrieval-augmented generation (RAG), providing long-term memory beyond the models’ built-in context length (4,096 tokens for Llama-2 models). Visual language models also use the same embeddings as input.
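
As a toy illustration of the multimodal search a vector database like NanoDB serves, the sketch below embeds a few images and a text query into the same CLIP space and ranks the images by cosine similarity. File names are placeholders; NanoDB adds CUDA-accelerated indexing and storage on top of this idea:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["frame_001.jpg", "frame_002.jpg"]  # placeholder image files
images = [Image.open(p) for p in paths]

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_emb = model.get_text_features(
        **processor(text=["a dog playing outside"], return_tensors="pt", padding=True)
    )

# Normalize, then rank images by cosine similarity to the text query.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(1)
print(paths[int(scores.argmax())], scores.tolist())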

Figure 11. Architecture Diagram Centered on LLM/VLM

With all this real-time data from the edge and the ability to understand it, AI applications become intelligent agents capable of interacting with the real world. If you want to try NanoDB on your own images and datasets, see the lab tutorial for more information: https://www.jetson-ai-lab.com/tutorial_nanodb.html

Conclusion

As you can see, exciting generative AI applications are emerging, and you can easily run and experience them on Jetson Orin by following the tutorials. To witness the amazing capabilities of generative AI running locally, visit the Jetson Generative AI Lab: https://www.jetson-ai-lab.com/

If you have created your own generative AI applications on Jetson and want to share your work, be sure to showcase your creations on the Jetson Projects Forum: https://forums.developer.nvidia.com/c/agx-autonomous-machines/jetson-embedded-systems/jetson-projects/78

Join us for a webinar on November 8, 2023, from 1-2 AM Beijing time, to delve deeper into several topics discussed in this article and ask questions live!

In this webinar, you will learn about:

  • Performance characteristics and quantization methods of open-source LLM APIs

  • Accelerating open-vocabulary vision transformers such as CLIP, OWL-ViT, and SAM

  • Multimodal visual agents, vector databases, and retrieval-augmented generation

  • Real-time multilingual translation and conversation through NVIDIA Riva ASR/NMT/TTS

GTC 2024 will be held from March 18 to 21, 2024, at the San Jose Convention Center in California, USA, and will also be available online. Register now to attend GTC.
