Exploring Large Models on Jetson: Day 15 – NanoLLM Development Platform (4): Visual Analysis Assistant

Exploring Large Models on Jetson: Introduction

Exploring Large Models on Jetson Day 2: Environment Setup

Exploring Large Models on Jetson Day 3: TGW Intelligent Assistant

Exploring Large Models on Jetson Day 4: SDW Text to Image

Exploring Large Models on Jetson Day 5: Ollama Command Mode Intelligent Assistant

Exploring Large Models on Jetson Day 6: Ollama’s Webui Intelligent Assistant

Exploring Large Models on Jetson Day 7: Executing RAG Functionality with Jetson Copilot

Exploring Large Models on Jetson Day 8: Multi-modal Image Search with NanoDB

Exploring Large Models on Jetson Day 9: Establishing EffectiveViT Testing Environment

Exploring Large Models on Jetson Day 10: OWL-ViT Application

Exploring Large Models on Jetson Day 11: SAM2 Application

Exploring Large Models on Jetson Day 12: NanoLLM Development Platform (1): Python API Interface Introduction

Exploring Large Models on Jetson Day 13: NanoLLM Development Platform (2): Voice Dialogue Assistant

Exploring Large Models on Jetson Day 14: NanoLLM Development Platform (3): Multi-modal Voice Assistant

Exploring Large Models on Jetson: Day 15 - NanoLLM Development Platform (4): Visual Analysis Assistant

In the previous articles we used the NanoLLM development platform to easily build applications such as a voice dialogue assistant and multi-modal recognition. This article goes a step further and uses NanoLLM's video-related APIs, combined with a suitable large language model, to dynamically analyze content coming from video files or cameras.

Here we switch to another multi-modal model, Efficient-Large-Model/VILA1.5-3b, and test it through three APIs. You can compare them and pick the one that best fits your application scenario.

If you want to test using your own USB camera, please plug the camera into the device before entering the container, and then execute the following command to enter the NanoLLM container:

$ jetson-containers run $(autotag nano_llm)

After entering the container, you can execute the following command to check if the camera is detected:

$ ls /dev/video*

If a device node such as /dev/video0 appears, the camera has been detected inside the container.
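
If you also want to confirm that frames can actually be captured, a quick Python check with jetson-utils (which is available inside the nano_llm container) works as well. This is only a minimal sanity-check sketch; the device path and timeout value are assumptions you can adjust:

from jetson_utils import videoSource

camera = videoSource("/dev/video0")    # V4L2 USB camera
img = camera.Capture(timeout=5000)     # grab a single frame (timeout in ms)

if img is not None:
    print(f"Camera OK: {img.width}x{img.height}, format={img.format}")
else:
    print("No frame received - check the camera connection")

camera.Close()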

Now, let's start with the simple nano_llm.chat chat function for image content recognition. It requires two prompts: from the first prompt the system parses the path of the file to be recognized, and the second prompt is the question we want to ask. Let's try it with a Chinese prompt:

$ python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --prompt '/data/images/path.jpg' \
    --prompt 'What information does the image provide?'

Below is the content of /data/images/path.jpg:

[Image: the test photo /data/images/path.jpg]

The system's final response (your output may not be exactly the same):

[Image: terminal screenshot of the model's response]

As you can see, the terminal's ability to display Chinese is incomplete, so some characters come out garbled. Also, the system recognizes the language of our prompt and responds in the same language, which is quite intelligent.

Now let's change the second prompt to "What information is in the picture?" The response is very accurate: "1. A yellow sign with black text saying 'Private Road Prohibited'. 2. A winding road with grass on both sides."

[Image: terminal screenshot of the response to the second prompt]
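
For reference, the same single-image question-and-answer flow can also be reproduced with the Python API introduced in part (1) of this NanoLLM series, instead of the command-line tool. The following is only a minimal sketch of that pattern (model loading, a ChatHistory holding the image and the question, then generation); keyword arguments may differ slightly between NanoLLM versions:

from nano_llm import NanoLLM, ChatHistory

# load the multi-modal model with the MLC backend (same as --api=mlc)
model = NanoLLM.from_pretrained("Efficient-Large-Model/VILA1.5-3b", api="mlc")

chat = ChatHistory(model)
chat.append("user", image="/data/images/path.jpg")               # first prompt: the image
chat.append("user", "What information does the image provide?")  # second prompt: the question

embedding, _ = chat.embed_chat()                                 # build the multi-modal embedding

# generate() streams tokens, so print them as they arrive
reply = model.generate(embedding, kv_cache=chat.kv_cache, max_new_tokens=128)

for token in reply:
    print(token, end="", flush=True)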

If we want to analyze a video, the underlying implementation uses the videoSource and videoOutput interfaces from the project author's earlier jetson-utils library. The input side supports H.264/H.265-encoded MP4/MKV/AVI/FLV files, while the output side supports RTP/RTSP/WebRTC and other network video protocols; both are specified through the "--video-input" and "--video-output" parameters.

Both the nano_llm.vision.video and nano_llm.agents.video_query tools call these interfaces.
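
To get a feel for how these two interfaces handle the formats and protocols listed above, here is a small stand-alone jetson-utils passthrough sketch (no language model involved). The input and output URIs are placeholders, and error handling is kept to a minimum:

from jetson_utils import videoSource, videoOutput

# input: a video file, a V4L2 camera ("/dev/video0"), or an RTSP stream
source = videoSource("/data/sample_720p.mp4")

# output: a file, or a network stream such as "rtp://<IP_OF_TARGET>:1234"
sink = videoOutput("/data/passthrough.mp4")

while True:
    img = source.Capture(timeout=5000)   # returns None on timeout
    if img is None:
        continue
    sink.Render(img)                     # encode/stream/display the frame
    if not source.IsStreaming() or not sink.IsStreaming():
        break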

Now, using the nano_llm.vision.video tool, we will read a video, analyze it according to the prompt, overlay the analysis results onto the frames, and write them to a specified output video, so we can check the results afterwards.

Please enter the following command to try:

$ python3 -m nano_llm.vision.video \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-images 8 \
    --max-new-tokens 48 \
    --video-input /data/sample_720p.mp4 \
    --video-output /data/sample_output.mp4 \
    --prompt 'What changes occurred in the video?'

The input /data/sample_720p.mp4 is a video of traffic and pedestrians shot on a road, and the recognized result is stored in /data/sample_output.mp4. The following image is a screenshot with the prompt response embedded in the video:

[Image: frame from /data/sample_output.mp4 with the prompt response overlaid]

We can also point the "--video-input" parameter to "/dev/video0" to use the camera. However, when outputting to a file, the application has to be terminated with Ctrl-C, which can leave the video file incomplete and unable to open, making it impossible to check the execution results.

The recommended approach is to stream the camera results over the RTP protocol to a specified computer, and then receive them on that computer with gst-launch.

Now we execute the following command inside the Jetson Orin container:

$ python3 -m nano_llm.vision.video \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-images 8 \
    --max-new-tokens 48 \
    --video-input /dev/video0 \
    --video-output rtp://<IP_OF_TARGET>:1234 \
    --prompt 'What are in his hand?'

Once it is running, execute the following command on the target Linux computer:

$ gst-launch-1.0 udpsrc port=1234 \
    caps="application/x-rtp, media=(string)video, clock-rate=(int)90000, encoding-name=(string)H264, payload=(int)96" \
    ! rtph264depay ! decodebin ! videoconvert ! autovideosink

Then, you will see the following display on the Linux computer:

[Image: the camera stream with the analysis results, received on the Linux computer via gst-launch]

This way, we can solve the problem of real-time recognition using the camera.
