IBM Releases Granite 4.0 Nano Series Small AI Models, NVIDIA Open Sources OmniVinci Multimodal Understanding Model

1. IBM Releases Granite 4.0 Nano Series Small AI Models

IBM has released four Granite 4.0 Nano models ranging from about 350 million to 1.5 billion parameters, small enough to run on a standard laptop or even locally in a browser. The models are open-sourced under the Apache 2.0 license for commercial and research use and are covered by IBM's ISO 42001 certification. According to IBM, they outperform comparably sized competing models on benchmarks.

Article:

https://huggingface.co/blog/ibm-granite/granite-4-nano

Hugging Face:

https://huggingface.co/collections/ibm-granite/granite-40-nano-language-models

2. NVIDIA Open Sources OmniVinci Multimodal Understanding Model

The NVIDIA research team has released and open-sourced OmniVinci, a multimodal understanding model that integrates visual, audio, and text information. With a new architecture and a two-stage training method, it is reported to surpass current top models on multimodal understanding benchmarks, leading by 19.05 points on the DailyOmni cross-modal benchmark, while using only 0.2 trillion training tokens (1/6 of the competitor’s 1.2 trillion), a significant gain in data efficiency.

GitHub:

https://github.com/NVlabs/OmniVinci

3. Chinese AI Startup MiniMax Launches Hailuo 2.3 Model on Replicate Platform

Hailuo 2.3, the latest video generation model from the Chinese AI startup MiniMax, has officially launched on the Replicate platform. The model retains the NCR architecture of its predecessor, which is credited with a 2.5 times improvement in training efficiency, and it generates videos from text and image inputs with native 10-second 1080p output. It is reported to perform strongly in global video generation benchmarks, surpassing Google’s Veo3. Highlights include refined human physics and motion simulation, cinematic visual effects, and improved detail clarity, consistency, and prompt adherence. It also supports multilingual prompts, making it suitable for film production, advertising, digital entertainment, and mobile content creation.

Experience Link:

https://replicate.com/minimax/hailuo-02

4. Thinking Machines Introduces “On-Policy Distillation”

The AI team at Thinking Machines, led by Kevin Lu (who previously led key projects at OpenAI), has released the “On-Policy Distillation” training method, which combines reinforcement learning and supervised learning and allows small models to achieve training efficiency improvements of 50 to 100 times on specific tasks. For example, in mathematical reasoning tasks an 8B model can approach the performance of a 32B model using only 1/7 to 1/10 of the steps required by conventional reinforcement learning. The method also mitigates the “catastrophic forgetting” problem. It has been highlighted by Mira Murati, the former OpenAI CTO who founded Thinking Machines, and the team believes it will drive the commercial viability of “small and specialized” models, shifting the AI industry from a focus on “large models only” to “efficient intelligence”.

Article:

https://thinkingmachines.ai/blog/on-policy-distillation/
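
As a rough illustration of how the method in item 4 can combine the two training styles: the student samples its own output, as in reinforcement learning, and every sampled token is then scored against the teacher's distribution, which gives a dense signal as in supervised distillation. The sketch below assumes a per-token reverse KL divergence as that score. The function names and the toy distributions are hypothetical and are not taken from the Thinking Machines code.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for the next-token distribution at one position."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def on_policy_distillation_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL over a sequence that the student itself sampled.

    Each argument is a list with one probability distribution per position.
    In a real setup both would come from model forward passes (assumption).
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Toy distributions over a 3-token vocabulary for a 2-token sampled sequence.
student = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
teacher_agrees = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
teacher_disagrees = [[0.1, 0.2, 0.7], [0.3, 0.4, 0.3]]

loss_agree = on_policy_distillation_loss(student, teacher_agrees)
loss_disagree = on_policy_distillation_loss(student, teacher_disagrees)
```

The loss is zero when the teacher assigns the same probabilities as the student and grows as the two diverge, so minimizing it pushes the student toward the teacher on exactly the sequences the student tends to produce.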

5. The University of Hong Kong and StepFun Team Propose Visual Tokenizer VFMTok

The University of Hong Kong and the StepFun team have proposed VFMTok, which uses frozen pre-trained visual foundation models (such as CLIP and DINOv2) to construct a visual tokenizer, overcoming limitations of the traditional VQGAN. Through multi-layer feature extraction and region-adaptive quantization, VFMTok achieves high-quality image reconstruction and generation with only 256 tokens. It speeds up training convergence by 3 times and inference by 4 times, and it maintains generation quality without classifier-free guidance (CFG).

Paper:

https://arxiv.org/pdf/2507.08441

Hugging Face:

https://huggingface.co/papers/2507.08441

GitHub:

https://github.com/CVMI-Lab/VFMTok

https://github.com/CVMI-Lab/VFMTok-RAR

6. Uni-Instruct: Unified Framework for Single-Step Diffusion Models Achieves ImageNet Generation FID Breakthrough of 1.0

A team from Peking University and Xiaohongshu has proposed Uni-Instruct, which unifies more than ten single-step diffusion model distillation methods (such as Diff-Instruct and SIM) through an f-divergence diffusion expansion theorem, merging the KL divergence and score divergence routes into a single weighted loss function. With single-step generation, the framework achieves an FID of 1.02 on ImageNet-64, surpassing multi-step diffusion models, and it reaches SOTA performance on CIFAR-10 and in text-to-3D generation tasks.

Paper:

https://arxiv.org/abs/2505.20755v4

GitHub:

https://github.com/a-little-hoof/Uni_Instruct

7. SVG Diffusion Model: Achieving Training and Generation Efficiency Improvements by Dozens of Times without VAE Architecture

Teams from Tsinghua University and Kuaishou have proposed the SVG model, which combines a dual-path design (a DINOv3 semantic branch plus a residual detail encoder) with a distribution alignment mechanism, avoiding the semantic entanglement issues of VAE latents. On ImageNet, the method achieves a 62 times improvement in training efficiency (reaching an FID of 1.92) and a 35 times increase in generation speed. Its feature space can also be transferred directly to downstream tasks such as classification and segmentation, making it usable across multiple tasks.

Paper:

https://arxiv.org/abs/2510.15301

GitHub:

https://github.com/shiml20/SVG

8. CapRL: 3B Small Model Achieves Image Description Performance Comparable to 72B Model

Researchers have proposed CapRL, a reinforcement learning framework for image captioning. Its reward signal is the accuracy with which a text-only language model answers multiple-choice visual questions when it sees only the generated caption, which addresses the subjectivity of reward design for image description tasks. The 3B parameter model trained with this method achieves description quality comparable to Qwen2.5-VL-72B, and the CapRL-5M dataset generated with it significantly improves LVLM performance across 12 benchmarks.

Paper:

https://arxiv.org/abs/2509.22647

GitHub:

https://github.com/InternLM/CapRL

Hugging Face:

https://huggingface.co/internlm/CapRL-3B

https://huggingface.co/datasets/internlm/CapRL-2M
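
The reward idea in item 8 can be sketched in a few lines: a caption is scored by how many multiple-choice questions about the image can be answered correctly from the caption text alone. This is a minimal sketch under stated assumptions. `caption_reward` and `keyword_answerer` are hypothetical names, and the keyword matcher is only a toy stand-in for the text-only language model that the real framework would call.

```python
from typing import Callable, Sequence, Tuple

# (question text, answer options, index of the correct option)
Question = Tuple[str, Sequence[str], int]

def caption_reward(
    caption: str,
    questions: Sequence[Question],
    answer_fn: Callable[[str, str, Sequence[str]], int],
) -> float:
    """Fraction of questions answered correctly using only the caption."""
    if not questions:
        return 0.0
    correct = sum(
        1
        for text, options, gold in questions
        if answer_fn(caption, text, options) == gold
    )
    return correct / len(questions)

def keyword_answerer(caption: str, question: str, options: Sequence[str]) -> int:
    """Toy stand-in for the language model: pick the first option that
    appears in the caption, or -1 (no answer) if none does."""
    lowered = caption.lower()
    for index, option in enumerate(options):
        if option.lower() in lowered:
            return index
    return -1

questions = [
    ("What animal is shown?", ["cat", "dog"], 1),
    ("What color is the ball?", ["red", "blue"], 0),
]

detailed_reward = caption_reward("A dog chases a red ball.", questions, keyword_answerer)
vague_reward = caption_reward("An animal plays outside.", questions, keyword_answerer)
```

A detailed caption lets the answerer get every question right, while a vague caption leaves it unable to answer, so the reward is objective and favors captions that carry the information in the image.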

9. MTI: Improving LLM Reasoning Performance by Intervening on a Few High-Entropy Tokens

A team from the Hong Kong University of Science and Technology has proposed Minimal Test-Time Intervention (MTI), based on the finding that LLM reasoning errors are primarily caused by a small number of high-entropy tokens. By applying classifier-free guidance (CFG) selectively together with a lightweight negative-prompt technique, the method intervenes on only 3.5% of tokens, those with high entropy, and achieves an average improvement of 1.58 points on tasks such as mathematics and coding. It requires no training and is compatible with existing inference frameworks.

Paper:

https://arxiv.org/abs/2510.13940

GitHub:

https://github.com/EnVision-Research/MTI
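
The selective intervention in item 9 can be illustrated as follows: guidance is applied only at decoding steps where the model's next-token distribution has high entropy, and all other steps are left untouched. This is a minimal sketch, assuming the standard CFG combination of conditional and negative-prompt logits. The threshold and guidance scale are hypothetical values chosen for the example, and the function names are not taken from the MTI repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def selective_cfg(cond_logits, neg_logits, threshold=1.0, scale=1.5):
    """Apply CFG only when the conditional distribution is uncertain.

    Returns (logits, intervened). `threshold` and `scale` are illustrative
    assumptions, not values from the paper.
    """
    if entropy(softmax(cond_logits)) <= threshold:
        return list(cond_logits), False  # confident step: leave unchanged
    guided = [n + scale * (c - n) for c, n in zip(cond_logits, neg_logits)]
    return guided, True

# A confident step (one dominant token) and an uncertain step (uniform logits).
confident_out, confident_hit = selective_cfg([10.0, 0.0, 0.0], [0.0, 0.0, 0.0])
uncertain_out, uncertain_hit = selective_cfg([1.0, 1.0, 1.0], [0.0, 1.0, 1.0])
```

Because most steps fall below the entropy threshold, the extra negative-prompt computation is only needed at a small fraction of positions, which is what keeps the intervention cheap.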

10. GAR Region Understanding Model: New Breakthrough in Fine-Grained Visual Description Beyond NVIDIA DAM

A team from the Chinese Academy of Sciences and ByteDance has proposed the GAR model, which uses RoI-aligned feature replay to balance local detail with global context, and excels at tasks such as region description and multi-object relationship reasoning. The 1B parameter model surpasses large models such as InternVL3-78B on multiple GAR-Bench tests and shows zero-shot transfer to video.

Paper:

https://huggingface.co/papers/2510.18876

GitHub:

https://github.com/Haochen-Wang409/Grasp-Any-Region
