AI Paper Daily | Multimodal Agent Explosion: Tool Reasoning, Early Experience, Breakthroughs in Autonomous Learning

Multimodal Large Models × Agents × New Progress in Model Alignment

Source: AI Algorithm Society · Daily Selection of Cutting-Edge AI Papers

📌 Today’s Summary

Today's selection covers three papers, focusing mainly on:

🧩 Multimodal Large Models (MMFM)

🤖 Agent Foundation Models (Tool/Action Models)

⚙️ Multimodal SFT / Preference Alignment

🧾 Paper List

1. MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning 🔗 https://arxiv.org/abs/2510.08567v1

2. Agent Learning via Early Experience 🔗 https://arxiv.org/abs/2510.08558v1

3. SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models 🔗 https://arxiv.org/abs/2510.08559v1

🧠 Paper 1: MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

👩💻 Authors: Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan

🎯 Focus: Multimodal SFT / Preference Alignment / Multimodal Agent Models

⭐ Rating: 79.8

🔗 Original Text: arXiv 💻 Source Code: GitHub

🔍 Abstract

Vision-language models (VLMs) are increasingly used as controllers that call external tools for complex reasoning and decision-making, but their effectiveness is limited by the scarcity of high-quality multimodal trajectories and the high cost of manual annotation. This paper proposes a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates stepwise preference pairs, and trains a VLM controller for robust tool-use reasoning. The authors construct the M-TRACE dataset (28.5K tasks, 177K validated trajectories), train the MATRIX Agent on it, and then refine the agent's step-level decisions through stepwise preference optimization.

🧩 Method Highlights

  • Two-stage training: supervised fine-tuning on trajectories, followed by stepwise preference optimization
  • Automatically synthesizes and validates multimodal task data, reducing manual annotation costs
  • Uses the ReAct reasoning paradigm and DPO (Direct Preference Optimization)
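
For readers unfamiliar with DPO, here is a minimal sketch of the standard DPO loss applied to a single step-level preference pair. It is not the authors' implementation: the function name `dpo_step_loss`, the `beta` value, and the assumption that both candidate steps follow the same trajectory prefix are illustrative choices.

```python
import math


def dpo_step_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed token log-probability of a candidate step
    (assumed here to be a ReAct-style thought + tool call) given the same
    trajectory prefix, under the trainable policy or the frozen reference.
    """
    # How much more the policy favors the chosen step than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)), computed as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probabilities, for illustration only.
loss_good = dpo_step_loss(-1.0, -5.0, -3.0, -3.0)  # policy prefers chosen step
loss_bad = dpo_step_loss(-5.0, -1.0, -3.0, -3.0)   # policy prefers rejected step
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the preferred step.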

📊 Experimental Results

  • Outperforms open-source baselines and some closed-source baselines on three benchmarks: Agent-X, GTA, and GAIA
  • Agent-X improvements: tool accuracy, fidelity, and semantic accuracy increase by +37%, +9%, and +33%, respectively

💡 Significance

MATRIX demonstrates that the combination of “trajectory supervision + preference optimization” can significantly enhance the robustness of multimodal agents’ tool-use reasoning, and the automated data construction reduces costs, facilitating large-scale training and deployment.

🧠 Paper 2: Agent Learning via Early Experience

👩💻 Authors: Kai Zhang, Xiangchao Chen, Bo Liu, et al.

🎯 Focus: Reinforcement Learning / Metacognitive Reflection / Agent Foundation Models

⭐ Rating: 76.0

🔗 Original Text: arXiv

🔍 Abstract

The long-term goal of language agents is to learn and keep improving from their own experience, eventually surpassing humans on complex real-world tasks. However, reinforcement learning is hard to apply directly, because many environments either lack verifiable rewards or require costly long-horizon interaction. This paper proposes the Early Experience paradigm: the agent generates data through its own interactions and uses the resulting future states as the supervisory signal, so it learns from the consequences of its own actions.

🧩 Core Methods

  • Implicit World Modeling: Utilizing collected states to help the policy internalize environmental dynamics
  • Self-Reflective Learning: Summarizing patterns from suboptimal behaviors and correcting reasoning
  • Does not require external reward signals, relying solely on the agent’s own interaction results to construct supervision
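
To make the idea of "future states as supervision" concrete, here is a minimal sketch of how implicit world modeling data might be assembled. It is an assumption-laden illustration, not the paper's pipeline: `WorldModelExample`, `build_world_model_examples`, the prompt template, and the toy door environment are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class WorldModelExample:
    prompt: str  # the state plus the action the agent tried
    target: str  # the next state the environment returned (no reward needed)


def build_world_model_examples(
    states: Iterable[str],
    propose_actions: Callable[[str], List[str]],
    step: Callable[[str, str], str],
) -> List[WorldModelExample]:
    """Turn the agent's own rollouts into next-state prediction examples."""
    examples = []
    for state in states:
        for action in propose_actions(state):
            next_state = step(state, action)  # observed outcome is the label
            prompt = f"State: {state}\nAction: {action}\nNext state:"
            examples.append(WorldModelExample(prompt=prompt, target=next_state))
    return examples


# Toy stand-in for a real environment (hypothetical dynamics).
TOY_DYNAMICS = {
    ("door closed", "open door"): "door open",
    ("door closed", "knock"): "door closed",
    ("door open", "walk through"): "in next room",
}


def toy_propose(state: str) -> List[str]:
    return [a for (s, a) in TOY_DYNAMICS if s == state]


def toy_step(state: str, action: str) -> str:
    return TOY_DYNAMICS[(state, action)]


examples = build_world_model_examples(
    ["door closed", "door open"], toy_propose, toy_step
)
```

Each example can then be used as an ordinary next-token prediction target, which is why no external reward signal is required.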

📊 Experimental Results

  • Covers eight environments (ALFWorld, ScienceWorld, TravelPlanner, BFCLv3, Tau-Bench, SearchQA, WebShop, WebArena-Lite)
  • Average improvements: Success rate +9.6%, Generalization ability +9.4%
  • In environments with verifiable rewards, using Early Experience to initialize RL yields a further +6.4% improvement

💡 Significance

“Early Experience” serves as a bridge between imitation learning and reinforcement learning: language agents can keep improving from their own behavior with less reliance on expert demonstrations, which points to the potential for human-like learning that grows with experience.

🧠 Paper 3: SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

👩💻 Authors: Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, Xiaohan Wang

🎯 Focus: Multimodal Benchmarking / Scientific Video Understanding

⭐ Rating: 74.0

🔗 Original Text: arXiv

🔍 Abstract

Current multimodal models still struggle with scientific video reasoning. Existing video benchmarks focus on general scenarios with relatively simple tasks, which makes it hard to assess models’ advanced cognitive abilities. This paper proposes SciVideoBench, described as the first rigorous benchmark for reasoning over scientific experiment videos. It covers 25 disciplines with 1,000 multiple-choice questions that test models’ complex visual reasoning in scientific contexts.

🧩 Experiments and Findings

  • Evaluated 21 models (Gemini 2.5 Pro, GPT-4o, InternVL, Qwen2.5-VL, etc.)
  • Gemini 2.5 Pro reaches 64.3% accuracy, while the best open-source model reaches only 38.8%
  • Quantitative reasoning is the most challenging; chain-of-thought prompts can improve proprietary models by about +21%
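
Findings like "quantitative reasoning is the most challenging" come from breaking multiple-choice accuracy down by question type. A minimal sketch of that breakdown follows; the function name `accuracy_by_category`, the category labels other than "quantitative", and the sample records are made up for illustration and are not benchmark data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(
    records: Iterable[Tuple[str, str, str]],
) -> Dict[str, float]:
    """Compute accuracy per question category.

    Each record is (category, predicted_option, gold_option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        correct[category] += int(predicted == gold)
    return {category: correct[category] / total[category] for category in total}


# Made-up records, for illustration only.
sample_records = [
    ("quantitative", "A", "B"),
    ("quantitative", "C", "C"),
    ("conceptual", "D", "D"),
]
scores = accuracy_by_category(sample_records)
hardest = min(scores, key=scores.get)  # category with the lowest accuracy
```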

💡 Significance

SciVideoBench reveals current models’ shortcomings in scientific video reasoning (visual perception errors, logical reasoning flaws, and insufficient domain knowledge) and provides the first research-grade evaluation platform for “AI assisting scientists”.

The above is the selection for 2025-10-09.

Follow the ‘AI Algorithm Society’ to grasp AI frontiers in just 10 minutes daily.
