
VAGEN: Training VLM agents with multi-turn reinforcement learning

Reinforcing Visual State Reasoning for Multi-Turn VLM Agents

Kangrui Wang*, Pingyue Zhang*, Zihan Wang*, Yaning Gao*, Linjie Li*, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, Manling Li

(* equal contribution)

Paper | Documentation | Blog | Experiment Log | Website




[Environment previews: FrozenLake · Navigation · Sokoban · ManiSkill · SVG]

This repository contains the official implementation of our paper, "Reinforcing Visual State Reasoning for Multi-Turn VLM Agents".

We introduce VAGEN, a multi-turn reinforcement learning framework designed specifically for training vision-language model (VLM) agents. Built upon this framework, we propose Visual Reasoning RL, a novel reinforcement learning approach that significantly improves the multi-turn performance of VLMs by explicitly supervising their visual state reasoning process.


News

[2025/05] We are excited to release our paper, "Reinforcing Visual State Reasoning for Multi-Turn VLM Agents", introducing the Visual Reasoning RL method!

[2025/04] We've introduced a new modular design for environments and services in VAGEN:

  • Enhanced environment framework for easier creation of custom environments
  • New service architecture for efficient distributed training
  • Check out our new guides: Creating Environments and Creating Service

[2025/03] We release VAGEN, a multi-turn reinforcement learning framework for training VLM Agents!

Why Visual Reasoning RL?

Standard RL methods applied to VLMs struggle with multi-turn agentic tasks due to:

  1. Visual State Ambiguity: VLMs lack mechanisms to explicitly interpret and track evolving visual environments.
  2. Precision Bottlenecks: Existing representations fall short in tasks requiring fine-grained spatial or temporal understanding.

Our approach, Visual Reasoning RL, addresses these challenges through:

Boost #1: Visual Reasoning

  1. Reasoning Prompts: Injects structured prompts for grounding (describing the current visual state) and world modeling (predicting the future visual state) to scaffold the model's internal reasoning (a sketch of this format follows this list).
  2. Reasoning Rewards: Uses an LLM-as-Judge to reward the agent when its predicted or described visual state matches the ground truth.
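
As a concrete illustration, here is a minimal sketch of such a structured reasoning format in Python. The tag names (<observation>, <prediction>, <answer>) and the template wording are illustrative assumptions; the actual prompts live with the environment definitions (see vagen/env/) and may differ.

import re

# Hypothetical template for the Grounding-WorldModeling strategy.
GROUNDING_WORLDMODELING_TEMPLATE = """\
You are an agent in a visual environment.
First, describe the current visual state inside <observation>...</observation>.
Then, predict the visual state after your action inside <prediction>...</prediction>.
Finally, give your action inside <answer>...</answer>.
"""

def parse_response(text):
    """Extract the grounding, world-modeling, and action spans from a model response."""
    fields = {}
    for tag in ("observation", "prediction", "answer"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[tag] = m.group(1).strip() if m else None
    return fields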

Boost #2: Turn-level

  1. Turn-level reasoning rewards that supervise the accuracy of each turn's visual state reasoning (a judge sketch follows this list).
  2. Bi-Level GAE for fine-grained credit assignment at both the turn and token level.
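
To make the reasoning reward concrete, below is a minimal sketch of an LLM-as-Judge turn reward, assuming an OpenAI-style chat API (the training scripts expect OPENAI_API_KEY, as noted under Usage). The judge prompt, model choice, and binary scoring are illustrative assumptions rather than VAGEN's exact rubric.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def reasoning_reward(predicted_state, ground_truth_state):
    """Return 1.0 when the judge deems the agent's visual state correct, else 0.0."""
    prompt = (
        "Ground-truth visual state:\n" + ground_truth_state
        + "\n\nAgent's predicted visual state:\n" + predicted_state
        + "\n\nDo these describe the same state? Answer YES or NO."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{"role": "user", "content": prompt}],
    )
    answer = reply.choices[0].message.content.strip().upper()
    return 1.0 if answer.startswith("YES") else 0.0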


Key Innovations of VAGEN

Two key innovations are introduced in VAGEN to support methods like Visual Reasoning RL:

  1. Selective Token Masking - Focuses optimization on action-critical tokens through:

    • Loss masking (M^loss): Identifies tokens to update during policy optimization
    • Advantage masking (M^adv): Determines tokens to include in advantage calculations
  2. Cross-turn Credit Assignment - Enables more effective credit attribution (sketched after this list) through:

    • Bi-level advantage estimation with separate discount factors for cross-turn (γ_turn) and within-turn (γ_token) calculations
    • Turn-level rewards applied at each interaction boundary
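
A minimal sketch of the bi-level estimator, assuming per-turn rewards and per-token critic values have already been collected. The names gamma_turn, gamma_token, and lam mirror the notation above; the function shape is an illustrative assumption, not the exact implementation in the repository.

import numpy as np

def bi_level_gae(turn_rewards, turn_token_values, gamma_turn=0.95, gamma_token=1.0, lam=1.0):
    """Return per-token advantages for every turn of a trajectory."""
    num_turns = len(turn_rewards)

    # Level 1: turn-level GAE over per-turn rewards, using each turn's
    # final token value as that turn's state value.
    turn_values = np.array([v[-1] for v in turn_token_values])
    turn_adv = np.zeros(num_turns)
    gae = 0.0
    for t in reversed(range(num_turns)):
        next_value = turn_values[t + 1] if t + 1 < num_turns else 0.0
        delta = turn_rewards[t] + gamma_turn * next_value - turn_values[t]
        gae = delta + gamma_turn * lam * gae
        turn_adv[t] = gae

    # Level 2: token-level GAE within each turn. The turn-level advantage
    # enters as a terminal "reward" so credit flows back to the tokens
    # that produced the turn's action.
    all_adv = []
    for t in range(num_turns):
        values = turn_token_values[t]
        n = len(values)
        adv = np.zeros(n)
        gae = 0.0
        for i in reversed(range(n)):
            reward = turn_adv[t] if i == n - 1 else 0.0
            next_value = values[i + 1] if i + 1 < n else 0.0
            delta = reward + gamma_token * next_value - values[i]
            gae = delta + gamma_token * lam * gae
            adv[i] = gae
        all_adv.append(adv)
    return all_adv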

Why VAGEN Works Better for VLM Agents

Traditional RL frameworks for LLM agents treat all tokens in a trajectory equally. This approach is suboptimal for VLM agents due to:

  • Distribution Shift: Most VLMs aren't pretrained to generate image tokens
  • State Redundancy: Visual tasks contain excessive low-level information in long-context inputs

VAGEN addresses these challenges by focusing optimization on the most critical decision-making tokens and creating a more nuanced reward structure across interaction turns.

The VAGEN Workflow

We present the workflow of VAGEN in the image below. The rollout.py module mediates the interaction between ray_trainer.py and the various environments. The framework operates with two forms of "language": token sequences (used by the model) and structured information (used by the environments). rollout.py serves as a translator, rendering structured environment data into tokens for the model and parsing model outputs back into structured actions or observations, while recording the data from each step to assemble the full trajectory.
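
In code, the loop looks roughly like the sketch below, assuming a Gym-style environment and a generate() method on the policy. obs_to_tokens and parse_action are hypothetical helpers standing in for the translation rollout.py performs; the real interfaces differ in detail.

def rollout(env, policy, max_turns):
    """Collect one multi-turn trajectory by translating between env and model."""
    trajectory = []
    obs = env.reset()                       # structured observation from the environment
    for _ in range(max_turns):
        prompt = obs_to_tokens(obs)         # env "language" -> model tokens (hypothetical helper)
        response = policy.generate(prompt)  # model emits reasoning + action tokens
        action = parse_action(response)     # model tokens -> structured action (hypothetical helper)
        obs, reward, done, info = env.step(action)
        trajectory.append((prompt, response, action, reward))
        if done:
            break
    return trajectory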

Installation

# Create a new conda environment
conda create -n vagen python=3.10 -y
conda activate vagen

# verl
git clone https://github.com/JamesKrW/verl.git
cd verl
pip install -e .
cd ../

# vagen
git clone https://github.com/RAGEN-AI/VAGEN.git
cd VAGEN
bash scripts/install.sh
# This script installs dependencies for FrozenLake and Sokoban; for other environments, see vagen/env/README.md

Usage

# Login to wandb
wandb login

# You can run different environments and algorithms:
bash scripts/examples/masked_grpo/frozenlake/run_tmux.sh
bash scripts/examples/masked_turn_ppo/frozenlake/run_tmux.sh
bash scripts/examples/vagen_base/sokoban/run_tmux.sh

# Use the Visual Reasoning Reward (requires OPENAI_API_KEY set in your environment)
bash scripts/examples/vagen_full/sokoban/run_tmux.sh

How to Add New Environments and Services

See our Creating Environments guide. You may also want to check our Creating Service guide for scaling your environments.

How to Add New Model

  1. Refer to VERL for adding a new MLLM.
  2. Refer to QwenVLRolloutManager to understand how rollout works. In most cases, you can use QwenVLRolloutManager directly with only minor modifications to the model's special tokens, along the lines of the sketch below.
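
As a rough, hypothetical sketch of what such a modification might look like (the import path, attribute names, and token strings below are all placeholders; consult the actual QwenVLRolloutManager source for its real extension points):

# All names below are placeholders; check the real class for its actual hooks.
from vagen.rollout import QwenVLRolloutManager  # hypothetical import path

class MyVLMRolloutManager(QwenVLRolloutManager):
    # Override the special tokens that delimit images and turns for the new model.
    IMAGE_TOKEN = "<image>"        # placeholder image token
    TURN_START = "<|im_start|>"    # placeholder turn delimiter
    TURN_END = "<|im_end|>"        # placeholder turn delimiter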

Experimental Results

We benchmark closed- and open-source models on five environments. Reasoning over visual states, covering both grounding and world modeling, improves performance.


Incorporating Visual Reasoning RL leads to improved performance.

  • VAGEN-Base uses the Grounding-WorldModeling reasoning strategy along with format and task-specific rewards.
  • VAGEN-Full builds on this and additionally incorporates Visual Reasoning RL.

Environments

Note: VAGEN currently supports several environments: sokoban, frozenlake, svg, navigation, and primitive skill.


Acknowledgement

We thank RAGEN for its innovative exploration in multi-turn reinforcement learning for LLM agents. We thank verl for its RL framework. We thank EasyR1 for adding initial support for VLMs to verl.

References

RAGEN: Training Agents by Reinforcing Reasoning

verl: Volcano Engine Reinforcement Learning for LLM

ArCHer: Hierarchical Multi-Turn RL Agent Training Framework

Search-R1: Train your LLMs to reason and call a search engine with reinforcement learning

Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

OpenManus-RL: A live-streamed development of RL tuning for LLM agents

Citation

If you find our framework and paper useful, please cite our work:

@misc{wang2025vagen,
  title={Reinforcing Visual State Reasoning for Multi-Turn VLM Agents},
  author={Kangrui Wang* and Pingyue Zhang* and Zihan Wang* and Yaning Gao* and Linjie Li* and Qineng Wang and Hanyang Chen and Chi Wan and Yiping Lu and Zhengyuan Yang and Lijuan Wang and Ranjay Krishna and Jiajun Wu and Li Fei-Fei and Yejin Choi and Manling Li},
  year={2025},
  url={https://github.com/RAGEN-AI/VAGEN}
}
