Reinforcing Visual State Reasoning for Multi-Turn VLM Agents
Kangrui Wang*, Pingyue Zhang*, Zihan Wang*, Yaning Gao*, Linjie Li*, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, Manling Li
(* equal contribution)
This repository contains the official implementation of our paper, "Reinforcing Visual State Reasoning for Multi-Turn VLM Agents".
We introduce VAGEN, a multi-turn reinforcement learning framework designed specifically for training vision-language model (VLM) agents. Built upon this framework, we propose Visual Reasoning RL, a novel reinforcement learning approach that significantly improves the multi-turn performance of VLMs by explicitly supervising their visual state reasoning process.
[2025/05] We are excited to release our paper, "Reinforcing Visual State Reasoning for Multi-Turn VLM Agents", introducing the Visual Reasoning RL method!
[2025/04] We've introduced a new modular design for environments and services in VAGEN:
- Enhanced environment framework for easier creation of custom environments
- New service architecture for efficient distributed training
- Check out our new guides:
- Creating Environments: new environment protocol.
- Creating Services: we now support hosting environments in a separate process.
[2025/03] We release VAGEN, a multi-turn reinforcement learning framework for training VLM Agents!
Standard RL methods applied to VLMs struggle with multi-turn agentic tasks due to:
- Visual State Ambiguity: VLMs lack mechanisms to explicitly interpret and track evolving visual environments.
- Precision Bottlenecks: Existing representations fall short in tasks requiring fine-grained spatial or temporal understanding.
Our approach, Visual Reasoning RL, addresses these challenges through:
- Reasoning Prompts: inject structured prompts for grounding (describing the current visual state) and world modeling (predicting the future visual state) to scaffold the model's internal reasoning (see the response-format sketch below).
- Reasoning Rewards: an LLM-as-Judge rewards the agent when its described or predicted visual state matches the ground truth, providing turn-level supervision of reasoning accuracy.
- Bi-Level GAE: fine-grained credit assignment at both the turn and token levels.
Figure: Boost #1 (Visual Reasoning) and Boost #2 (Turn-level rewards).
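To make the reasoning scaffold concrete, here is a minimal sketch of what a structured per-turn response and its parser could look like. The tag names (`<observation>`, `<prediction>`, `<answer>`) and the parser are illustrative assumptions for exposition, not the exact schema used in this repository; the environment prompt templates are authoritative.

```python
import re

# Illustrative Grounding + WorldModeling response for one Sokoban turn.
# The tag names below are assumptions, not the repository's exact format.
EXAMPLE_RESPONSE = """<think>
<observation>The player is at (2, 1); the box is at (2, 2); the target is at (2, 3).</observation>
<prediction>After pushing right, the box will sit on the target at (2, 3).</prediction>
</think>
<answer>Right</answer>"""

def parse_structured_response(text: str) -> dict:
    """Extract the grounded state, predicted state, and action from one turn."""
    def grab(tag):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else None
    return {
        "grounding": grab("observation"),      # current visual state description
        "world_modeling": grab("prediction"),  # predicted next visual state
        "action": grab("answer"),              # executable action for the environment
    }

print(parse_structured_response(EXAMPLE_RESPONSE))
```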
Two key innovations are introduced in VAGEN to support methods like Visual Reasoning RL:
- Selective Token Masking - Focuses optimization on action-critical tokens (see the masking sketch below) through:
  - Loss masking (`M^loss`): Identifies tokens to update during policy optimization
  - Advantage masking (`M^adv`): Determines tokens to include in advantage calculations
- Cross-turn Credit Assignment - Enables more effective credit attribution through:
  - Bi-level advantage estimation with separate discount factors for cross-turn (`γ_turn`) and within-turn (`γ_token`) calculations
  - Turn-level rewards applied at each interaction boundary
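The sketch below shows one way the two masks could enter a clipped PPO objective: `M^adv` restricts which tokens contribute to advantage normalization, and `M^loss` restricts which tokens contribute to the policy loss. Function and variable names are illustrative assumptions, not taken from the repository's code.

```python
import torch

def masked_ppo_loss(logp_new, logp_old, advantages, loss_mask, adv_mask, clip_eps=0.2):
    """Sketch only. All tensors have shape (batch, seq_len); masks are 0/1 floats.
    loss_mask (M^loss): tokens that receive policy-gradient updates.
    adv_mask (M^adv): tokens included in advantage statistics."""
    # Normalize advantages over the tokens selected by M^adv only.
    n = adv_mask.sum().clamp(min=1.0)
    mean = (advantages * adv_mask).sum() / n
    var = (((advantages - mean) ** 2) * adv_mask).sum() / n
    advantages = (advantages - mean) / (var.sqrt() + 1e-8)

    # Standard clipped PPO ratio, computed per token.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token_loss = -torch.min(unclipped, clipped)

    # Average the loss only over tokens selected by M^loss.
    return (per_token_loss * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)
```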
Traditional RL frameworks for LLM agents treat all tokens in a trajectory equally. This approach is suboptimal for VLM agents due to:
- Distribution Shift: Most VLMs aren't pretrained to generate image tokens
- State Redundancy: Visual tasks contain excessive low-level information in long-context inputs
VAGEN addresses these challenges by focusing optimization on the most critical decision-making tokens and creating a more nuanced reward structure across interaction turns.
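As a rough illustration of the bi-level idea, the sketch below first propagates credit across turns with `gamma_turn`, then distributes each turn's credit over its tokens with `gamma_token`, using the turn-level return as the bootstrap target at the turn boundary. The recursions and names are simplified assumptions for exposition, not the repository's implementation.

```python
def turn_level_gae(turn_rewards, turn_values, gamma_turn=0.95, lam=0.95):
    """Cross-turn credit assignment: one reward and one value estimate per turn."""
    advantages, gae = [0.0] * len(turn_rewards), 0.0
    for t in reversed(range(len(turn_rewards))):
        next_value = turn_values[t + 1] if t + 1 < len(turn_values) else 0.0
        delta = turn_rewards[t] + gamma_turn * next_value - turn_values[t]
        gae = delta + gamma_turn * lam * gae
        advantages[t] = gae
    return advantages

def token_level_gae(token_values, turn_advantage, turn_value, gamma_token=1.0, lam=1.0):
    """Within-turn credit assignment: the turn-level return bootstraps the last token."""
    T = len(token_values)
    advantages, gae = [0.0] * T, 0.0
    bootstrap = turn_value + turn_advantage  # turn-level return as the terminal target
    for t in reversed(range(T)):
        next_value = token_values[t + 1] if t + 1 < T else bootstrap
        delta = gamma_token * next_value - token_values[t]  # no intra-turn reward
        gae = delta + gamma_token * lam * gae
        advantages[t] = gae
    return advantages

# Example: a 3-turn episode with per-turn rewards and value estimates.
turn_adv = turn_level_gae([0.0, 0.0, 1.0], [0.2, 0.4, 0.7])
```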
We present the workflow of VAGEN in the image below. The rollout.py
module facilitates interactions between ray_trainer.py
and various environments. Our framework operates with two forms of “language”: token sequences (used by the model) and structured information from the environments. rollout.py
serves as a translator, parsing structured environment data into tokens for the model and converting model outputs back into structured actions or observations. It also records the data from each step to assemble the complete trajectory.
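A conceptual sketch of that loop is shown below; the helper names (`render_prompt`, `parse_action`, `generate_fn`) and the environment interface are placeholders standing in for the actual translation logic in `rollout.py`.

```python
def rollout_episode(env, generate_fn, render_prompt, parse_action, max_turns=10):
    """Collect one trajectory by translating between tokens and structured env data."""
    trajectory = []
    obs = env.reset()                          # structured observation (text and/or image)
    for _ in range(max_turns):
        prompt = render_prompt(obs)            # structured info -> token/prompt sequence
        response = generate_fn(prompt)         # model reasons, then emits an action string
        action = parse_action(response)        # token sequence -> structured action
        obs_next, reward, done, info = env.step(action)
        trajectory.append({"obs": obs, "response": response,
                           "action": action, "reward": reward})
        obs = obs_next
        if done:
            break
    return trajectory
```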
# Create a new conda environment
conda create -n vagen python=3.10 -y
conda activate vagen
# verl
git clone https://github.com/JamesKrW/verl.git
cd verl
pip install -e .
cd ../
# vagen
git clone https://github.com/RAGEN-AI/VAGEN.git
cd VAGEN
bash scripts/install.sh
# This script installs dependencies for Frozenlake and Sokoban; for other environments, please refer to vagen/env/README.md
# Login to wandb
wandb login
# You can run different environments and algorithms:
bash scripts/examples/masked_grpo/frozenlake/run_tmux.sh
bash scripts/examples/masked_turn_ppo/frozenlake/run_tmux.sh
bash scripts/examples/vagen_base/sokoban/run_tmux.sh
# Use Visual Reasoning Reward
# Set OPENAI_API_KEY in your environment first
bash scripts/examples/vagen_full/sokoban/run_tmux.sh
See our Creating Environments guide. You may also want to check our Creating Services guide for scaling your environments.
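As a rough picture of what a custom environment provides, here is a hypothetical sketch of a tiny grid world; the actual base class, method names, and return types are defined by the environment protocol in `vagen/env`, so follow the guide above rather than this sketch.

```python
# Hypothetical environment sketch; not the repository's actual base class or signatures.
class MyGridEnv:
    def reset(self, seed=None):
        """Return the initial observation as structured data (text and/or image)."""
        self.pos = (0, 0)
        return {"obs_str": f"Agent at {self.pos}", "image": None}

    def step(self, action: str):
        """Apply a parsed action and return (observation, reward, done, info)."""
        moves = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}
        dr, dc = moves.get(action, (0, 0))
        self.pos = (self.pos[0] + dr, self.pos[1] + dc)
        done = self.pos == (3, 3)
        reward = 1.0 if done else 0.0
        return {"obs_str": f"Agent at {self.pos}", "image": None}, reward, done, {}
```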
- Refer to VERL for adding new MLLMs.
- Refer to QwenVLRolloutManager to understand how rollout works. In most cases, you can use QwenVLRolloutManager directly with only minor modifications to the model's special tokens.
We benchmark closed- and open-source models on five environments. Reasoning over visual states, including both grounding and world modeling, can improve performance.

Incorporating Visual Reasoning RL leads to improved performance.
- VAGEN-Base uses the Grounding-WorldModeling reasoning strategy along with format and task-specific rewards.
- VAGEN-Full builds on this and additionally incorporates Visual Reasoning RL.

Note: VAGEN currently supports several environments: sokoban, frozenlake, svg, navigation, and primitive skill.

We thank RAGEN for its innovative exploration in multi-turn reinforcement learning for LLM agents. We thank verl for its RL framework. We thank EasyR1 for adding initial support for VLMs to verl.
RAGEN: Training Agents by Reinforcing Reasoning
verl: Volcano Engine Reinforcement Learning for LLM
ArCHer: Hierarchical Multi-Turn RL Agent Training Framework
Search-R1: Train your LLMs to reason and call a search engine with reinforcement learning
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
OpenManus-RL: A live-stream development of RL tuning for LLM agents
If you find our framework and paper useful, we appreciate it if you could cite our work:
@misc{wang2025vagen,
title={Reinforcing Visual State Reasoning for Multi-Turn VLM Agents},
author={Kangrui Wang* and Pingyue Zhang* and Zihan Wang* and Yaning Gao* and Linjie Li* and Qineng Wang and Hanyang Chen and Chi Wan and Yiping Lu and Zhengyuan Yang and Lijuan Wang and Ranjay Krishna and Jiajun Wu and Li Fei-Fei and Yejin Choi and Manling Li},
year={2025},
url={https://github.com/RAGEN-AI/VAGEN}
}