Training Reasoning Model with Dynamic Advantage Estimation on Reinforcement Learning

[Hugging Face] · [Notion] · [Wandb (LLM)] · [Wandb (VLM)]

ADORA (Advantage Dynamics via Online Rollout Adaptation) is a reinforcement learning (RL) framework that dynamically adjusts advantage values during training based on the model's rollout distribution. In simple yet effective experiments, it significantly improves long Chain-of-Thought (CoT) reasoning and reflective capabilities in Large Language Models (LLMs) and Vision-Language Models (VLMs).
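
As a rough sketch of the idea (our own notation, not necessarily the exact formulation used in this repo), ADORA-style training can be viewed as rescaling a GRPO-style group-normalized advantage by a per-rollout weight derived from the current rollout distribution:

```latex
% Illustrative notation only: r_i is the reward of rollout o_i within a group of G rollouts,
% and w_i is a user-defined weight computed from rollout statistics (e.g., correctness, length).
\hat{A}_i = w_i \cdot \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\qquad w_i = f\big(\text{rollout statistics of } o_1, \dots, o_G\big)
```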

For LLMs, our ADORA implementation in the Logic-RL framework reaches an AMC score of 40 after only 100 training steps, compared with the original paper's 39 after 1,200 steps, while maintaining comparable AIME performance (8). For VLMs, using only 2K samples from the Geometry3K dataset and starting from Qwen2.5-VL-7B-Instruct, we reach 73.5% accuracy on MathVista with steadily growing response lengths, setting a new state of the art among multimodal reproductions of DeepSeek-R1-Zero.

News

Key Results

ADORA

Implementing ADORA within the Logic-RL experiments achieves scores of 40 on AMC and 8 on AIME, surpassing GRPO's 35 and 6, respectively.

[Figure: adora-figure_00]

All results are reported as pass@1 accuracy.

| Method | Step | 2 | 3 | 4 | 5 | 6 | 7 | 8 | K&K ID | K&K OOD | Avg | AIME | AMC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GRPO | 100 | 0.44 | 0.42 | 0.21 | 0.26 | 0.27 | 0.17 | 0.20 | 0.33 | 0.21 | 0.31 | 6 | 35 |
| GRPO | 200 | 0.39 | 0.23 | 0.22 | 0.18 | 0.21 | 0.20 | 0.20 | 0.26 | 0.20 | 0.23 | 5 | 34 |
| GRPO | 300 | 0.83 | 0.81 | 0.81 | 0.64 | 0.59 | 0.48 | 0.46 | 0.77 | 0.51 | 0.66 | 5 | 34 |
| GRPO | 350 | 0.74 | 0.78 | 0.78 | 0.70 | 0.61 | 0.47 | 0.41 | 0.75 | 0.50 | 0.64 | 6 | 34 |
| ADORA | 100 | 0.34 | 0.27 | 0.21 | 0.13 | 0.14 | 0.05 | 0.09 | 0.24 | 0.09 | 0.17 | 7 | 40 |
| ADORA | 200 | 0.79 | 0.62 | 0.67 | 0.51 | 0.36 | 0.26 | 0.24 | 0.65 | 0.29 | 0.49 | 8 | 36 |
| ADORA | 300 | 0.84 | 0.63 | 0.67 | 0.67 | 0.57 | 0.45 | 0.44 | 0.70 | 0.49 | 0.61 | 6 | 38 |
| ADORA | 350 | 0.84 | 0.74 | 0.73 | 0.67 | 0.50 | 0.38 | 0.42 | 0.74 | 0.44 | 0.61 | 8 | 35 |

ADORA-VL

Training dynamics comparison of GRPO vs. ADORA on Qwen2.5-VL-7B-Instruct (Geometry3K). GRPO exhibits stagnant response-length growth with KL/policy-loss outliers. ADORA achieves sustained length expansion and stabilized optimization at the cost of slight training-reward degradation. Benchmark results demonstrate ADORA's superior in-domain and out-of-domain task performance.

[Figure: adora-figure_01]

Training data comparison across approaches

| | MM-EUREKA-8B | MMR1-math-v0 | Vision-R1-7B | ADORA (ours) |
| --- | --- | --- | --- | --- |
| Base model | InternVL2.5-8B-Instruct | Qwen2.5-VL-7B | Qwen2.5-VL-7B | Qwen2.5-VL-7B |
| Cold-start data | 54k (open-source) | None | 200k (Modality Bridging VLM CoT) | None |
| RL data | 9.3k (K-12 data) | 6k (open-source, carefully curated) | 10k (math data) | 2k (Geometry3K train) |

All results are reported as pass@1 accuracy.

| Model | MathVista (Avg) | MathVista (ID) | MathVista (OOD) | MMStar |
| --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | 67.3 | 69.6 | 65.5 | 63.9 |
| MM-EUREKA-8B | 68.1 | 73.4 | 63.8 | 64.3 |
| MMR1-math-v0 | 70.2 | 72.3 | 68.5 | 64.9 |
| Vision-R1-7B (report) | 73.5 | 81.9 | 66.8 | - |
| GRPO | 70.2 | 71.6 | 69.1 | 61.9 |
| ADORA | 73.5 | 76.1 | 71.4 | 63.8 |

Reproducing

To reproduce the LLM and VLM experiments from the article, refer to the tutorials in the ADORA and ADORA_VL folders. We ran the experiments on 8 × A800 GPUs, and both experiments took approximately 1.5 days to complete.

One More Thing

ADORA can be implemented in verl or OpenRLHF by modifying only a single function. You still need to define, based on your specific training objective, a method that generates advantage weights from the actor's rollout results. You can also choose to apply ADORA only at certain stages of RL training. Notably, ADORA is compatible with and independent of other techniques, integrating seamlessly with cold-start training and the recently proposed DAPO. We welcome feedback, improvements, and collaboration to further explore ADORA's potential implementations. A minimal sketch of such a modification is shown below.
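
For concreteness, here is a minimal, hypothetical sketch of what that single hook could look like in a GRPO-style pipeline. The names `compute_rollout_weights` and `grpo_advantages_with_weights` are our own illustrations, not verl or OpenRLHF APIs, and the weighting rule shown is only one example of a rule you might define for your objective.

```python
import torch

def compute_rollout_weights(rewards: torch.Tensor, response_lengths: torch.Tensor) -> torch.Tensor:
    """Hypothetical user-defined rule mapping rollout outcomes to advantage weights.

    This toy rule up-weights long correct responses and down-weights long
    incorrect ones, one possible way to encourage useful long CoT. The actual
    rule should be chosen to match your training objective.
    """
    is_long = (response_lengths.float() > response_lengths.float().median()).float()
    is_correct = (rewards > 0).float()
    # Default weight 1.0; boost long correct rollouts, dampen long incorrect ones.
    return 1.0 + 0.5 * is_long * is_correct - 0.5 * is_long * (1.0 - is_correct)

def grpo_advantages_with_weights(rewards: torch.Tensor, weights: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style group-normalized advantages for one prompt, rescaled by per-rollout weights."""
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    return advantages * weights

# Toy usage: 4 rollouts for a single prompt.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])           # e.g., rule-based correctness rewards
lengths = torch.tensor([900.0, 1200.0, 300.0, 250.0])  # response lengths in tokens
weights = compute_rollout_weights(rewards, lengths)
print(grpo_advantages_with_weights(rewards, weights))
```

In a real integration, the weight computation would wrap or replace the existing advantage-estimation function, leaving the rest of the training loop unchanged.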

Citation

If you find this blog or our code useful, we would appreciate it if you could cite our work:

@misc{gui2025adora,
  title={Training Reasoning Model with Dynamic Advantage Estimation on Reinforcement Learning},
  author={Lujun Gui and Qingnan Ren},
  year={2025},
  howpublished={\url{https://www.notion.so/Training-Reasoning-Model-with-Dynamic-Advantage-Estimation-on-Reinforcement-Learning-1a830cc0904681fa9df3e076b6557a3e}},
  note={Notion Blog},
}

Previous Work

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Acknowledgement

We thank the verl and OpenRLHF teams for their awesome open-source RL infrastructure.
