
Feat: Clip Higher #199


Merged
merged 2 commits into om-ai-lab:main on Mar 31, 2025

Conversation

SabaPivot
Contributor

No description provided.

@SabaPivot SabaPivot closed this Mar 28, 2025
@SabaPivot SabaPivot reopened this Mar 28, 2025
@SabaPivot
Contributor Author

  1. Set beta = 0, disabling the KL divergence penalty
  2. Add Clip Higher from DAPO

@SZhanZ
Contributor

SZhanZ commented Mar 30, 2025

Hello, could you explain the updated parameters?

@SabaPivot
Contributor Author

  1. Set beta = 0, disabling the KL divergence penalty.
  • Setting beta = 0 effectively removes the KL divergence penalty (as noted in your blog posts).
  • I have made it possible to set beta = 0.0 in the .sh file; when beta is 0.0, no reference model is initialized, which significantly reduces GPU usage during GRPO training.
  2. Add Clip Higher from DAPO.
  • This change directly implements one of the core techniques introduced in the DAPO paper, specifically designed to promote exploration and prevent "entropy collapse" (where the model becomes too conservative and produces non-diverse outputs).

  • Standard policy gradient methods "clip" the objective function to limit how large each policy update can be.

  • DAPO modifies this by using a higher upper clipping threshold (ε_high).

The "Clip Higher" technique allows the probability ratio (new policy / old policy) to increase further for advantageous actions before being clipped; like setting beta = 0.0, this encourages exploration (see the sketch below).

@SabaPivot
Contributor Author

Hello, could you explain the updated parameters?

Hope my reply helps. In my experiments, these improvements contributed to higher performance.

@SZhanZ
Contributor

SZhanZ commented Mar 31, 2025


Thanks for your reply. We will merge it. Could you please also provide an example script in the src/open-r1-multimodal/run_script folder?

@SabaPivot
Contributor Author

SabaPivot commented Mar 31, 2025

Thanks for your reply. We will merge it. Could you please also provide an example script in the src/open-r1-multimodal/run_script folder?

Sure, my bad.

I just added the two lines below to the .sh file:

src/open-r1-multimodal/run_script/run_grpo_rec.sh

cd /workspace/VLM-R1-DAPO/src/open-r1-multimodal

export DEBUG_MODE="true"
export CUDA_VISIBLE_DEVICES=1,2

RUN_NAME="Qwen2.5-VL-3B-GRPO-REC-SFT"
export LOG_PATH="./debug_log_$RUN_NAME.txt"

# --beta 0.0: disable the reference model (no KL penalty)
# --epsilon_high 0.28: Clip Higher, epsilon_high=0.28 as recommended in the DAPO paper
torchrun --nproc_per_node="2" \
    --nnodes="1" \
    --node_rank="0" \
    (..your code..)
    --beta 0.0 \
    --epsilon_high 0.28

Added the Clip Higher parameters according to the DAPO paper (https://arxiv.org/html/2503.14476v1).
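
For reference, a minimal sketch (not this repo's actual code; maybe_build_reference is just an illustrative name) of the --beta 0.0 behavior described above, i.e. skipping the frozen reference copy of the policy entirely, which is where the GPU memory savings come from:

import copy

def maybe_build_reference(policy_model, beta: float):
    # Only materialize the frozen reference model when the KL penalty is active.
    # With --beta 0.0 there is no KL term, so the extra copy is never allocated.
    if beta <= 0.0:
        return None
    ref_model = copy.deepcopy(policy_model).eval()
    for p in ref_model.parameters():
        p.requires_grad_(False)
    return ref_model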
@SabaPivot
Contributor Author

I have also pushed a new commit.

Thank you.

@SZhanZ
Contributor

SZhanZ commented Mar 31, 2025

I have also pushed a new commit.

Thank you.

okay~ thanks for the PR

@SZhanZ SZhanZ merged commit 0f6158a into om-ai-lab:main Mar 31, 2025
IANNXANG pushed a commit to IANNXANG/VLM-R1 that referenced this pull request May 20, 2025
@Shengqi77

Hello @SabaPivot, how do I implement DAPO's dynamic sampling in this code?
