
Feat: Clip Higher #199


Merged
merged 2 commits into om-ai-lab:main on Mar 31, 2025

Conversation

SabaPivot
Contributor

No description provided.

@SabaPivot SabaPivot closed this Mar 28, 2025
@SabaPivot SabaPivot reopened this Mar 28, 2025
@SabaPivot
Contributor Author

  1. Set beta = 0, disabling the KL divergence penalty
  2. Add Clip Higher from DAPO

@SZhanZ
Contributor

SZhanZ commented Mar 30, 2025

Hello, could you explain the updated parameters?

@SabaPivot
Contributor Author

  1. Set beta = 0, disabling the KL divergence penalty.
  • Setting beta = 0 effectively removes the KL divergence penalty (as noted in your blog posts).
  • I have made it possible to set beta = 0.0 in the .sh file; when beta is 0.0, no reference model is initialized, which significantly reduces GPU usage during GRPO training.
  2. Add Clip Higher from DAPO.
  • This change directly implements one of the core techniques introduced in the DAPO paper, specifically designed to promote exploration and prevent "entropy collapse" (where the model becomes too conservative and produces non-diverse outputs).

  • Standard policy gradient methods "clip" the objective function to limit how large each policy update can be.

  • DAPO modifies this by using a higher upper clipping threshold (ε_high).

The "Clip Higher" technique allows the probability ratio (new policy / old policy) to increase further for advantageous actions before being clipped; like setting beta = 0.0, this encourages exploration (see the sketch below).

@SabaPivot
Contributor Author

Hello, could you explain the updated parameters?

Hope my reply helps. In my experiments, these improvements contributed to higher performance.

@SZhanZ
Contributor

SZhanZ commented Mar 31, 2025


Thanks for your reply. We will merge it. Could you please also provide an example script in the src/open-r1-multimodal/run_script folder?

@SabaPivot
Contributor Author

SabaPivot commented Mar 31, 2025

Thanks for your reply. We will merge it. Could you please also provide an example script in the src/open-r1-multimodal/run_script folder?

Sure, my bad.

I just added the two lines below to the .sh file:

src/open-r1-multimodal/run_script/run_grpo_rec.sh

cd /workspace/VLM-R1-DAPO/src/open-r1-multimodal

export DEBUG_MODE="true"
export CUDA_VISIBLE_DEVICES=1,2

RUN_NAME="Qwen2.5-VL-3B-GRPO-REC-SFT"
export LOG_PATH="./debug_log_$RUN_NAME.txt"

# --beta 0.0: disable the reference model (no KL penalty)
# --epsilon_high 0.28: Clip Higher, epsilon_high=0.28 as recommended in the DAPO paper
torchrun --nproc_per_node="2" \
    --nnodes="1" \
    --node_rank="0" \
    (..your code..)
    --beta 0.0 \
    --epsilon_high 0.28

Added the Clip Higher parameters according to the DAPO paper (https://arxiv.org/html/2503.14476v1).
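
For reference, a minimal sketch (not this repo's actual code; maybe_build_reference is just an illustrative name) of the --beta 0.0 behavior described above, i.e. skipping the frozen reference copy of the policy entirely, which is where the GPU memory savings come from:

import copy

def maybe_build_reference(policy_model, beta: float):
    # Only materialize the frozen reference model when the KL penalty is active.
    # With --beta 0.0 there is no KL term, so the extra copy is never allocated.
    if beta <= 0.0:
        return None
    ref_model = copy.deepcopy(policy_model).eval()
    for p in ref_model.parameters():
        p.requires_grad_(False)
    return ref_model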
@SabaPivot
Contributor Author

I have also pushed a new commit.

Thank you.

@SZhanZ
Contributor

SZhanZ commented Mar 31, 2025

I have also pushed a new commit.

Thank you.

okay~ thanks for the PR

@SZhanZ SZhanZ merged commit 0f6158a into om-ai-lab:main Mar 31, 2025
IANNXANG pushed a commit to IANNXANG/VLM-R1 that referenced this pull request May 20, 2025
@Shengqi77

Hello @SabaPivot, how do I implement DAPO's dynamic sampling in this code?
