
Beyond the Training Plateau: Verbose Length Reduction in Reinforcement Learning on LLMs


This repository investigates the issue of verbose responses generated by large language models during reinforcement learning training.

We define verbose length as the number of tokens that follow the final answer enclosed in \boxed{}.
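The definition above can be sketched as a small helper. This is an illustrative sketch, not the repository's actual implementation: `tokenize` stands in for any callable that maps a string to a token list (e.g. a Hugging Face tokenizer's `encode`).

```python
def verbose_length(text: str, tokenize) -> int:
    """Count tokens that follow the last \\boxed{...} answer in `text`.

    `tokenize` is any string -> token-list callable; this helper is a
    sketch of the metric's definition, not the repo's real code.
    """
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        # No boxed answer: treat the whole response as verbose.
        return len(tokenize(text))
    # Walk to the matching closing brace (boxed answers may nest braces,
    # e.g. \boxed{\frac{1}{2}}).
    depth = 1
    i = start + len(marker)
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    # Tokens after the closing brace of the boxed answer.
    return len(tokenize(text[i:]))
```

With a whitespace tokenizer, `verbose_length(r"The answer is \boxed{42}. Hope that helps!", str.split)` counts the four whitespace-separated tokens after the boxed answer.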

Our analysis reveals that verbose length naturally decreases over training steps. These findings suggest an emergent ability of the model to reduce verbosity, even in the absence of explicit constraints and even when other training metrics have already plateaued. Moreover, responses containing Python code are initially more verbose but also become more concise over time.

As accuracy improves, the average verbose length decreases. The proportion of responses with shorter verbose lengths increases progressively during training.

Installation

conda create -n vlr python=3.10
conda activate vlr
pip3 install vllm==0.8.1 liger_kernel==0.5.8
# If the following step gets stuck, you can manually download the .whl file
# from https://github.com/Dao-AILab/flash-attention/releases/
# and install it with: pip3 install flash_attn-XXXX.whl
pip3 install flash-attn --no-build-isolation 
pip3 install -e .

Training

We have already prepared the MATH dataset in data/math.

As an example, to run REINFORCE training:

export MODEL_PATH=path_to_the_model
conda activate vlr
. scripts/reinforce.sh

During training, various data, including responses, scores, and lengths, will be recorded in an .ndjson file under dump/, which can be used for further analysis.
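As a starting point for such analysis, the dump can be aggregated per training step. The field names below (`step`, `verbose_length`) are assumptions about the dump format; adjust them to match the keys actually written under dump/.

```python
import json
from collections import defaultdict

def mean_verbose_length_per_step(path: str) -> dict:
    """Average verbose length per training step from an .ndjson dump.

    Assumes one JSON record per line with (hypothetical) keys
    "step" and "verbose_length"; rename to match the real dump.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            sums[rec["step"]] += rec["verbose_length"]
            counts[rec["step"]] += 1
    # Sorted by step so the result reads as a training curve.
    return {s: sums[s] / counts[s] for s in sorted(sums)}
```

Plotting the returned dict over steps reproduces the trend described above: average verbose length falling as training progresses.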

To save checkpoints, set trainer.save_freq to a value greater than 0.

(Optional) To use wandb for logging, export your wandb API key before training.

export WANDB_API_KEY=your_wandb_api_key

Then append 'wandb' to trainer.logger.
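Assuming the script forwards Hydra-style overrides to the trainer, as verl launch scripts commonly do, the two steps together might look like the following (the exact override syntax depends on scripts/reinforce.sh):

```shell
export WANDB_API_KEY=your_wandb_api_key
# Hypothetical Hydra-style list override appending 'wandb' to the logger;
# verify against the config consumed by scripts/reinforce.sh.
. scripts/reinforce.sh trainer.logger=['console','wandb']
```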

Acknowledgements

This repository is built on verl, with Qwen2.5-3B as the base model.

Citation

If this blog post or the code helped you, a citation would be greatly appreciated!

@misc{wu2025vlr,
  title        = {Beyond the Training Plateau: Verbose Length Reduction in Reinforcement Learning on LLMs},
  author       = {Wu, Lumeng},
  year         = {2025},
  howpublished = {\url{https://dirtydan0.notion.site/Beyond-the-Training-Plateau-Verbose-Length-Reduction-in-Reinforcement-Learning-on-LLMs-1bdbfca801e180bfa63afba2dd28ba4c?pvs=4}},
  note         = {Code: \url{https://github.com/dirtyDan0/VerboseLengthReduction}},
}
