
NCCL timeout when GRPO training with vllm #2923

@edwardzjl

Description

I'm working on reproducing the experiments from this awesome blog. However, I’m encountering an issue when enabling vLLM.

When I run the training without vLLM, everything works fine, and I get a training speed of approximately 115.96 s/it.

Part of the training log:
INFO:__main__:*** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
INFO:__main__:*** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
2025-02-21 08:08:48,573 - __main__ - INFO - *** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
2025-02-21 08:08:48,574 - __main__ - INFO - nccl_cumem_enable: 0
INFO:__main__:nccl_cumem_enable: 0
2025-02-21 08:08:48,574 - __main__ - INFO - nccl_cumem_enable: 0
INFO:__main__:nccl_cumem_enable: 0
INFO:__main__:*** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
2025-02-21 08:08:48,574 - __main__ - INFO - nccl_cumem_enable: 0
INFO:__main__:nccl_cumem_enable: 0
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  0%|                                                                                                                                                                                                | 0/450 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
{'loss': 0.0, 'grad_norm': 1.5064643621444702, 'learning_rate': 7.142857142857142e-08, 'rewards/format_reward_func': 0.234375, 'rewards/equation_reward_func': 0.015625, 'reward': 0.25, 'reward_std': 0.4207531735301018, 'completion_length': 301.15625, 'kl': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 2.2298247814178467, 'learning_rate': 1.4285714285714285e-07, 'rewards/format_reward_func': 0.3125, 'rewards/equation_reward_func': 0.0, 'reward': 0.3125, 'reward_std': 0.3221687823534012, 'completion_length': 313.96875, 'kl': 3.722056598576273e-05, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 2.245654344558716, 'learning_rate': 2.1428571428571426e-07, 'rewards/format_reward_func': 0.40625, 'rewards/equation_reward_func': 0.015625, 'reward': 0.421875, 'reward_std': 0.49775634706020355, 'completion_length': 247.390625, 'kl': 0.0007590312125103083, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 3.145155429840088, 'learning_rate': 2.857142857142857e-07, 'rewards/format_reward_func': 0.484375, 'rewards/equation_reward_func': 0.03125, 'reward': 0.515625, 'reward_std': 0.5373184233903885, 'completion_length': 173.875, 'kl': 0.002920848364738049, 'epoch': 0.0}
  2%|███▌                                                                                                                                                                                | 9/450 [21:35<14:12:16, 115.96s/it] 

When I enable vLLM, I hit an NCCL timeout before training even starts.
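
For reference, vLLM generation is switched on via the `use_vllm` flag in GRPOConfig, roughly along these lines (a minimal sketch of my setup; the output path and the other values here are placeholders, my actual script loads them from a config following the blog):

```python
# Minimal sketch of the vLLM-related GRPO settings (placeholder values; the
# real script follows the blog's config rather than hard-coding them here).
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="runs/grpo-countdown",   # placeholder path
    use_vllm=True,                      # enabling this is what triggers the timeout
    vllm_device="auto",                 # vLLM reserves a dedicated GPU for generation
    vllm_gpu_memory_utilization=0.5,    # placeholder value
    gradient_checkpointing=True,
)
```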

Here is some additional information:

Part of the error log:
[rank3]:[I221 08:35:07.859408571 ProcessGroupNCCL.cpp:914] [PG ID 1 PG GUID 1 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
[rank2]:[I221 08:35:07.861143314 ProcessGroupWrapper.cpp:587] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[151936, 1536], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank3]:[I221 08:35:07.861209797 ProcessGroupWrapper.cpp:587] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[151936, 1536], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[W221 08:45:06.568395840 socket.cpp:462] [c10d] waitForInput: poll for socket SocketImpl(fd=42, addr=[localhost]:34988, remote=[localhost]:29500) returned 0, likely a timeout
[rank0]:[W221 08:45:06.569851551 socket.cpp:487] [c10d] waitForInput: socket SocketImpl(fd=42, addr=[localhost]:34988, remote=[localhost]:29500) timed out after 600000ms
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/jovyan/part4/grpo.py", line 277, in <module>
[rank0]:     grpo_function(model_args, script_args, training_args)
[rank0]:   File "/home/jovyan/part4/grpo.py", line 201, in grpo_function
[rank0]:     trainer = GRPOTrainer(
[rank0]:               ^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 444, in __init__
[rank0]:     self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator)
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/trl/models/utils.py", line 247, in prepare_deepspeed
[rank0]:     model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]:     engine = DeepSpeedEngine(args=args,
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank0]:     self._configure_distributed_model(model)
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1200, in _configure_distributed_model
[rank0]:     self.data_parallel_group = groups._get_data_parallel_group()
[rank0]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/utils/groups.py", line 410, in _get_data_parallel_group
[rank0]:     return _clone_world_group()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/utils/groups.py", line 367, in _clone_world_group
[rank0]:     _WORLD_GROUP = dist.new_group(ranks=range(dist.get_world_size()))
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 185, in new_group
[rank0]:     return cdb.new_group(ranks)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 357, in new_group
[rank0]:     return torch.distributed.new_group(ranks)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
[rank0]:     func_return = func(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4565, in new_group
[rank0]:     return _new_group_with_tag(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4648, in _new_group_with_tag
[rank0]:     pg, pg_store = _new_process_group_helper(
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1840, in _new_process_group_helper
[rank0]:     backend_class = _create_process_group_wrapper(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4255, in _create_process_group_wrapper
[rank0]:     helper_pg = ProcessGroupGloo(store, rank, world_size, timeout=timeout)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistStoreError: wait timeout after 600000ms, keys: //worker/attempt_0/default_pg/0//7//cuda//pg_wrapper:7/0/1
[rank3]:[E221 08:45:07.886838878 ProcessGroupGloo.cpp:143] Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank2]:[E221 08:45:07.887390966 ProcessGroupGloo.cpp:143] Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/jovyan/part4/grpo.py", line 277, in <module>
[rank3]:     grpo_function(model_args, script_args, training_args)
[rank3]:   File "/home/jovyan/part4/grpo.py", line 201, in grpo_function
[rank3]:     trainer = GRPOTrainer(
[rank3]:               ^^^^^^^^^^^^
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 444, in __init__
[rank3]:     self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator)
[rank3]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/trl/models/utils.py", line 247, in prepare_deepspeed
[rank3]:     model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
[rank3]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank3]:     engine = DeepSpeedEngine(args=args,
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank3]:     self._configure_distributed_model(model)
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1213, in _configure_distributed_model
[rank3]:     self._broadcast_model()
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1131, in _broadcast_model
[rank3]:     dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank3]:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank3]:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
[rank3]:     work = group.broadcast([tensor], opts)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: RuntimeError: Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank3]:  Original exception: 
[rank3]: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 600000ms for send operation to complete
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/jovyan/part4/grpo.py", line 277, in <module>
[rank2]:     grpo_function(model_args, script_args, training_args)
[rank2]:   File "/home/jovyan/part4/grpo.py", line 201, in grpo_function
[rank2]:     trainer = GRPOTrainer(
[rank2]:               ^^^^^^^^^^^^
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 444, in __init__
[rank2]:     self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator)
[rank2]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/trl/models/utils.py", line 247, in prepare_deepspeed
[rank2]:     model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank2]:     engine = DeepSpeedEngine(args=args,
[rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank2]:     self._configure_distributed_model(model)
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1213, in _configure_distributed_model
[rank2]:     self._broadcast_model()
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1131, in _broadcast_model
[rank2]:     dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank2]:     return func(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank2]:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank2]:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank2]:     return func(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
[rank2]:     work = group.broadcast([tensor], opts)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: RuntimeError: Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank2]:  Original exception: 
[rank2]: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 600000ms for send operation to complete

(I can provide the full log if that would be helpful, but I did not find any additional clues in it.)
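
As an extra data point, the collective shown hanging in the ProcessGroupWrapper fingerprint above is a broadcast of a float32 [151936, 1536] tensor. Below is a minimal smoke test I put together for debugging (the file name and launch command are my own, not part of the training script) to check whether a plain NCCL broadcast of that shape works across the same GPUs outside of TRL/DeepSpeed:

```python
# nccl_smoke_test.py -- my own debugging sketch, not part of the training script.
# Reproduces the collective from the wrapper fingerprint above (a broadcast of a
# float32 [151936, 1536] tensor) without TRL/DeepSpeed in the loop.
# Launch with, e.g.:  torchrun --nproc_per_node=4 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Same shape/dtype as the tensor being broadcast when the job hangs.
    tensor = torch.full((151936, 1536), float(dist.get_rank()), device="cuda")
    dist.broadcast(tensor, src=0)
    torch.cuda.synchronize()

    print(f"rank {dist.get_rank()}: broadcast ok, tensor[0, 0] = {tensor[0, 0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```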

Here's the trl env output:
$ trl env
[2025-02-21 08:58:07,927] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/jovyan/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
INFO 02-21 08:58:10 __init__.py:190] Automatically detected platform cuda.

Copy-paste the following information when reporting an issue:

- Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
- Python version: 3.11.11
- PyTorch version: 2.5.1+cu121
- CUDA device(s): Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB
- Transformers version: 4.50.0.dev0
- Accelerate version: 1.3.0
- Accelerate config: 
  - compute_environment: LOCAL_MACHINE
  - distributed_type: FSDP
  - mixed_precision: fp16
  - use_cpu: False
  - debug: False
  - num_processes: 4
  - machine_rank: 0
  - num_machines: 1
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - enable_cpu_affinity: False
  - fsdp_config: {'fsdp_auto_wrap_policy': 'NO_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_use_orig_params': True}
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- Datasets version: 3.2.0
- HF Hub version: 0.27.1
- TRL version: 0.15.1
- bitsandbytes version: not installed
- DeepSpeed version: 0.16.3
- Diffusers version: not installed
- Liger-Kernel version: 0.5.2
- LLM-Blender version: not installed
- OpenAI version: 1.63.2
- PEFT version: 0.14.0

Has anyone encountered a similar problem or have any advice on how to resolve this? Any help would be greatly appreciated!

Labels: 🏋 GRPO (Related to GRPO), 🐛 bug (Something isn't working)
