I'm working on reproducing the experiments from this awesome blog. However, I’m encountering an issue when enabling vLLM.
When I run the training without vLLM, everything works fine, and I get a training speed of approximately 115.96 s/it.
Part of the training log:
INFO:__main__:*** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
INFO:__main__:*** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
2025-02-21 08:08:48,573 - __main__ - INFO - *** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
2025-02-21 08:08:48,574 - __main__ - INFO - nccl_cumem_enable: 0
INFO:__main__:nccl_cumem_enable: 0
2025-02-21 08:08:48,574 - __main__ - INFO - nccl_cumem_enable: 0
INFO:__main__:nccl_cumem_enable: 0
INFO:__main__:*** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
2025-02-21 08:08:48,574 - __main__ - INFO - nccl_cumem_enable: 0
INFO:__main__:nccl_cumem_enable: 0
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
0%| | 0/450 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
{'loss': 0.0, 'grad_norm': 1.5064643621444702, 'learning_rate': 7.142857142857142e-08, 'rewards/format_reward_func': 0.234375, 'rewards/equation_reward_func': 0.015625, 'reward': 0.25, 'reward_std': 0.4207531735301018, 'completion_length': 301.15625, 'kl': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 2.2298247814178467, 'learning_rate': 1.4285714285714285e-07, 'rewards/format_reward_func': 0.3125, 'rewards/equation_reward_func': 0.0, 'reward': 0.3125, 'reward_std': 0.3221687823534012, 'completion_length': 313.96875, 'kl': 3.722056598576273e-05, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 2.245654344558716, 'learning_rate': 2.1428571428571426e-07, 'rewards/format_reward_func': 0.40625, 'rewards/equation_reward_func': 0.015625, 'reward': 0.421875, 'reward_std': 0.49775634706020355, 'completion_length': 247.390625, 'kl': 0.0007590312125103083, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 3.145155429840088, 'learning_rate': 2.857142857142857e-07, 'rewards/format_reward_func': 0.484375, 'rewards/equation_reward_func': 0.03125, 'reward': 0.515625, 'reward_std': 0.5373184233903885, 'completion_length': 173.875, 'kl': 0.002920848364738049, 'epoch': 0.0}
2%|███▌ | 9/450 [21:35<14:12:16, 115.96s/it]
When I enable vLLM, I hit an NCCL timeout before training even starts.
Here is some extra information:
- I set `NCCL_CUMEM_ENABLE=0`, following the vLLM documentation: https://docs.vllm.ai/en/v0.7.3/getting_started/troubleshooting.html#known-issues.
- I'm using the first 4 GPUs for training and the 5th GPU for vLLM, so `CUDA_VISIBLE_DEVICES="0,1,2,3,4"`, `torchrun --nproc_per_node 4`, and `--vllm_device "cuda:4"` (see the sketch after this list for how these fit together).
- Since I'm running on V100 and FlashAttention does not support V100, vLLM falls back to the `xformers` backend. The `xformers` backend has a small issue when not running on the first GPU (cuda:0), so I made some modifications to vLLM ([Bugfix] Initialize attention bias on the same device as Query/Key/Value vllm-project/vllm#13468).
- Also related to V100: prefix caching does not work on V100 ([Bug]: Crash with --enable-prefix-caching enabled vllm-project/vllm#3944), so I set `enable_prefix_caching=False` when initializing the LLM.
- I've also tried enforcing IPv4 networking with `NCCL_SOCKET_FAMILY=AF_INET`, adjusting the NCCL P2P level with `NCCL_P2P_LEVEL=NVL`, and even disabling P2P with `NCCL_P2P_DISABLE=1`, but none of these resolved the issue.
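Not my exact script, but a minimal sketch of how the vLLM-related settings above are wired up on my side (parameter names follow TRL 0.15's `GRPOConfig`; `output_dir` and the omitted arguments are placeholders):

```python
# Minimal sketch, not the full script from the blog: only the settings relevant
# to this issue are shown; output_dir is a placeholder.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen-grpo-v100",   # placeholder
    use_vllm=True,                 # enabling this is what triggers the NCCL timeout
    vllm_device="cuda:4",          # the 5th visible GPU is reserved for vLLM generation
    gradient_checkpointing=True,   # matches the use_cache warnings in the log above
    # ... remaining GRPO / DeepSpeed arguments omitted ...
)

# Launched (roughly) with:
#   NCCL_CUMEM_ENABLE=0 CUDA_VISIBLE_DEVICES="0,1,2,3,4" \
#   torchrun --nproc_per_node 4 grpo.py --vllm_device "cuda:4" ...
```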
Part of the error log:
[rank3]:[I221 08:35:07.859408571 ProcessGroupNCCL.cpp:914] [PG ID 1 PG GUID 1 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
[rank2]:[I221 08:35:07.861143314 ProcessGroupWrapper.cpp:587] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[151936, 1536], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank3]:[I221 08:35:07.861209797 ProcessGroupWrapper.cpp:587] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[151936, 1536], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[W221 08:45:06.568395840 socket.cpp:462] [c10d] waitForInput: poll for socket SocketImpl(fd=42, addr=[localhost]:34988, remote=[localhost]:29500) returned 0, likely a timeout
[rank0]:[W221 08:45:06.569851551 socket.cpp:487] [c10d] waitForInput: socket SocketImpl(fd=42, addr=[localhost]:34988, remote=[localhost]:29500) timed out after 600000ms
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/jovyan/part4/grpo.py", line 277, in <module>
[rank0]: grpo_function(model_args, script_args, training_args)
[rank0]: File "/home/jovyan/part4/grpo.py", line 201, in grpo_function
[rank0]: trainer = GRPOTrainer(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 444, in __init__
[rank0]: self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/trl/models/utils.py", line 247, in prepare_deepspeed
[rank0]: model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank0]: self._configure_distributed_model(model)
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1200, in _configure_distributed_model
[rank0]: self.data_parallel_group = groups._get_data_parallel_group()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/utils/groups.py", line 410, in _get_data_parallel_group
[rank0]: return _clone_world_group()
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/utils/groups.py", line 367, in _clone_world_group
[rank0]: _WORLD_GROUP = dist.new_group(ranks=range(dist.get_world_size()))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 185, in new_group
[rank0]: return cdb.new_group(ranks)
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 357, in new_group
[rank0]: return torch.distributed.new_group(ranks)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
[rank0]: func_return = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4565, in new_group
[rank0]: return _new_group_with_tag(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4648, in _new_group_with_tag
[rank0]: pg, pg_store = _new_process_group_helper(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1840, in _new_process_group_helper
[rank0]: backend_class = _create_process_group_wrapper(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4255, in _create_process_group_wrapper
[rank0]: helper_pg = ProcessGroupGloo(store, rank, world_size, timeout=timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistStoreError: wait timeout after 600000ms, keys: //worker/attempt_0/default_pg/0//7//cuda//pg_wrapper:7/0/1
[rank3]:[E221 08:45:07.886838878 ProcessGroupGloo.cpp:143] Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank2]:[E221 08:45:07.887390966 ProcessGroupGloo.cpp:143] Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/jovyan/part4/grpo.py", line 277, in <module>
[rank3]: grpo_function(model_args, script_args, training_args)
[rank3]: File "/home/jovyan/part4/grpo.py", line 201, in grpo_function
[rank3]: trainer = GRPOTrainer(
[rank3]: ^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 444, in __init__
[rank3]: self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/trl/models/utils.py", line 247, in prepare_deepspeed
[rank3]: model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank3]: engine = DeepSpeedEngine(args=args,
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank3]: self._configure_distributed_model(model)
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1213, in _configure_distributed_model
[rank3]: self._broadcast_model()
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1131, in _broadcast_model
[rank3]: dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank3]: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank3]: return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
[rank3]: work = group.broadcast([tensor], opts)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: RuntimeError: Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank3]: Original exception:
[rank3]: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 600000ms for send operation to complete
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/jovyan/part4/grpo.py", line 277, in <module>
[rank2]: grpo_function(model_args, script_args, training_args)
[rank2]: File "/home/jovyan/part4/grpo.py", line 201, in grpo_function
[rank2]: trainer = GRPOTrainer(
[rank2]: ^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 444, in __init__
[rank2]: self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/trl/models/utils.py", line 247, in prepare_deepspeed
[rank2]: model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank2]: engine = DeepSpeedEngine(args=args,
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank2]: self._configure_distributed_model(model)
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1213, in _configure_distributed_model
[rank2]: self._broadcast_model()
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1131, in _broadcast_model
[rank2]: dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank2]: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank2]: return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
[rank2]: work = group.broadcast([tensor], opts)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: RuntimeError: Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank2]: Original exception:
[rank2]: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 600000ms for send operation to complete
(I can provide the full log if that's helpful, but I did not find any additional clues there.)
Here's the trl env output:
$ trl env
[2025-02-21 08:58:07,927] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/jovyan/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
INFO 02-21 08:58:10 __init__.py:190] Automatically detected platform cuda.
Copy-paste the following information when reporting an issue:
- Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
- Python version: 3.11.11
- PyTorch version: 2.5.1+cu121
- CUDA device(s): Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB
- Transformers version: 4.50.0.dev0
- Accelerate version: 1.3.0
- Accelerate config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- fsdp_config: {'fsdp_auto_wrap_policy': 'NO_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_use_orig_params': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- Datasets version: 3.2.0
- HF Hub version: 0.27.1
- TRL version: 0.15.1
- bitsandbytes version: not installed
- DeepSpeed version: 0.16.3
- Diffusers version: not installed
- Liger-Kernel version: 0.5.2
- LLM-Blender version: not installed
- OpenAI version: 1.63.2
- PEFT version: 0.14.0
Has anyone encountered a similar problem or have any advice on how to resolve this? Any help would be greatly appreciated!