I'm working on reproducing the experiments from this awesome blog. However, I’m encountering an issue when enabling vLLM.
When I run the training without vLLM, everything works fine, and I get a training speed of approximately 115.96 s/it.
Part of the training log:
INFO:__main__:*** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
INFO:__main__:*** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
2025-02-21 08:08:48,573 - __main__ - INFO - *** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
2025-02-21 08:08:48,574 - __main__ - INFO - nccl_cumem_enable: 0
INFO:__main__:nccl_cumem_enable: 0
2025-02-21 08:08:48,574 - __main__ - INFO - nccl_cumem_enable: 0
INFO:__main__:nccl_cumem_enable: 0
INFO:__main__:*** Starting training 2025-02-21 08:08:48 for 3.0 epochs***
2025-02-21 08:08:48,574 - __main__ - INFO - nccl_cumem_enable: 0
INFO:__main__:nccl_cumem_enable: 0
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
0%| | 0/450 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
{'loss': 0.0, 'grad_norm': 1.5064643621444702, 'learning_rate': 7.142857142857142e-08, 'rewards/format_reward_func': 0.234375, 'rewards/equation_reward_func': 0.015625, 'reward': 0.25, 'reward_std': 0.4207531735301018, 'completion_length': 301.15625, 'kl': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 2.2298247814178467, 'learning_rate': 1.4285714285714285e-07, 'rewards/format_reward_func': 0.3125, 'rewards/equation_reward_func': 0.0, 'reward': 0.3125, 'reward_std': 0.3221687823534012, 'completion_length': 313.96875, 'kl': 3.722056598576273e-05, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 2.245654344558716, 'learning_rate': 2.1428571428571426e-07, 'rewards/format_reward_func': 0.40625, 'rewards/equation_reward_func': 0.015625, 'reward': 0.421875, 'reward_std': 0.49775634706020355, 'completion_length': 247.390625, 'kl': 0.0007590312125103083, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 3.145155429840088, 'learning_rate': 2.857142857142857e-07, 'rewards/format_reward_func': 0.484375, 'rewards/equation_reward_func': 0.03125, 'reward': 0.515625, 'reward_std': 0.5373184233903885, 'completion_length': 173.875, 'kl': 0.002920848364738049, 'epoch': 0.0}
2%|███▌ | 9/450 [21:35<14:12:16, 115.96s/it]
When I enable vLLM, I hit an NCCL timeout before training even starts.
Here is some extra information:
- I set `NCCL_CUMEM_ENABLE=0`, following the vLLM documentation: https://docs.vllm.ai/en/v0.7.3/getting_started/troubleshooting.html#known-issues.
- I'm using the first 4 GPUs for training and the 5th GPU for vLLM, so `CUDA_VISIBLE_DEVICES="0,1,2,3,4"`, `torchrun --nproc_per_node 4`, and `--vllm_device "cuda:4"` (see the sketch after this list for how these fit together).
- Since I'm running on V100 and FlashAttention does not support V100, vLLM falls back to the `xformers` backend. The `xformers` backend has a small issue when not running on the first GPU (cuda:0), so I made some modifications to vLLM ([Bugfix] Initialize attention bias on the same device as Query/Key/Value vllm-project/vllm#13468).
- Also related to V100: prefix caching does not work on V100 ([Bug]: Crash with --enable-prefix-caching enabled vllm-project/vllm#3944), so I set `enable_prefix_caching=False` when initializing the LLM.
- I've also tried enforcing IPv4 networking with `NCCL_SOCKET_FAMILY=AF_INET`, adjusting the NCCL P2P level with `NCCL_P2P_LEVEL=NVL`, and even disabling P2P with `NCCL_P2P_DISABLE=1`, but none of these resolved the issue.
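Not my exact script, but a minimal sketch of how the vLLM-related settings above are wired up on my side (parameter names follow TRL 0.15's `GRPOConfig`; `output_dir` and the omitted arguments are placeholders):

```python
# Minimal sketch, not the full script from the blog: only the settings relevant
# to this issue are shown; output_dir is a placeholder.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen-grpo-v100",   # placeholder
    use_vllm=True,                 # enabling this is what triggers the NCCL timeout
    vllm_device="cuda:4",          # the 5th visible GPU is reserved for vLLM generation
    gradient_checkpointing=True,   # matches the use_cache warnings in the log above
    # ... remaining GRPO / DeepSpeed arguments omitted ...
)

# Launched (roughly) with:
#   NCCL_CUMEM_ENABLE=0 CUDA_VISIBLE_DEVICES="0,1,2,3,4" \
#   torchrun --nproc_per_node 4 grpo.py --vllm_device "cuda:4" ...
```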
Part of the error log:
[rank3]:[I221 08:35:07.859408571 ProcessGroupNCCL.cpp:914] [PG ID 1 PG GUID 1 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.21.5, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_CUDA_EVENT_CACHE: 0, TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN: 1
[rank2]:[I221 08:35:07.861143314 ProcessGroupWrapper.cpp:587] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[151936, 1536], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank3]:[I221 08:35:07.861209797 ProcessGroupWrapper.cpp:587] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=BROADCAST, TensorShape=[151936, 1536], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[W221 08:45:06.568395840 socket.cpp:462] [c10d] waitForInput: poll for socket SocketImpl(fd=42, addr=[localhost]:34988, remote=[localhost]:29500) returned 0, likely a timeout
[rank0]:[W221 08:45:06.569851551 socket.cpp:487] [c10d] waitForInput: socket SocketImpl(fd=42, addr=[localhost]:34988, remote=[localhost]:29500) timed out after 600000ms
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/jovyan/part4/grpo.py", line 277, in <module>
[rank0]: grpo_function(model_args, script_args, training_args)
[rank0]: File "/home/jovyan/part4/grpo.py", line 201, in grpo_function
[rank0]: trainer = GRPOTrainer(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 444, in __init__
[rank0]: self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/trl/models/utils.py", line 247, in prepare_deepspeed
[rank0]: model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank0]: self._configure_distributed_model(model)
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1200, in _configure_distributed_model
[rank0]: self.data_parallel_group = groups._get_data_parallel_group()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/utils/groups.py", line 410, in _get_data_parallel_group
[rank0]: return _clone_world_group()
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/utils/groups.py", line 367, in _clone_world_group
[rank0]: _WORLD_GROUP = dist.new_group(ranks=range(dist.get_world_size()))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 185, in new_group
[rank0]: return cdb.new_group(ranks)
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 357, in new_group
[rank0]: return torch.distributed.new_group(ranks)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
[rank0]: func_return = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4565, in new_group
[rank0]: return _new_group_with_tag(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4648, in _new_group_with_tag
[rank0]: pg, pg_store = _new_process_group_helper(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1840, in _new_process_group_helper
[rank0]: backend_class = _create_process_group_wrapper(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4255, in _create_process_group_wrapper
[rank0]: helper_pg = ProcessGroupGloo(store, rank, world_size, timeout=timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistStoreError: wait timeout after 600000ms, keys: //worker/attempt_0/default_pg/0//7//cuda//pg_wrapper:7/0/1
[rank3]:[E221 08:45:07.886838878 ProcessGroupGloo.cpp:143] Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank2]:[E221 08:45:07.887390966 ProcessGroupGloo.cpp:143] Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/jovyan/part4/grpo.py", line 277, in <module>
[rank3]: grpo_function(model_args, script_args, training_args)
[rank3]: File "/home/jovyan/part4/grpo.py", line 201, in grpo_function
[rank3]: trainer = GRPOTrainer(
[rank3]: ^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 444, in __init__
[rank3]: self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/trl/models/utils.py", line 247, in prepare_deepspeed
[rank3]: model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank3]: engine = DeepSpeedEngine(args=args,
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank3]: self._configure_distributed_model(model)
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1213, in _configure_distributed_model
[rank3]: self._broadcast_model()
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1131, in _broadcast_model
[rank3]: dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank3]: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank3]: return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
[rank3]: work = group.broadcast([tensor], opts)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: RuntimeError: Rank 3 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank3]: Original exception:
[rank3]: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 600000ms for send operation to complete
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/jovyan/part4/grpo.py", line 277, in <module>
[rank2]: grpo_function(model_args, script_args, training_args)
[rank2]: File "/home/jovyan/part4/grpo.py", line 201, in grpo_function
[rank2]: trainer = GRPOTrainer(
[rank2]: ^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 444, in __init__
[rank2]: self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/trl/models/utils.py", line 247, in prepare_deepspeed
[rank2]: model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank2]: engine = DeepSpeedEngine(args=args,
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank2]: self._configure_distributed_model(model)
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1213, in _configure_distributed_model
[rank2]: self._broadcast_model()
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1131, in _broadcast_model
[rank2]: dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank2]: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank2]: return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
[rank2]: work = group.broadcast([tensor], opts)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: RuntimeError: Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[rank2]: Original exception:
[rank2]: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 600000ms for send operation to complete
(I can provide the full log if that's helpful, but I did not find any additional clues there.)
Here's the trl env output:
$ trl env
[2025-02-21 08:58:07,927] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/jovyan/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
INFO 02-21 08:58:10 __init__.py:190] Automatically detected platform cuda.
Copy-paste the following information when reporting an issue:
- Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
- Python version: 3.11.11
- PyTorch version: 2.5.1+cu121
- CUDA device(s): Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB, Tesla V100-SXM2-32GB
- Transformers version: 4.50.0.dev0
- Accelerate version: 1.3.0
- Accelerate config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- fsdp_config: {'fsdp_auto_wrap_policy': 'NO_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_use_orig_params': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- Datasets version: 3.2.0
- HF Hub version: 0.27.1
- TRL version: 0.15.1
- bitsandbytes version: not installed
- DeepSpeed version: 0.16.3
- Diffusers version: not installed
- Liger-Kernel version: 0.5.2
- LLM-Blender version: not installed
- OpenAI version: 1.63.2
- PEFT version: 0.14.0
Has anyone encountered a similar problem or have any advice on how to resolve this? Any help would be greatly appreciated!