This repository was archived by the owner on Aug 18, 2025. It is now read-only.

Training abruptly crashes on single GPU #318

@pranavsinghps1

Description


While working with the knee dataset on a VarNet trained with PyTorch Lightning and using the FastMriDataModule data loaders, I observed that training is unstable and crashes fairly often. I looked for similar issues within this repo but couldn't find any. I also checked the PyTorch forums and found that this kind of issue comes up when the data loader doesn't work well with multiprocessing (pytorch/pytorch#8976); the recommendation there is num_workers=0, which did stabilize my training for some time, but after a while it crashes as well. A simplified sketch of my data module setup with that workaround follows.
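The values below are placeholders rather than my exact settings; in particular, the data path and the plain UnetDataTransform (no mask function) are simplifications of what the real script builds from its CLI arguments.

from pathlib import Path

from fastmri.data.transforms import UnetDataTransform
from fastmri.pl_modules import FastMriDataModule

# Simplified sketch of my setup; the transform and path are placeholders.
transform = UnetDataTransform(which_challenge="singlecoil")

data_module = FastMriDataModule(
    data_path=Path("/path/to/knee_singlecoil"),  # placeholder path
    challenge="singlecoil",
    train_transform=transform,
    val_transform=transform,
    test_transform=transform,
    batch_size=8,
    num_workers=0,  # workaround from pytorch/pytorch#8976; stable only for a while
)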

  • Training on a single GPU with:
backend = "gpu"
num_gpus = 1
batch_size = 8

using the FastMriDataModule on the single-coil knee dataset. Reproduced on a single V100 and a single RTX 8000 GPU (see the reproduction sketch after the traceback).

lightning    1.8.6
torch        2.0.1
  • The entire traceback is as follows:

File "/ext3/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/ext3/miniconda3/lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/ext3/miniconda3/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 307, in rebuild_storage_fd
fd = df.detach()
File "/ext3/miniconda3/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/ext3/miniconda3/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/ext3/miniconda3/lib/python3.10/multiprocessing/connection.py", line 508, in Client
answer_challenge(c, authkey)
File "/ext3/miniconda3/lib/python3.10/multiprocessing/connection.py", line 752, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/ext3/miniconda3/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/ext3/miniconda3/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/ext3/miniconda3/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/scratch/ps4364/fmri2020/varnet_l1_2/unet_knee_sc.py", line 192, in
run_cli()
File "/scratch/ps4364/fmri2020/varnet_l1_2/unet_knee_sc.py", line 188, in run_cli
cli_main(args)
File "/scratch/ps4364/fmri2020/varnet_l1_2/unet_knee_sc.py", line 72, in cli_main
trainer.fit(model, datamodule=data_module)
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 188, in advance
batch = next(data_fetcher)
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in next
return self.fetching_function()
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
batch = next(iterator)
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/supporters.py", line 568, in next
return self.request_next_batch(self.loader_iters)
File "/ext3/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/supporters.py", line 580, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next)
File "/ext3/miniconda3/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 51, in apply_to_collection
return function(data, *args, **kwargs)
File "/ext3/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in next
data = self._next_data()
File "/ext3/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
idx, data = self._get_data()
File "/ext3/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1294, in _get_data
success, data = self._try_get_data()
File "/ext3/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/ext3/miniconda3/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3489789) is killed by signal: Killed.
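
For completeness, the training call itself is roughly the following. UnetModule() is only a stand-in for the model my script (unet_knee_sc.py) builds, the accelerator/devices flags are how backend = "gpu" / num_gpus = 1 translate for lightning 1.8.x, and data_module is the one from the sketch above.

import pytorch_lightning as pl

from fastmri.pl_modules import UnetModule

# Reproduction sketch: single-GPU run with the data module shown earlier.
model = UnetModule()  # stand-in for the model built in unet_knee_sc.py

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
)
trainer.fit(model, datamodule=data_module)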
