This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

RuntimeError: DataLoader worker (pid 16560) is killed by signal: Killed. #195

jario-jin opened this issue Nov 21, 2018 · 16 comments

@jario-jin
Contributor

jario-jin commented Nov 21, 2018

❓ Questions and Help

I am using the COCO dataset with the default settings.
The error below occurred at around iteration 23540.
Is this a problem with my dataset, or something else?

My Environment:
2018-11-21 17:02:50,065 maskrcnn_benchmark INFO: Using 2 GPUs
2018-11-21 17:02:50,066 maskrcnn_benchmark INFO: Namespace(config_file='configs/e2e_faster_rcnn_R_50_FPN_1x.yaml', distributed=True, local_rank=0, opts=[], skip_test=False)
2018-11-21 17:02:50,066 maskrcnn_benchmark INFO: Collecting env info (might take some time)
2018-11-21 17:02:51,106 maskrcnn_benchmark INFO:
PyTorch version: 1.0.0a0+5d0ef34
Is debug build: No
CUDA used to build PyTorch: 9.1.85

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: TITAN X (Pascal)

Nvidia driver version: 390.59
cuDNN version: Probably one of the following:
/usr/local/cuda-9.1/lib64/libcudnn.so
/usr/local/cuda-9.1/lib64/libcudnn.so.7
/usr/local/cuda-9.1/lib64/libcudnn.so.7.0.5
/usr/local/cuda-9.1/lib64/libcudnn_static.a

Error Info:
2018-11-21 19:57:51,134 maskrcnn_benchmark.trainer INFO: eta: 1 day, 17:36:49 iter: 23540 loss: 0.6437 (0.6677) loss_box_reg: 0.1438 (0.1584) loss_rpn_box_reg: 0.0716 (0.0752) loss_classifier: 0.3284 (0.3506) time: 0.4422 (0.4453) loss_objectness: 0.0696 (0.0836) data: 0.0059 (0.0088) lr: 0.005000 max mem: 3546
2018-11-21 19:57:59,999 maskrcnn_benchmark.trainer INFO: eta: 1 day, 17:36:39 iter: 23560 loss: 0.5838 (0.6677) loss_box_reg: 0.1453 (0.1583) loss_rpn_box_reg: 0.0342 (0.0751) loss_classifier: 0.3052 (0.3506) time: 0.4419 (0.4452) loss_objectness: 0.0557 (0.0836) data: 0.0045 (0.0088) lr: 0.005000 max mem: 3546
2018-11-21 19:58:08,802 maskrcnn_benchmark.trainer INFO: eta: 1 day, 17:36:29 iter: 23580 loss: 0.6262 (0.6676) loss_box_reg: 0.1667 (0.1583) loss_rpn_box_reg: 0.0962 (0.0752) loss_classifier: 0.2801 (0.3505) time: 0.4419 (0.4452) loss_objectness: 0.0805 (0.0836) data: 0.0054 (0.0088) lr: 0.005000 max mem: 3546
2018-11-21 19:58:17,906 maskrcnn_benchmark.trainer INFO: eta: 1 day, 17:36:23 iter: 23600 loss: 0.5771 (0.6676) loss_box_reg: 0.1431 (0.1583) loss_rpn_box_reg: 0.0493 (0.0752) loss_classifier: 0.3077 (0.3505) time: 0.4511 (0.4453) loss_objectness: 0.0478 (0.0836) data: 0.0049 (0.0088) lr: 0.005000 max mem: 3546
2018-11-21 19:58:26,721 maskrcnn_benchmark.trainer INFO: eta: 1 day, 17:36:13 iter: 23620 loss: 0.6393 (0.6676) loss_box_reg: 0.1706 (0.1583) loss_rpn_box_reg: 0.0479 (0.0752) loss_classifier: 0.3366 (0.3505) time: 0.4390 (0.4452) loss_objectness: 0.0615 (0.0836) data: 0.0050 (0.0088) lr: 0.005000 max mem: 3546
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/queue.py", line 18, in send
    self.send_bytes(buf.getvalue())
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/home/jario/spire-net-1810/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
    loss_dict = model(images, targets)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 479, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/deprecated/distributed.py", line 223, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 479, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jario/spire-net-1810/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 49, in forward
    features = self.backbone(images.tensors)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 479, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 479, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jario/spire-net-1810/maskrcnn-benchmark/maskrcnn_benchmark/modeling/backbone/resnet.py", line 117, in forward
    x = self.stem(x)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 479, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jario/spire-net-1810/maskrcnn-benchmark/maskrcnn_benchmark/modeling/backbone/resnet.py", line 287, in forward
    x = self.conv1(x)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 479, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jario/spire-net-1810/maskrcnn-benchmark/maskrcnn_benchmark/layers/misc.py", line 33, in forward
    return super(Conv2d, self).forward(x)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 313, in forward
    self.padding, self.dilation, self.groups)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 274, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 16560) is killed by signal: Killed.

@jario-jin
Contributor Author

I used Detectron (https://github.com/facebookresearch/Detectron) for training in the same environment and did not encounter this problem, so the dataset should be fine. Has anyone else had the same problem?

@fmassa
Contributor

fmassa commented Nov 22, 2018

Can you try setting DATALOADER.NUM_WORKERS to 0 and run again? This should give you a better idea of the problem.

Also, one possibility is that one of the images in your dataset is corrupted. OpenCV (which Detectron uses) loads corrupted images without complaining, while PIL raises a warning by default, so that might be the issue.
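
If you want to rule out a corrupted file up front, one option is to walk the image folder with PIL before training. A minimal sketch (the folder path is just a placeholder for wherever your COCO images live):

# Hypothetical one-off check: list images that PIL cannot fully decode.
# Assumes Pillow is installed; replace the placeholder path with your dataset folder.
import os
from PIL import Image

def find_corrupted_images(image_dir):
    bad = []
    for name in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, name)
        try:
            with Image.open(path) as img:
                img.verify()            # checks the file structure only
            with Image.open(path) as img:
                img.convert("RGB")      # forces a full decode, catches truncated files
        except Exception as exc:
            bad.append((path, exc))
    return bad

if __name__ == "__main__":
    for path, exc in find_corrupted_images("datasets/coco/train2017"):
        print("%s: %r" % (path, exc))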

@LaoYang1994

I have run into this problem twice. Has anyone solved it?

@jario-jin
Contributor Author

This problem occurs occasionally.
When I set DATALOADER.NUM_WORKERS to 0, the error did not occur again, but training became slower.
I still haven't found the root cause.

@LaoYang1994

Apart from setting DATALOADER.NUM_WORKERS to 0, is there any other way to solve this problem?

@eric-xw

eric-xw commented Apr 20, 2019

I encountered the same issue at iteration 54700. Any fix for this?

@eric-xw

eric-xw commented Apr 22, 2019

In my case, it was caused by running out of CPU memory. After I increased the memory available to my job on the cluster, the problem went away.
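
If you want to confirm that memory really is the culprit before requesting a bigger allocation, you can log how much RAM the training process and its DataLoader workers are using. A rough sketch, assuming the psutil package is available (call it every few hundred iterations from wherever is convenient):

# Rough memory-watch helper (assumes the psutil package is installed).
# The DataLoader workers are child processes of the trainer, so summing
# over the children gives an approximate total footprint.
import psutil

def log_memory():
    main = psutil.Process()
    procs = [main] + main.children(recursive=True)
    rss_gb = sum(p.memory_info().rss for p in procs) / 1024.0 ** 3
    avail_gb = psutil.virtual_memory().available / 1024.0 ** 3
    print("train RSS: %.1f GB, host available: %.1f GB" % (rss_gb, avail_gb))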

@lironmo

lironmo commented Aug 8, 2019

Same here:
ERROR - worker:73 - call - trainer crashed. Error: DataLoader worker (pid 1159) is killed by signal: Killed.

@mainguyenanhvu

mainguyenanhvu commented Sep 20, 2019

I ran it on the CPU and hit the same error: RuntimeError: DataLoader worker (pid 16560) is killed by signal: Killed.

I followed @fmassa's suggestion; the process was still killed, and the terminal printed:

Count of instances per bin: [85308 31958]
Test: [ 0/5000] eta: 8:26:32 model_time: 5.8156 (5.8156) evaluator_time: 0.1168 (0.1168) time: 6.0785 data: 0.1460
Killed

Please help me. Thanks a lot.

@SaynaEbrahimi

I removed def _init_fn(): np.random.seed(args.seed) from my torch.utils.data.DataLoader(). It now works with 4 workers, but it still does not use multiple GPUs when I wrap the model with torch.nn.DataParallel(model).cuda(). Has anyone encountered this?
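
In case it helps anyone else: if that _init_fn was only there for reproducible augmentation, the usual pattern is to keep a worker_init_fn but give each worker its own seed rather than removing it entirely. A small sketch (base_seed and the tiny dummy dataset are placeholders, not from this repo):

# Sketch of per-worker seeding; base_seed stands in for args.seed.
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

base_seed = 42  # placeholder for args.seed

def worker_init_fn(worker_id):
    # each worker gets its own NumPy stream instead of all sharing one seed
    np.random.seed(base_seed + worker_id)

dataset = TensorDataset(torch.zeros(8, 3))  # dummy dataset just to show the wiring
loader = DataLoader(dataset, batch_size=2, num_workers=4,
                    worker_init_fn=worker_init_fn)

Whether that seeding function was actually what got the workers killed is a separate question; by itself it allocates very little memory.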

@Tato14

Tato14 commented Dec 5, 2019

Hi @SaynaEbrahimi, I am not able to find def _init_fn(): np.random.seed(args.seed) in the dataloader.py file. Should I look for it somewhere else, or is it a matter of versions?

@qingnanli

@SaynaEbrahimi @Tato14 Have you solved this problem? I ran into the same issue: xingyizhou/CenterNet#566

@b5l

b5l commented Feb 20, 2020

If you are also experiencing this problem, I found a solution: in my case, the Out Of Memory (OOM) killer killed the process. The fix was to create a bigger swap file and let the system use it:

linux@linux:~# fallocate -l 64G /tmp/swap
linux@linux:~# chmod 600 /tmp/swap
linux@linux:~# mkswap /tmp/swap
linux@linux:~# swapon /tmp/swap

You can verify that the swap is working by running swapon --show.

@Michel-liu

For me, this problem sometimes happens inside a Docker container. I usually give the container more CPUs than DATALOADER.NUM_WORKERS and more memory than the dataset needs.
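
If it helps, you can print what the container actually sees before starting a long run; DataLoader workers also pass batches through /dev/shm, which Docker limits to 64 MB unless you raise --shm-size, so it is worth checking too. A small sketch, assuming Linux and the psutil package:

# Container sanity check (assumes Linux and the psutil package).
import os
import psutil

def report_container_resources():
    mem = psutil.virtual_memory()
    shm = os.statvfs("/dev/shm")
    print("cpus: %s" % psutil.cpu_count())
    print("ram total: %.1f GB" % (mem.total / 1024.0 ** 3))
    print("/dev/shm size: %.2f GB" % (shm.f_frsize * shm.f_blocks / 1024.0 ** 3))

if __name__ == "__main__":
    report_container_resources()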

@lishen

lishen commented Apr 15, 2020

> If you are also experiencing this problem, I found a solution: in my case, the Out Of Memory (OOM) killer killed the process. The fix was to create a bigger swap file and let the system use it:
>
> linux@linux:~# fallocate -l 64G /tmp/swap
> linux@linux:~# chmod 600 /tmp/swap
> linux@linux:~# mkswap /tmp/swap
> linux@linux:~# swapon /tmp/swap
>
> You can verify that the swap is working by running swapon --show.

Yeah! This solved my problem. I can finally use multiple workers.

@pzhren

pzhren commented May 29, 2022

> For me, this problem sometimes happens inside a Docker container. I usually give the container more CPUs than DATALOADER.NUM_WORKERS and more memory than the dataset needs.

I'm in the same situation. How do I fix it?
