Dataloader crashes if num_worker>0 #566

Open
qingnanli opened this issue Jan 3, 2020 · 1 comment


@qingnanli

Hi xingyizhou, thanks for sharing the code! I have run into a problem.
With num_workers = 0, we can train the network on the KITTI dataset without issues. With num_workers > 0, however, training crashes. My environment and the pdb trace follow:

Ubuntu 16.04
PyTorch 1.0.1.post2
Python 3.6

~/Downloads/qingqing_disk/p4600_disk/CenterNet/src/lib/trains/base_trainer.py(63)run_epoch()
58 num_iters = len(data_loader) if opt.num_iters < 0 else opt.num_iters
59 bar = Bar('{}/{}'.format(opt.task, opt.exp_id), max=num_iters)
60 end = time.time()
61 import pdb
62 pdb.set_trace()
63 -> for iter_id, batch in enumerate(data_loader):
64 if iter_id >= num_iters:

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py(818)__iter__()
818 def __iter__(self):
819 -> return _DataLoaderIter(self)

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py(560)__init__()
557 # it started, so that we do not call .join() if program dies
558 # before it starts, and __del__ tries to join but will get:
559 # AssertionError: can only join a started process.
560 -> w.start()
561 self.index_queues.append(index_queue)
562 self.workers.append(w)

~/anaconda3/lib/python3.6/multiprocessing/process.py(105)start()
102 assert not _current_process._config.get('daemon'), \
103 'daemonic processes are not allowed to have children'
104 _cleanup()
105 -> self._popen = self._Popen(self)
106 self._sentinel = self._popen.sentinel

~/anaconda3/lib/python3.6/multiprocessing/context.py(223)_Popen()
219 class Process(process.BaseProcess):
220 _start_method = None
221 @staticmethod
222 def _Popen(process_obj):
223 -> return _default_context.get_context().Process._Popen(process_obj)

~/anaconda3/lib/python3.6/multiprocessing/context.py(277)_Popen()
272 class ForkProcess(process.BaseProcess):
273 _start_method = 'fork'
274 @staticmethod
275 def _Popen(process_obj):
276 from .popen_fork import Popen
277 -> return Popen(process_obj)

~/anaconda3/lib/python3.6/multiprocessing/popen_fork.py(19)__init__()
16 def __init__(self, process_obj):
17 util._flush_std_streams()
18 self.returncode = None
19 -> self._launch(process_obj)

~/anaconda3/lib/python3.6/multiprocessing/popen_fork.py(66)_launch()
63 def _launch(self, process_obj):
64 code = 1
65 parent_r, child_w = os.pipe()
66 -> self.pid = os.fork()
67 if self.pid == 0:
68 try:
69 os.close(parent_r)
70 if 'random' in sys.modules:
71 import random

At the line self.pid = os.fork(), I can neither step into os.fork() nor press n to continue, so training never proceeds. However, calling os.fork() directly from a terminal seems to work:
qingqing@qingqing-PowerEdge-T630:~$ python
Python 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.fork()
23346
0
>>> >>>
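
A bare os.fork() in the REPL only partly matches what the trace shows, since the DataLoader reaches os.fork() through multiprocessing's ForkProcess. Also, once a process forks under pdb, parent and child may both end up reading from the same terminal, so being unable to step there does not necessarily mean the fork itself failed. A minimal sketch of a closer check, outside pdb and without PyTorch, could look like the following. The names fork_worker_starts and _worker are hypothetical and not part of CenterNet or PyTorch; the sketch assumes a platform where the "fork" start method is available (Linux here).

```python
import multiprocessing as mp
from queue import Empty


def _worker(q):
    # Runs in the child process; reports back so the parent knows it started.
    q.put("started")


def fork_worker_starts(timeout=10.0):
    """Return True if a forked child starts and reports back within `timeout` seconds."""
    ctx = mp.get_context("fork")  # same start method as ForkProcess in the trace
    q = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(q,))
    proc.daemon = True  # assumption: mirror a daemonic worker process
    proc.start()
    try:
        return q.get(timeout=timeout) == "started"
    except Empty:
        # The child did not report back in time.
        return False
    finally:
        proc.join(timeout)
        if proc.is_alive():
            proc.terminate()


if __name__ == "__main__":
    print("fork worker started:", fork_worker_starts())
```

If this script reports that the worker started when run from a plain shell, the stall is more likely tied to the training process (or to running it under pdb) than to fork on this machine.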

My problem looks similar to pytorch/pytorch#25302, although that reporter is on Windows 10.

I'm stuck. Could you help me? Thanks!

@qingnanli

ddd/3dop |############################### | train: [5][305/309]|Tot: 0:11:30 |ETA: 0:00:10 |loss 2.1862 |hm_loss 0.5696 |dep_loss 0.1432 |dim_
ddd/3dop |############################### | train: [5][306/309]|Tot: 0:11:32 |ETA: 0:00:07 |loss 2.1857 |hm_loss 0.5695 |dep_loss 0.1429 |dim_
ddd/3dop |############################### | train: [5][307/309]|Tot: 0:11:35 |ETA: 0:00:05 |loss 2.1850 |hm_loss 0.5687 |dep_loss 0.1429 |dim_
ddd/3dop |################################| train: [5][308/309]|Tot: 0:11:37 |ETA: 0:00:03 |loss 2.1855 |hm_loss 0.5692 |dep_loss 0.1427 |dim_
loss 0.0091 |rot_loss 1.4146 |wh_loss 0.3049 |off_loss 0.0194 |Data 0.352s(0.367s) |Net 2.257s
ddd/3dop

With num_workers = 0, training stops when the next epoch should start; the output above is where it ends.
