
Training fails if using multiple gpus #697

Open
cocacola0 opened this issue Apr 30, 2020 · 5 comments

Comments

@cocacola0

I want to train coco_dla_2x on 8 GPUs with the following command:
python main.py ctdet --exp_id coco_dla_2x --batch_size 8 --master_batch 1 --lr 5e-4 --gpus 0,1,2,3,4,5,6,7 --num_workers 8 --num_epochs 230
and I get a CUDA out-of-memory error. However, if I use a single GPU, training works just fine (with the same command). I should also mention that all the GPUs are identical.
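
For context, here is roughly how I understand the per-GPU chunk sizes to be derived from --batch_size and --master_batch (a sketch of the splitting logic as I read it, not the repo's actual code; the function name is mine):

# Sketch of the batch-splitting logic as I understand it: the master GPU
# gets --master_batch samples and the rest are spread over the other GPUs.
def split_batch(batch_size, master_batch, num_gpus):
    if master_batch == -1:                      # default: split evenly
        master_batch = batch_size // num_gpus
    chunk_sizes = [master_batch]
    rest = batch_size - master_batch
    for i in range(num_gpus - 1):
        chunk = rest // (num_gpus - 1)
        if i < rest % (num_gpus - 1):           # distribute the remainder
            chunk += 1
        chunk_sizes.append(chunk)
    return chunk_sizes

print(split_batch(8, 1, 8))  # -> [1, 1, 1, 1, 1, 1, 1, 1]

So each GPU should only see a single sample per step, which is why the OOM surprises me.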

@xingyizhou
Owner

That's strange. Can you try removing --master_batch? If that doesn't work, can you specify your CUDA/PyTorch version and GPU type?
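
Something like this will print the relevant details (standard PyTorch calls, nothing CenterNet-specific):

import torch

print(torch.__version__)              # PyTorch version
print(torch.version.cuda)             # CUDA version PyTorch was built with
print(torch.cuda.device_count())      # number of visible GPUs
print(torch.cuda.get_device_name(0))  # GPU type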

@cocacola0
Author

Thanks for responding! It didn't work. I use PyTorch 1.0.0 and CUDA 10.0 with Titan RTX GPUs on Ubuntu 18.04.3. Also, I share this server with my other uni colleagues, if that matters.
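
Since the server is shared, a quick way to see how much memory other jobs already hold on each GPU is a plain nvidia-smi query (nothing specific to this repo):

import subprocess

# Report per-GPU memory usage; other users' processes show up here too.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv"],
    capture_output=True, text=True)
print(out.stdout)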

@cocacola0
Author

If it helps, here is the entire error message:
(base) adriantura@tmas395x:~/CenterNetDoneUntilEaster/CenterNet/src$ python main.py ctdet --exp_id coco_dla_2x --batch_size 8 --lr 5e-4 --gpus 0,1,2,3,4,5,6,7 --num_workers 16 --num_epochs 230
Fix size testing.
training chunk_sizes: [1, 1, 1, 1, 1, 1, 1, 1]
The output will be saved to /home/adriantura/CenterNetDoneUntilEaster/CenterNet/src/lib/../../exp/ctdet/coco_dla_2x
heads {'hm': 80, 'wh': 2, 'reg': 2}
Creating model...
Setting up data...
==> initializing coco 2017 val data.
loading annotations into memory...
Done (t=0.60s)
creating index...
index created!
Loaded val 5000 samples
==> initializing coco 2017 train data.
loading annotations into memory...
Done (t=15.95s)
creating index...
index created!
Loaded train 118287 samples
Starting training...
ctdet/coco_dla_2x
Traceback (most recent call last):
File "main.py", line 106, in
main(opt)
File "main.py", line 72, in main
log_dict_train, _ = trainer.train(epoch, train_loader)
File "/home/adriantura/CenterNetDoneUntilEaster/CenterNet/src/lib/trains/base_trainer.py", line 119, in train
return self.run_epoch('train', epoch, data_loader)
File "/home/adriantura/CenterNetDoneUntilEaster/CenterNet/src/lib/trains/base_trainer.py", line 69, in run_epoch
output, loss, loss_stats = model_with_loss(batch)
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 139, in forward
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 35, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim) if inputs else []
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
return scatter_map(inputs)
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
return list(map(type(obj), zip(*map(scatter_map, obj.items()))))
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
return Scatter.apply(target_gpus, None, dim, obj)
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
File "/home/adriantura/anaconda3/lib/python3.7/site-packages/torch/cuda/comm.py", line 148, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: out of memory (malloc at /opt/conda/conda-bld/pytorch_1544202130060/work/aten/src/THC/THCCachingAllocator.cpp:205)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f6e9dc03cc5 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x135af8f (0x7f6ea16f4f8f in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: + 0x135b79a (0x7f6ea16f579a in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, at::TensorOptions const&) + 0x2d6 (0x7f6ea2d5c1c6 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::CUDAFloatType::empty(c10::ArrayRef<long>, at::TensorOptions const&) const + 0x161 (0x7f6ea1606931 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #5: torch::autograd::VariableType::empty(c10::ArrayRef<long>, at::TensorOptions const&) const + 0x179 (0x7f6e9b0a6bc9 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #6: at::TypeDefault::copy(at::Tensor const&, bool, c10::optional<c10::Device>) const + 0x122 (0x7f6e9e5dede2 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #7: + 0x5fa057 (0x7f6e9e40f057 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #8: at::native::to(at::Tensor const&, at::TensorOptions const&, bool, bool) + 0x295 (0x7f6e9e410cd5 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #9: at::TypeDefault::to(at::Tensor const&, at::TensorOptions const&, bool, bool) const + 0x17 (0x7f6e9e5a4d27 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #10: torch::autograd::VariableType::to(at::Tensor const&, at::TensorOptions const&, bool, bool) const + 0x17a (0x7f6e9b04cb2a in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #11: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef<long>, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>, std::allocator<c10::optional<at::cuda::CUDAStream> > > > const&) + 0x491 (0x7f6ede76d161 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x4fae71 (0x7f6ede772e71 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #13: + 0x112176 (0x7f6ede38a176 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #21: THPFunction_apply(_object*, _object*) + 0x5a1 (0x7f6ede585bf1 in /home/adriantura/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

@xingyizhou
Owner

I am assuming you also changed the DCN version to the PyTorch >= 1.0 branch (if not, please do so). I'm not sure whether RTX cards work well with CUDA 10.0, so I would suggest upgrading CUDA and PyTorch for RTX ... I am using torch 1.4, CUDA 10.2, and an RTX 2080, and it works fine.
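
A quick way to confirm that the compiled DCN extension matches your PyTorch/CUDA install is to run a tiny forward pass, e.g. (the import name and constructor arguments here are from the pytorch_1.0 DCNv2 branch I use; adjust if yours differs):

import torch
from dcn_v2 import DCN  # the extension built under src/lib/models/networks/DCNv2

# A 3x3 deformable conv layer; a mismatch between the build and the installed
# PyTorch/CUDA usually fails right at import or at this forward call.
layer = DCN(64, 64, kernel_size=(3, 3), stride=1, padding=1).cuda()
x = torch.randn(2, 64, 32, 32).cuda()
print(layer(x).shape)  # expect torch.Size([2, 64, 32, 32])

If the import or the forward pass fails, rebuild DCNv2 against the new PyTorch/CUDA before retrying multi-GPU training.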

@QihuaCheng

You can try reducing --num_workers.
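
For example, the same command with fewer loader workers (the exact number is just a guess, tune it for your machine):

python main.py ctdet --exp_id coco_dla_2x --batch_size 8 --lr 5e-4 --gpus 0,1,2,3,4,5,6,7 --num_workers 4 --num_epochs 230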
