How can I run training on a single GPU other than GPU 0? --gpus 1 doesn't work #440

Closed
Shawnnnnn opened this issue Oct 31, 2019 · 11 comments

Comments

@Shawnnnnn

I set --gpus 1 when launching training, but it still runs on GPU 0. The flag doesn't seem to have any effect.
How can I fix this? Thanks for your work!

@xingyizhou
Owner

I haven't run into this. Can you be more specific?

@Shawnnnnn
Author

I set --gpus 2 in the launch command:
python main.py --task ctdet --exp_id xray --batch_size 16 --master_batch_size 16 --lr 1.25e-4 --gpus 2 --save_all --resume
The parameter log is below:
==> Opt:
K: 100
aggr_weight: 0.0
agnostic_ex: False
arch: dla_34
aug_ddd: 0.5
aug_rot: 0
batch_size: 16
cat_spec_wh: False
center_thresh: 0.1
chunk_sizes: [16]
data_dir: /mnt/workspace/member/chengxiao/centerNet/CentNet-debug/src/lib/../../data
dataset: xray
debug: 0
debug_dir: /mnt/workspace/member/chengxiao/centerNet/CentNet-debug/src/lib/../../exp/ctdet/xray/debug
debugger_theme: white
demo: ../images/33823288584_1d21cf0a26_k.jpg
dense_hp: False
dense_wh: False
dep_weight: 1
dim_weight: 1
down_ratio: 4
eval_oracle_dep: False
eval_oracle_hm: False
eval_oracle_hmhp: False
eval_oracle_hp_offset: False
eval_oracle_kps: False
eval_oracle_offset: False
eval_oracle_wh: False
exp_dir: ../../exp/ctdet
exp_id: xray
fix_res: True
flip: 0.5
flip_test: False
gpus: [0]
gpus_str: 2
head_conv: 256
heads: {'hm': 22, 'wh': 2, 'reg': 2}
hide_data_time: False
hm_hp: True
hm_hp_weight: 1
hm_weight: 1
hp_weight: 1
input_h: 512
input_res: 512
input_w: 512
keep_res: False
kitti_split: 3dop
load_model: ../../exp/ctdet/xray/model_last.pth
lr: 0.000125
lr_step: [90, 120]
master_batch_size: 16
mean: [[[0.87136596 0.8363191 0.76000565]]]
metric: loss
mse_loss: False
nms: False
no_color_aug: False
norm_wh: False
not_cuda_benchmark: False
not_hm_hp: False
not_prefetch_test: False
not_rand_crop: False
not_reg_bbox: False
not_reg_hp_offset: False
not_reg_offset: False
num_classes: 22
num_epochs: 140
num_iters: -1
num_stacks: 1
num_workers: 4
off_weight: 1
output_h: 128
output_res: 128
output_w: 128
pad: 31
peak_thresh: 0.2
print_iter: 0
rect_mask: False
reg_bbox: True
reg_hp_offset: True
reg_loss: l1
reg_offset: True
resume: True
root_dir: /mnt/workspace/member/chengxiao/centerNet/CentNet-debug/src/lib/../..
rot_weight: 1
rotate: 0
save_all: True
save_dir: /mnt/workspace/member/chengxiao/centerNet/CentNet-debug/src/lib/../../exp/ctdet/xray
scale: 0.4
scores_thresh: 0.1
seed: 317
shift: 0.1
std: [[[0.21791764 0.19087061 0.24200565]]]
task: ctdet
test: False
test_scales: [1.0]
trainval: False
val_intervals: 5
vis_thresh: 0.3
wh_weight: 0.1

GPU usage is shown below. You can see that GPU 2 is idle.

Wed Nov 6 09:53:11 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43 Driver Version: 418.43 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:03:00.0 On | N/A |
| 39% 74C P2 170W / 250W | 10737MiB / 10986MiB | 51% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:04:00.0 Off | N/A |
| 48% 84C P2 196W / 250W | 9060MiB / 10989MiB | 81% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:81:00.0 Off | N/A |
| 29% 39C P8 3W / 250W | 11MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:82:00.0 Off | N/A |
| 42% 77C P2 129W / 250W | 8908MiB / 10989MiB | 46% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2676 G /usr/lib/xorg/Xorg 60MiB |
| 0 2871 G /usr/bin/gnome-shell 78MiB |
| 0 32645 C python 1367MiB |
| 0 52838 C python 1566MiB |
| 0 55014 C python 7651MiB |
| 1 5736 C python 9049MiB |
| 3 45525 C python 8897MiB |
+-----------------------------------------------------------------------------+

@Shawnnnnn
Author

I think I know what causes this problem.
This line in opts.py has a bug: opt.gpus = [i for i in range(len(opt.gpus))] if opt.gpus[0] >=0 else [-1]
If opt.gpus = [1], this line sets opt.gpus = [0], because len(opt.gpus) == 1 and range(1) yields only i = 0.
Why was this line added?
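
The remapping described above can be reproduced without the rest of the training code. This is a minimal sketch, assuming the --gpus string is split on commas before the quoted line runs (consistent with the gpus_str: 2 / gpus: [0] pair in the log); parse_gpus is a hypothetical helper name, not a function from the repository.

```python
def parse_gpus(gpus_str):
    """Hypothetical helper wrapping the line quoted from opts.py."""
    # Assumption: the flag value is a comma-separated list of GPU ids.
    gpus = [int(g) for g in gpus_str.split(',')]
    # The quoted line: ids are replaced by 0..len-1, so the physical id is lost.
    gpus = [i for i in range(len(gpus))] if gpus[0] >= 0 else [-1]
    return gpus


print(parse_gpus('2'))      # [0]
print(parse_gpus('1,2,3'))  # [0, 1, 2]
print(parse_gpus('-1'))     # [-1]
```

Whatever single id is passed, the result is [0], which is why the log shows gpus: [0] for --gpus 2.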

@xingyizhou
Owner

This is intended. I reset CUDA_VISIBLE_DEVICES to the original --gpus 1 string here, so that opt.gpus[0] maps to the first GPU listed in --gpus 1. You can try commenting out opt.gpus = [i for i in range(len(opt.gpus))] if opt.gpus[0] >=0 else [-1] and this line, and setting CUDA_VISIBLE_DEVICES manually.
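
The intended design can be sketched as follows: the physical ids go into CUDA_VISIBLE_DEVICES, and the code then uses logical indices starting at 0. This is only an illustration under that reading of the comment above; select_gpus is a hypothetical name, and the assignment has an effect only if it happens before CUDA is initialized in the process.

```python
import os


def select_gpus(gpus_str):
    """Hypothetical helper: expose only the requested GPUs, return logical ids."""
    physical = [int(g) for g in gpus_str.split(',')]
    if physical[0] < 0:
        return [-1]  # CPU mode, leave the environment untouched
    # Restrict the process to the requested physical GPUs.
    os.environ['CUDA_VISIBLE_DEVICES'] = gpus_str
    # Inside the process, the visible GPUs are renumbered from 0.
    return list(range(len(physical)))


logical = select_gpus('2')
print(os.environ['CUDA_VISIBLE_DEVICES'], logical)  # 2 [0]
```

With CUDA_VISIBLE_DEVICES=2 in effect, device index 0 inside the process refers to physical GPU 2, so gpus: [0] in the log is expected as long as the variable is honored.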

@Shawnnnnn
Author

I commented out opt.gpus = [i for i in range(len(opt.gpus))] if opt.gpus[0] >=0 else [-1] and set os.environ['CUDA_VISIBLE_DEVICES'] = '2' in main.py. It still runs on GPU 0 :( That's weird.

@Shawnnnnn
Author

I tried a few ways to solve it.
Setting os.environ['CUDA_VISIBLE_DEVICES'] = "1" in the code doesn't work; training still runs on GPU 0.
Launching with CUDA_VISIBLE_DEVICES=1 python main.py works; training runs on GPU 1.
Calling torch.cuda.set_device(1) also works; training runs on GPU 1.
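
The shell-prefix form works because the variable is already in the environment when the Python process starts. The same effect can be had from a launcher script, sketched below with the standard subprocess module; run_with_visible_gpus is a hypothetical helper, and the child command is a stand-in that only prints the variable (in practice it would be main.py with its arguments).

```python
import os
import subprocess
import sys


def run_with_visible_gpus(gpu_ids, argv):
    """Hypothetical launcher: start argv with CUDA_VISIBLE_DEVICES preset."""
    # Copy the current environment and override one variable for the child.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_ids)
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in child process; replace with [sys.executable, 'main.py', ...] in practice.
child = [sys.executable, '-c', "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_with_visible_gpus('1', child)
print(result.stdout.strip())  # 1
```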

@xingyizhou
Owner

Closing this since the problem is solved.

@xukuanHIT

@Shawnnnnn
Hi, I guess it would work if you set os.environ['CUDA_VISIBLE_DEVICES'] = "1" at the beginning of the file (outside main()).

@hemp110

hemp110 commented Feb 17, 2020

@Shawnnnnn
os.environ['CUDA_VISIBLE_DEVICES'] = "1" should appear before import torch and before importing any module that itself imports torch, e.g. from models.model import create_model, load_model, save_model.
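
A sketch of that ordering, with a small guard that reports when the assignment comes too late. set_visible_devices is a hypothetical helper, not part of the repository. Whether a late assignment is actually ignored may depend on the PyTorch version and on when CUDA gets initialized, so the guard only warns.

```python
import os
import sys


def set_visible_devices(gpu_ids):
    """Hypothetical guard: set CUDA_VISIBLE_DEVICES, return True if set before torch was imported."""
    already_imported = 'torch' in sys.modules
    os.environ['CUDA_VISIBLE_DEVICES'] = gpu_ids
    return not already_imported


# Very first statements of the entry script, before any project imports.
set_early = set_visible_devices('1')
if not set_early:
    print('warning: torch was imported first; the setting may be ignored',
          file=sys.stderr)

# Only after this point import torch and anything that imports it, e.g.:
# import torch
# from models.model import create_model, load_model, save_model
```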

@lijain

lijain commented Jun 16, 2020

python src/main.py ctdet --exp_id dior_dla20200616 --batch_size 24 --lr 2e-4 --gpus 1,2,3 --lr_step 30,60,80 --num_epochs 100
In main.py, CUDA_VISIBLE_DEVICES=1,2,3 is set, together with
opt.device = torch.device('cuda' if opt.gpus[0] >= 0 else 'cpu'), but GPU 0 still keeps being used. Any help would be appreciated.

@tuanlda78202

Thank you so much @Shawnnnnn
