
how to specify the training params lr and lr_step #50

Closed
sevenseablue opened this issue May 6, 2019 · 5 comments
sevenseablue commented May 6, 2019

I modified the code to train detection on my own data: 280 classes, 80,000 images. I run the command below:

python main.py ctdet --exp_id pascal_dla_512 --dataset pascal --input_res 512 --num_epochs 230 --batch_size 47 --master_batch 7 --lr 5e-4 --lr_step 180,210 --gpus 0,1,2,3,4,5 --num_workers 12

But I cannot find good values for the params lr and lr_step.
After the program had run for 7 days (K80, slow):

81117734 May  3 23:07 model_best.pth
243113237 May  6 10:44 model_last.pth

The program runs under nohup, so I cannot see the loss and progress.
Only by loading model_best did I find that it was saved at the 75th epoch.
It has not been updated for two days, roughly 50 epochs, without improving further.

  1. Is there a way to see the training progress and the loss?
  2. How should I specify the params lr and lr_step? Is there a way to make lr and lr_step adapt automatically during training, for example, decreasing the lr when the loss increases three times in a row?
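A loss-driven schedule like the one described in question 2 is essentially what PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` implements. A minimal standalone sketch of the idea (class name and thresholds are illustrative, not from the CenterNet repository):

```python
class PlateauLR:
    """Decrease the learning rate when the loss stops improving.

    Illustrative sketch only; in a real PyTorch training loop you would
    use torch.optim.lr_scheduler.ReduceLROnPlateau instead.
    """

    def __init__(self, lr, factor=0.1, patience=2):
        self.lr = lr              # current learning rate
        self.factor = factor      # multiply lr by this on a plateau
        self.patience = patience  # tolerated consecutive non-improving epochs
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# With patience=2, the third consecutive non-improving loss drops the lr.
sched = PlateauLR(lr=5e-4, patience=2)
for loss in [2.0, 1.5, 1.6, 1.7, 1.8]:
    lr = sched.step(loss)
print(lr)  # 5e-05 after three consecutive increases
```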
@xingyizhou (Owner)

Hi,

  1. The program uses progress to dynamically print logs to the screen. If that does not work, you can set --print_iter 1 to fall back to the standard Python print function. The log, as well as a tensorboard log, is also saved to exp/ctdet/$exp_id.
  2. The learning rate 5e-4 is set for batch size 128; you will need to scale it linearly with your batch size (in your case, 5e-4 * 47 / 128).
  3. It is recommended to test with fewer epochs first, e.g., --num_epochs 70 --lr_step 60.
  4. The "best" model (the one with the lowest loss) is usually not the one with the highest AP.
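The linear scaling rule in point 2 can be written as a one-liner (the function name is illustrative):

```python
def scale_lr(base_lr, batch_size, base_batch_size=128):
    """Linear learning-rate scaling: lr grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

# Numbers from this thread:
print(scale_lr(5e-4, 47))  # batch 47 -> about 1.84e-4
print(scale_lr(5e-4, 16))  # batch 16 -> 6.25e-05
```

As used in this thread, the rule depends only on the total batch size, not on how it is split across GPUs: 5e-4 * 47 / 128 is recommended above for a 6-GPU run, and the 6.25e-5 used later in the thread matches 5e-4 * 16 / 128.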

@sevenseablue (Author)

@xingyizhou thank you.


lawpdas commented Jun 26, 2019

@xingyizhou The learning rate 5e-4 is set for batch size 128 on 8 GPUs (batch size 16 on each GPU). If I use 2 GPUs and a batch size of 40, should I set lr = 5e-4 * (40 / 2) / (128 / 8)?

@sisrfeng

Does anybody know why
"The 'best' model (with the lowest loss) is usually not the best one with the highest AP"?
Many thanks!

@sisrfeng

sisrfeng commented Apr 24, 2020

Re: "the program runs using nohup, and i can not see the loss and progress."

I can see the loss with this command:

nohup python main.py ctdet --exp_id coco_dla --batch_size 16 --master_batch 1 --lr 6.25e-5 --gpus 4,0 --print_iter 1 --num_workers 0 > ../trian_log.json 2>&1 &
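With the output redirected like this, the latest loss can be pulled out of the log file while training runs. A small sketch; the log path and the "loss &lt;number&gt;" pattern are assumptions about the log format, adjust them to your own run:

```shell
# Print the most recent "loss <number>" entry from a redirected training log.
# LOG path and the matched pattern are assumptions; adjust to your run.
LOG=../trian_log.json
if [ -f "$LOG" ]; then
    grep -o 'loss [0-9.]*' "$LOG" | tail -n 1
fi
```

For continuous monitoring, `tail -f` on the same file works as well.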
