
how to specify the training params lr and lr_step #50

Closed
sevenseablue opened this issue May 6, 2019 · 5 comments
sevenseablue commented May 6, 2019

I modified the code to train detection on my own data: 280 classes, 80,000 images. I run the command below:

python main.py ctdet --exp_id pascal_dla_512 --dataset pascal --input_res 512 --num_epochs 230 --batch_size 47 --master_batch 7 --lr 5e-4 --lr_step 180,210 --gpus 0,1,2,3,4,5 --num_workers 12

But I cannot find good values for the params lr and lr_step.
After the program had run for 7 days (K80, slow):

81117734 May  3 23:07 model_best.pth
243113237 May  6 10:44 model_last.pth

The program runs under nohup, so I cannot see the loss and progress.
Only by loading model_best did I find that it was saved at the 75th epoch.
It has not been updated for two days, roughly 50 epochs, without improving further.

  1. Is there a way to see the training progress and the loss?
  2. How should I specify the params lr and lr_step? Is there a way to make lr and lr_step adapt automatically during training, for example, decreasing the lr when the loss increases three times in a row?
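A loss-driven schedule like the one described in question 2 is essentially what PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` implements. A minimal standalone sketch of the idea (class name and thresholds are illustrative, not from the CenterNet repository):

```python
class PlateauLR:
    """Decrease the learning rate when the loss stops improving.

    Illustrative sketch only; in a real PyTorch training loop you would
    use torch.optim.lr_scheduler.ReduceLROnPlateau instead.
    """

    def __init__(self, lr, factor=0.1, patience=2):
        self.lr = lr              # current learning rate
        self.factor = factor      # multiply lr by this on a plateau
        self.patience = patience  # tolerated consecutive non-improving epochs
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# With patience=2, the third consecutive non-improving loss drops the lr.
sched = PlateauLR(lr=5e-4, patience=2)
for loss in [2.0, 1.5, 1.6, 1.7, 1.8]:
    lr = sched.step(loss)
print(lr)  # 5e-05 after three consecutive increases
```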
@xingyizhou (Owner)

Hi,

  1. The program uses progress to dynamically print logs to the screen. If that does not work, you can set --print_iter 1 to fall back to the standard Python print function. The log, as well as a tensorboard log, is also saved to exp/ctdet/$exp_id.
  2. The learning rate 5e-4 is set for batch size 128; you will need to scale it linearly with your batch size (in your case, 5e-4 * 47 / 128).
  3. It is recommended to test with fewer epochs first, e.g., --num_epochs 70 --lr_step 60.
  4. The "best" model (the one with the lowest loss) is usually not the one with the highest AP.
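The linear scaling rule in point 2 can be written as a one-liner (the function name is illustrative):

```python
def scale_lr(base_lr, batch_size, base_batch_size=128):
    """Linear learning-rate scaling: lr grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

# Numbers from this thread:
print(scale_lr(5e-4, 47))  # batch 47 -> about 1.84e-4
print(scale_lr(5e-4, 16))  # batch 16 -> 6.25e-05
```

As used in this thread, the rule depends only on the total batch size, not on how it is split across GPUs: 5e-4 * 47 / 128 is recommended above for a 6-GPU run, and the 6.25e-5 used later in the thread matches 5e-4 * 16 / 128.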

@sevenseablue (Author)

@xingyizhou thank you.


lawpdas commented Jun 26, 2019

@xingyizhou The learning rate 5e-4 is set for batch size 128 on 8 GPUs (batch size 16 on each GPU). If I use 2 GPUs and a batch size of 40, should I set lr = 5e-4 * (40 / 2) / (128 / 8)?

@sisrfeng

Does anybody know why
"The 'best' model (with the lowest loss) is usually not the best one with the highest AP"?
Many thanks!

@sisrfeng

sisrfeng commented Apr 24, 2020

Re: "the program runs using nohup, and i can not see the loss and progress."

I can see the loss with this command:

nohup python main.py ctdet --exp_id coco_dla --batch_size 16 --master_batch 1 --lr 6.25e-5 --gpus 4,0 --print_iter 1 --num_workers 0 > ../trian_log.json 2>&1 &
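With the output redirected like this, the latest loss can be pulled out of the log file while training runs. A small sketch; the log path and the "loss &lt;number&gt;" pattern are assumptions about the log format, adjust them to your own run:

```shell
# Print the most recent "loss <number>" entry from a redirected training log.
# LOG path and the matched pattern are assumptions; adjust to your run.
LOG=../trian_log.json
if [ -f "$LOG" ]; then
    grep -o 'loss [0-9.]*' "$LOG" | tail -n 1
fi
```

For continuous monitoring, `tail -f` on the same file works as well.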
