add MGDA, DG, and normal training code #9

Open · wants to merge 74 commits into master

Commits (74)
822e341  add MGDA, DG, and normal training code (mseg-dataset, Aug 23, 2020)
81360f0  clean up training script (mseg-dataset, Aug 23, 2020)
8498539  continue cleaning up the training script (mseg-dataset, Aug 23, 2020)
b1584ce  clean up imports (mseg-dataset, Aug 23, 2020)
8724d2e  rename StupidTaxonomyConverter to NaiveTaxonomyConverter (Sep 25, 2020)
82ab6a0  remove commented lines (Sep 25, 2020)
196d10c  remove commented out lines (Sep 25, 2020)
535cec1  remove commented out lines (Sep 25, 2020)
f3586f9  remove commented out lines (Sep 25, 2020)
9710a55  remove commented out lines (Sep 25, 2020)
2584bd7  remove commented out lines (Sep 25, 2020)
5c15c5d  remove commented out lines (Sep 25, 2020)
b30ac36  remove commented out lines (Sep 25, 2020)
19a5d26  remove commented out lines (Sep 25, 2020)
53b5d46  remove commented out lines (Sep 25, 2020)
c774215  remove commented out lines (Sep 25, 2020)
f0e5b64  remove commented-out lines (Sep 25, 2020)
86b941d  remove commented out lines (Sep 25, 2020)
db8390c  remove commented-out lines (Sep 25, 2020)
6a29c5e  remove commented-out lines (Sep 25, 2020)
fdb38a0  remove commented out lines (Sep 25, 2020)
b7c7a09  Create training.md (Sep 25, 2020)
7c2f909  remove commented-out lines (Sep 25, 2020)
dd57119  Update training.md (Sep 25, 2020)
0cc23ac  Update training.md (Sep 25, 2020)
0deeac1  remove commented-out lines (Sep 25, 2020)
ef89a9b  update instructions for training (johnwlambert, Oct 15, 2020)
20fb658  remove commented out lines (johnwlambert, Oct 15, 2020)
5ae9cac  remove commented out lines (johnwlambert, Oct 15, 2020)
81705c1  remove deprecated version ref in TaxonomyConverter (johnwlambert, Oct 15, 2020)
d7d88f9  remove tax version param (johnwlambert, Oct 15, 2020)
32be482  remove tax version param (johnwlambert, Oct 15, 2020)
1692391  remove tax version param (johnwlambert, Oct 15, 2020)
c2028d9  remove tax version param (johnwlambert, Oct 15, 2020)
5184faf  remove tax version param (johnwlambert, Oct 15, 2020)
c71d55c  remove tax version param (johnwlambert, Oct 15, 2020)
5fd9ed3  remove tax version param (johnwlambert, Oct 15, 2020)
fcfaecb  update ToFlatLabel to ToUniversalLabel (johnwlambert, Oct 17, 2020)
9b6a43f  clean up logic with naive taxonomy (johnwlambert, Oct 17, 2020)
96bec76  improve variable names (johnwlambert, Oct 17, 2020)
0f84e15  improve variable names (johnwlambert, Oct 17, 2020)
f3f9dbb  improve variable names (johnwlambert, Oct 17, 2020)
d1928bd  improve var names (johnwlambert, Oct 17, 2020)
fdbdec9  improve var names (johnwlambert, Oct 17, 2020)
cacd162  improve var names (johnwlambert, Oct 17, 2020)
b2b8d29  improve var names (johnwlambert, Oct 17, 2020)
41d48bf  improve var names (johnwlambert, Oct 17, 2020)
3726638  improve var names (johnwlambert, Oct 17, 2020)
f8afb3c  update args.tc.classes to args.tc.num_uclasses to reflect TaxononomyC… (johnwlambert, Oct 22, 2020)
b7ad193  remove outdated config (Dec 9, 2020)
566a1ad  clean up old yaml files, just pass dataset name at command line (Dec 9, 2020)
71e4cd7  remove unused config param (Dec 9, 2020)
dbe7b06  remove outdated configs (Dec 9, 2020)
12f6655  Delete unused configs (Dec 9, 2020)
61fc77b  remove unused configs (Dec 9, 2020)
dcda7e1  remove unused configs (Dec 9, 2020)
dee5a37  remove old VGA configs (Dec 9, 2020)
ea31c75  correct typo (Dec 9, 2020)
0917abb  clean up train.py logic (Dec 9, 2020)
ae063cc  clean up train.py logic (Dec 9, 2020)
751000e  remove tensorboard, since not using writer anyways (Dec 9, 2020)
d011fab  fix typos in train script (Dec 9, 2020)
6711886  merge master into training branch (Dec 9, 2020)
87cc9c8  Merge branch 'training' of https://github.com/mseg-dataset/mseg-seman… (Dec 9, 2020)
c987204  remove old print statements (Dec 9, 2020)
cb23214  reformat train script using Python black formatter (Dec 9, 2020)
e741a3a  reformat more code with python black and remove finetune option (unused) (Dec 9, 2020)
6592799  remove unused finetune option from configs (Dec 9, 2020)
b121012  make a separate function to just compute number of iterations required (Dec 9, 2020)
f8aac5c  reformat with black (Dec 9, 2020)
fdf3b60  clarify docstring when determining number of iters (Dec 9, 2020)
f45dfbb  edit docstring describing number of iters (Dec 9, 2020)
f6d6a8b  fix type hint (Dec 10, 2020)
e9e35bf  move apex docstring to training.md (Dec 10, 2020)
Changes shown from 1 commit:

commit f45dfbb15a40f87ea59f3ddf033b2586b2a26831
edit docstring describing number of iters
John Lambert committed Dec 9, 2020
47 changes: 23 additions & 24 deletions mseg_semantic/tool/train.py
@@ -417,35 +417,22 @@ def get_rank_to_dataset_map(args) -> Dict[int, str]:
 def set_number_of_training_iters(args):
     """
     There are two scenarios we consider to determine number of required training iters
-    when training on MSeg:
-
-    1. We are mixing many datasets together. We determine which dataset this GPU
-    is assigned to. Each GPU runs 1 process, and multiple GPU IDs may be assigned
-    to a single dataset.
+    when training on MSeg. We set a max number of training crops, and then subdivide the
+    work between our GPUs.
 
-    The max number of iters is the number of
-
-    2. We are training with a single dataset. Suppose we want to train for 1 million
+    1. We are training with a single dataset. Suppose we want to train for 1 million
     crops in total (args.num_examples). Suppose our dataset has 18k images. Then
     we will train for 56 epochs. Suppose our training node has 8 GPUs. Then
     with a batch size of 32, and 8 GPUs, we need ~3906 iterations to reach 1M crops.
-    """
-    if len(args.dataset) > 1:
-        rank_to_dataset_map = get_rank_to_dataset_map(args)
-        # # which dataset this gpu is for
-        args.dataset_name = rank_to_dataset_map[args.rank]
-        # within this dataset, its rank, i.e. 0,1,2,3 etc gpu ID assigned to this dataset
-        args.dataset_rank = args.dataset_gpu_mapping[args.dataset_name].index(args.rank)
-        args.num_replica_per_dataset = len(args.dataset_gpu_mapping[args.dataset_name])
-
-        # num_replicas_for_max_dataset = len(args.dataset_gpu_mapping[max_dataset_name])
-        # num_replicas_for_max_dataset = args.num_replica_per_dataset # assuming the same # replicas for each dataset
-        args.max_iters = math.floor(args.num_examples / (args.batch_size * args.num_replica_per_dataset))
-        # args.max_iters = iters_per_epoch_for_max_dataset * 3 # should be the max_iters for all dataset, args.epochs needs recompute later
-
-        logger.info(f'max_iters = {args.max_iters}')
-
-    elif (len(args.dataset) == 1) and (not args.use_mgda):
+
+    2. We are mixing many datasets together. We determine which dataset this GPU
+    is assigned to. Each GPU runs 1 process, and multiple GPU IDs (referred to
+    as replicas) may be assigned to a single dataset. The computation is the same
+    as before, except instead of counting all of the GPUs on the node, we only
+    count the number of replicas counting towards this dataset.
+    """
+    # single dataset training
+    if (len(args.dataset) == 1) and (not args.use_mgda):
         from util.txt_utils import read_txt_file
         # number of examples for 1 epoch of this dataset
         num_d_examples = len(read_txt_file(infos[args.dataset[0]].trainlist))

@@ -459,6 +446,18 @@ def set_number_of_training_iters(args):
         if args.epochs > 1000:
             args.save_freq = args.epochs // 100
 
+    # multiple dataset training
+    elif len(args.dataset) > 1:
+        rank_to_dataset_map = get_rank_to_dataset_map(args)
+        # # which dataset this gpu is for
+        args.dataset_name = rank_to_dataset_map[args.rank]
+        # within this dataset, its rank, i.e. 0,1,2,3 etc gpu ID assigned to this dataset
+        args.dataset_rank = args.dataset_gpu_mapping[args.dataset_name].index(args.rank)
+        args.num_replica_per_dataset = len(args.dataset_gpu_mapping[args.dataset_name])
+
+        args.max_iters = math.floor(args.num_examples / (args.batch_size * args.num_replica_per_dataset))
+        logger.info(f'max_iters = {args.max_iters}')
+
     return args
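
For readers skimming the hunks above, here is a minimal standalone sketch of the iteration arithmetic the new docstring describes. It is not the repository's implementation: compute_max_iters is a hypothetical helper, and the dataset names and dataset_gpu_mapping values below are made-up examples.

import math

def compute_max_iters(num_examples: int, batch_size: int, num_replicas: int) -> int:
    # Iterations needed for `num_replicas` GPUs, each consuming `batch_size`
    # crops per iteration, to jointly reach `num_examples` total training crops.
    return math.floor(num_examples / (batch_size * num_replicas))

# Scenario 1: one dataset spread across all 8 GPUs of the node.
# 1,000,000 / (32 * 8) = 3906.25 -> 3906, matching the docstring's "~3906".
print(compute_max_iters(num_examples=1_000_000, batch_size=32, num_replicas=8))

# With an 18k-image train list, 1M crops is roughly 56 epochs (1,000,000 / 18,000 ~= 55.6).
print(math.ceil(1_000_000 / 18_000))  # 56

# Scenario 2: mixed-dataset training; each dataset counts only its own replicas.
# Hypothetical mapping of global GPU ranks to datasets:
dataset_gpu_mapping = {"coco": [0, 1, 2, 3], "ade20k": [4, 5], "idd": [6, 7]}
rank = 5  # this process's global GPU rank
dataset_name = next(d for d, gpus in dataset_gpu_mapping.items() if rank in gpus)
num_replicas = len(dataset_gpu_mapping[dataset_name])  # 2 replicas serve "ade20k"
dataset_rank = dataset_gpu_mapping[dataset_name].index(rank)  # rank within the dataset: 1
print(dataset_name, dataset_rank, compute_max_iters(1_000_000, 32, num_replicas))  # ade20k 1 15625

Because the divisor counts only the replicas assigned to this GPU's dataset, a dataset served by fewer GPUs runs more iterations (15625 vs. 3906 here), so every dataset still sees the same total number of training crops.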