
Commit

use WORLD_SIZE instead of device_count

Supports both the case where the number of GPUs we train on is smaller than the number of GPUs available, and also multinode training. May be a bugfix.
karpathy committed Jun 14, 2023
1 parent f08abb4 commit 7339b90
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions train.py
@@ -89,8 +89,10 @@
     torch.cuda.set_device(device)
     master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.
     seed_offset = ddp_rank # each process gets a different seed
-    assert gradient_accumulation_steps % torch.cuda.device_count() == 0
-    gradient_accumulation_steps //= torch.cuda.device_count()
+    # world_size number of processes will be training simultaneously, so we can scale
+    # down the desired gradient accumulation iterations per process proportionally
+    assert gradient_accumulation_steps % ddp_world_size == 0
+    gradient_accumulation_steps //= ddp_world_size
 else:
     # if not ddp, we are running on a single gpu, and one process
     master_process = True
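The divisor matters because gradient_accumulation_steps is treated as a global total that gets split evenly across training processes. torch.cuda.device_count() only reports the GPUs visible on the local machine, whereas the WORLD_SIZE environment variable set by the launcher (e.g. torchrun) counts every process in the job, so dividing by ddp_world_size stays correct both when training on fewer GPUs than the machine has and when training across multiple nodes. Below is a minimal sketch of the DDP setup this change slots into; variable names follow train.py, but the surrounding lines and the gradient_accumulation_steps value are illustrative approximations, not the file's exact contents.

import os
import torch
from torch.distributed import init_process_group

ddp = int(os.environ.get('RANK', -1)) != -1   # is this a ddp run launched via torchrun?
gradient_accumulation_steps = 40              # desired global accumulation steps (illustrative value)

if ddp:
    init_process_group(backend='nccl')
    ddp_rank = int(os.environ['RANK'])              # global rank across all nodes
    ddp_local_rank = int(os.environ['LOCAL_RANK'])  # rank within this node
    ddp_world_size = int(os.environ['WORLD_SIZE'])  # total number of processes, across all nodes
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    # WORLD_SIZE counts every training process; torch.cuda.device_count() would only
    # see the GPUs on this machine, which breaks multinode and partial-GPU runs
    assert gradient_accumulation_steps % ddp_world_size == 0
    gradient_accumulation_steps //= ddp_world_size
else:
    # single gpu, single process
    ddp_world_size = 1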
