Potential bug with identical augmentation across workers / epochs #233

Open
HamsterHuey opened this issue Jul 26, 2019 · 1 comment

@HamsterHuey

I just got bitten by this on a separate project and realized that it likely affects CenterNet as well. When we use PyTorch DataLoaders with num_workers > 1, PyTorch uses multiprocessing in the background to spawn the worker processes. The problem is related to NumPy and random seeds: all of the child processes end up with an identical NumPy random state. So if you have num_workers = 4, the first image that each of the four workers processes will receive exactly the same augmentation. What's worse, once an epoch finishes, the next time around all of the workers will again start from the same initial NumPy seed, so the same augmentations repeat across epochs. This is most easily explained by example, based on the discussion here: pytorch/pytorch#5059

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class FalseDataset(Dataset):
    def __init__(self, length):
        self.len = length

    def __len__(self):
        return self.len

    def __getitem__(self, ID):
        # With the default worker seeding, every worker draws the same NumPy value.
        r = np.random.randint(1, 10000)
        return [r, torch.initial_seed()]


def worker_init_fn(worker_id):
    # torch.initial_seed() is re-drawn every epoch; adding the worker id makes
    # the NumPy seed differ across workers within an epoch.
    np.random.seed(torch.initial_seed() // 2**32 + worker_id)


false_instance = FalseDataset(8)
train_loader = DataLoader(false_instance, shuffle=False, num_workers=4,
                          worker_init_fn=worker_init_fn)


def train_epoch(loader, epoch):
    for batch_idx, output in enumerate(loader):
        print(batch_idx, output)
    return "epoch" + str(epoch) + " ended"


for i in range(2):  # simulating epochs
    print(train_epoch(train_loader, i))

If this is run without the worker_init_fn argument (which is the default for PyTorch DataLoaders), we get the following output; in each row, the first tensor is the NumPy draw and the second is torch.initial_seed():

0 [tensor([5764]), tensor([7694063100383444054])]
1 [tensor([5764]), tensor([7694063100383444055])]
2 [tensor([5764]), tensor([7694063100383444056])]
3 [tensor([5764]), tensor([7694063100383444057])]
4 [tensor([3552]), tensor([7694063100383444054])]
5 [tensor([3552]), tensor([7694063100383444055])]
6 [tensor([3552]), tensor([7694063100383444056])]
7 [tensor([3552]), tensor([7694063100383444057])]
epoch0 ended
0 [tensor([5764]), tensor([2414831255868077781])]
1 [tensor([5764]), tensor([2414831255868077782])]
2 [tensor([5764]), tensor([2414831255868077783])]
3 [tensor([5764]), tensor([2414831255868077784])]
4 [tensor([3552]), tensor([2414831255868077781])]
5 [tensor([3552]), tensor([2414831255868077782])]
6 [tensor([3552]), tensor([2414831255868077783])]
7 [tensor([3552]), tensor([2414831255868077784])]
epoch1 ended

However, if we include the worker_init_fn argument, we make use of two things. We utilize the worker id that PyTorch passes to worker_init_fn (unique per worker) to differentiate the NumPy seed across workers within a given epoch. We also utilize torch.initial_seed(), which PyTorch re-randomizes each time the DataLoader is iterated (i.e., each epoch), so that the seeds stay randomized across epochs as well. This results in the following (desired) output:

0 [tensor([4897]), tensor([8149209332210546018])]
1 [tensor([2195]), tensor([8149209332210546019])]
2 [tensor([2363]), tensor([8149209332210546020])]
3 [tensor([431]), tensor([8149209332210546021])]
4 [tensor([4342]), tensor([8149209332210546018])]
5 [tensor([6209]), tensor([8149209332210546019])]
6 [tensor([5192]), tensor([8149209332210546020])]
7 [tensor([2202]), tensor([8149209332210546021])]
epoch0 ended
0 [tensor([4513]), tensor([7807720372388396883])]
1 [tensor([5422]), tensor([7807720372388396884])]
2 [tensor([2471]), tensor([7807720372388396885])]
3 [tensor([7137]), tensor([7807720372388396886])]
4 [tensor([4556]), tensor([7807720372388396883])]
5 [tensor([2131]), tensor([7807720372388396884])]
6 [tensor([3456]), tensor([7807720372388396885])]
7 [tensor([9496]), tensor([7807720372388396886])]
epoch1 ended
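
For reference, a commonly used variant of this helper also seeds Python's built-in random module, in case any augmentation code draws from it; the % 2**32 reduction and the random.seed call below are my additions rather than part of the example above:

import random
import numpy as np
import torch

def seed_worker(worker_id):
    # Inside a worker, torch.initial_seed() already differs per worker and is
    # re-drawn every epoch, so reducing it to the 32-bit range NumPy accepts
    # yields a seed that varies across both workers and epochs.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

Passing worker_init_fn=seed_worker to the DataLoader gives the same per-worker, per-epoch randomization as the worker_init_fn above.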

Unfortunately, I couldn't confirm this directly in this repo: the imports in the files only resolve via some specific entrypoints into the codebase, so I couldn't write a quick test script like the one above to iterate through the Dataset and check whether identical augmentations are applied across workers and across epochs. But this very likely affects this codebase, and you might get better performance if you modify the Datasets/DataLoaders to use the worker_init_fn approach and truly randomize the seeds during training.
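
To make that concrete, the change would look roughly like the sketch below; train_dataset, opt.batch_size, and opt.num_workers are placeholders for however the repo actually constructs its training dataset and options, since I couldn't verify the exact entrypoints:

import numpy as np
import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # Re-seed NumPy per worker and per epoch so augmentations actually differ;
    # torch.initial_seed() inside a worker already varies by worker and epoch.
    np.random.seed(torch.initial_seed() % 2**32)

train_loader = DataLoader(
    train_dataset,                # placeholder for the repo's training Dataset
    batch_size=opt.batch_size,    # placeholder options object
    shuffle=True,
    num_workers=opt.num_workers,  # placeholder options object
    worker_init_fn=worker_init_fn,
)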

@xingyizhou
Owner

Thank you for your report! We will investigate this later.
