System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction (minimal, reproducible, runnable)
I can't share the project code (its dataset is of type Dataset with a len of 56403), so I wrote another, simpler case that shows the same issue.
Command:
torchrun --nproc_per_node=2 issue.py
Code (issue.py)
When calling trainer.train(), we get the output:
"Num examples" should be 56403 (the length of the dataset), but with DDP it returns a different number: 28208 in this test.
"Num examples" is calculated by Trainer's num_examples() in https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1408 which is called by https://github.com/huggingface/optimum-neuron/blob/v0.0.24/optimum/neuron/trainers.py#L700 .
The issue doesn't occur when training without DDP: there, dataloader is <torch.utils.data.dataloader.DataLoader> and num_examples() returns the expected number.
With DDP, dataloader is <torch_xla.distributed.parallel_loader.MpDeviceLoader>, and dataloader.dataset raises AttributeError("'MpDeviceLoader' object has no attribute 'dataset'"). This makes num_examples() return an unexpected number at https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1420. However, dataloader._loader is a <torch.utils.data.dataloader.DataLoader>, and len(dataloader._loader.dataset) is 56403. Perhaps we should call self.num_examples(train_dataloader._loader)?
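The suggested unwrapping could look like this sketch (the loader classes and the helper name are mocks/hypothetical so the example is self-contained; the real change would live in the trainer code):

```python
# Mock of the situation: MpDeviceLoader keeps the wrapped DataLoader in
# ._loader but does not expose .dataset itself.

class FakeDataLoader:
    def __init__(self, dataset):
        self.dataset = dataset


class FakeMpDeviceLoader:
    def __init__(self, loader):
        self._loader = loader  # the wrapped torch DataLoader


def unwrap_for_num_examples(dataloader):
    # Hypothetical helper: unwrap ._loader before reading .dataset, so
    # num_examples() sees the real DataLoader instead of hitting the
    # AttributeError fallback.
    if not hasattr(dataloader, "dataset") and hasattr(dataloader, "_loader"):
        return dataloader._loader
    return dataloader


mp_loader = FakeMpDeviceLoader(FakeDataLoader(range(56403)))
print(len(unwrap_for_num_examples(mp_loader).dataset))  # 56403
```

The hasattr guard keeps the helper a no-op for loaders that already expose .dataset, so the non-DDP path is unaffected.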
Expected behavior
"Num examples" is not reported correctly when training with DDP on AWS Trainium/Inferentia instances. In the reproducible code, it should be 56403 (len of dataset), but it returns 28208 based on an exception occurring in num_examples() in the transformers package.
For additional reference, on an EC2 p4d instance (NVIDIA A100 GPUs), when using DDP with the Trainer from the transformers package, dataloader is <accelerate.data_loader.DataLoaderShard> and "Num examples" is reported as expected: 56403.