Training output reports incorrect num examples when using DDP #683

syl-taylor-aws · 2024-08-24T19:25:15Z

System Info

AWS EC2 instance: trn1.32xlarge
OS: Ubuntu 22.04.4 LTS

Platform:

- Platform: Linux-6.5.0-1023-aws-x86_64-with-glibc2.35
- Python version: 3.10.12

Python packages:

- `optimum-neuron` version: 0.0.24
- `neuron-sdk` version: 2.19.1
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.24.5
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.2335
- `neuronx-cc` version: 2.14.227.0+2d4f85be
- `neuronx-distributed` version: 0.8.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21

Neuron Driver:
aws-neuronx-collectives/unknown,now 2.21.46.0-69b77134b amd64 [installed]
aws-neuronx-dkms/unknown,now 2.17.17.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.21.41.0-fb1705f5f amd64 [installed]
aws-neuronx-tools/unknown,now 2.18.3.0 amd64 [installed]

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I can't share the project code, which uses a dataset of type Dataset with a length of 56403, so I wrote a simpler case that shows the same issue.

Command: torchrun --nproc_per_node=2 issue.py

Code (issue.py)

import torch
from transformers import RobertaForCausalLM
from optimum.neuron import NeuronTrainer as Trainer
from optimum.neuron import NeuronTrainingArguments as TrainingArguments


class CustomDataset(torch.utils.data.Dataset):
    def __getitem__(self, index):
        return {
            "input_ids": torch.randint(0, 50265, (512,)),
            "labels": torch.randint(0, 50265, (512,))
        }

    def __len__(self):
        return 56403

dataset = CustomDataset()

model = RobertaForCausalLM.from_pretrained("roberta-base")

training_args = TrainingArguments(output_dir="./model", max_steps=100)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

trainer.train()  # note the output line: "[INFO|trainers.py:<num>] <timestamp> >>   Num examples = <number>"
# the issue is at https://github.com/huggingface/optimum-neuron/blob/v0.0.24/optimum/neuron/trainers.py#L700
# currently "self.num_examples(train_dataloader)" = 28208
# should maybe be "self.num_examples(train_dataloader._loader)" = 56403 (expected)

When calling trainer.train(), we get the output:

[INFO|trainers.py:] <timestamp> >> ***** Running training *****
[INFO|trainers.py:] <timestamp> >>   Num examples = 28,208
...

Num examples should be 56403, but it returns a different number when using DDP, like 28208 (in this test).

"Num examples" is calculated by Trainer's num_examples() in https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1408 which is called by https://github.com/huggingface/optimum-neuron/blob/v0.0.24/optimum/neuron/trainers.py#L700 .

The issue doesn't occur when training without DDP. Without DDP, dataloader is <torch.utils.data.dataloader.DataLoader> and num_examples() returns expected number.

With DDP, the dataloader is a <torch_xla.distributed.parallel_loader.MpDeviceLoader>, and accessing dataloader.dataset raises AttributeError("'MpDeviceLoader' object has no attribute 'dataset'"). As a result, num_examples() returns an unexpected number from the code path at https://github.com/huggingface/transformers/blob/v4.41.1/src/transformers/trainer.py#L1420 . However, dataloader._loader is a <torch.utils.data.dataloader.DataLoader>, and len(dataloader._loader.dataset) is 56403. Perhaps we should call self.num_examples(train_dataloader._loader) instead?
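
A minimal sketch of that unwrapping idea, using stand-in classes instead of the real torch / torch_xla loaders so it runs anywhere. `unwrap_dataloader`, `FakeLoader`, and `FakeDeviceLoader` are hypothetical names for illustration only; they are not part of optimum-neuron, transformers, or torch_xla, and `_loader` is a private attribute that may change between torch_xla versions.

```python
class FakeLoader:
    """Stand-in for torch.utils.data.DataLoader: exposes a .dataset attribute."""

    def __init__(self, dataset):
        self.dataset = dataset


class FakeDeviceLoader:
    """Stand-in for MpDeviceLoader: keeps the wrapped loader in ._loader
    and has no .dataset attribute of its own."""

    def __init__(self, loader):
        self._loader = loader


def unwrap_dataloader(dataloader):
    """Hypothetical helper: follow ._loader until an object with .dataset is found."""
    while not hasattr(dataloader, "dataset") and hasattr(dataloader, "_loader"):
        dataloader = dataloader._loader
    return dataloader


# A range object plays the role of a map-style dataset with 56403 items.
wrapped = FakeDeviceLoader(FakeLoader(range(56403)))
inner = unwrap_dataloader(wrapped)
num_examples = len(inner.dataset)
print(num_examples)
```

Checking for `dataset` before following `_loader` means a plain DataLoader passes through unchanged, so the non-DDP case would keep its current behavior.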

Expected behavior

"Num examples" should be reported correctly when training with DDP on AWS Trainium/Inferentia instances. In the reproduction code it should be 56403 (the length of the dataset), but 28208 is reported because of an exception raised inside num_examples() in the transformers package.

For additional reference, on an EC2 p4d instance (Nvidia A100 GPUs), when using DDP with the Trainer from the transformers package, the dataloader is an <accelerate.data_loader.DataLoaderShard> and "Num examples" is reported as expected: 56403.
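
The 28208 reported on Trainium is consistent with a fallback estimate rather than a real dataset length. The sketch below reproduces the number under three assumptions that are not confirmed in this report: that the fallback in num_examples() computes len(dataloader) * per_device_train_batch_size, that the per-device batch size is the TrainingArguments default of 8, and that the dataset is split across the 2 processes with rounding up.

```python
import math

dataset_len = 56403        # len(dataset) from the reproduction script
world_size = 2             # torchrun --nproc_per_node=2
per_device_batch_size = 8  # assumed: TrainingArguments default, not overridden

# Assumed: each rank receives ceil(dataset_len / world_size) samples.
samples_per_rank = math.ceil(dataset_len / world_size)

# Batches seen by one rank, i.e. what len(dataloader) would be on that rank.
batches_per_rank = math.ceil(samples_per_rank / per_device_batch_size)

# Assumed fallback: len(dataloader) * per_device_train_batch_size.
fallback_estimate = batches_per_rank * per_device_batch_size

print(samples_per_rank, batches_per_rank, fallback_estimate)
```

Under these assumptions the estimate comes out to 28208, matching the logged value, which suggests the log line reflects a per-rank estimate rounded up to a whole batch.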

github-actions · 2024-10-14T11:40:13Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

syl-taylor-aws added the bug (Something isn't working) label Aug 24, 2024

github-actions bot added the Stale label Oct 14, 2024
