
Size mismatch while loading consolidated checkpoints trained with tensor parallelism for custom Llama model #734

Open · unography opened this issue Nov 9, 2024 · 1 comment
Labels: bug (Something isn't working)

unography commented Nov 9, 2024

System Info

SageMaker Docker images:

763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.1-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04
763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-training-neuronx:2.1.2-transformers4.41.1-neuronx-py310-sdk2.19.1-ubuntu20.04

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I'm using a custom Llama model (unography/llama-with-feats); it's a subclass of the original LlamaForCausalLM. Training runs on SageMaker with pipeline parallelism of 1 and tensor parallelism of 8, using the run_clm.py script from this repo. A sketch of the kind of subclass involved is below.
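
For illustration, the subclass is roughly of this shape (the class name and the extra feats argument are hypothetical; the actual code lives in unography/llama-with-feats):

from transformers import LlamaForCausalLM

class LlamaWithFeats(LlamaForCausalLM):
    # Hypothetical extra-feature input; everything else defers to the
    # stock Llama forward, so the parameter layout is unchanged.
    def forward(self, input_ids=None, feats=None, **kwargs):
        return super().forward(input_ids=input_ids, **kwargs)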

Checkpoints are consolidated with:

from optimum.neuron.distributed.checkpointing import consolidate_model_parallel_checkpoints_to_unified_checkpoint

# Merge the per-rank tensor-parallel shards found in output_dir into a
# single unified checkpoint, written back to the same directory.
consolidate_model_parallel_checkpoints_to_unified_checkpoint(
    training_args.output_dir, training_args.output_dir
)

With the PyTorch 1.13 image, the tensor_parallel_shards directory contains files like:

tp_rank_00_pp_rank_00
tp_rank_00_pp_rank_00.tensors/
...
tp_rank_07_pp_rank_00
tp_rank_07_pp_rank_00.tensors/

With the PyTorch 2.1 image, the files are:

shards/
..model/
....dp_rank_00_tp_rank_00_pp_rank_00.pt
....dp_rank_00_tp_rank_00_pp_rank_00.pt.info.pt
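
A quick way to inspect one of the PyTorch 2 shards is a sketch like the following; it assumes the .pt file deserializes to a (possibly nested) state dict, which I haven't verified against the actual on-disk format:

import torch

# Load a single TP-rank shard on CPU and print the parameter shapes.
# The file path is taken from the listing above; the dict layout is assumed.
shard = torch.load(
    "shards/model/dp_rank_00_tp_rank_00_pp_rank_00.pt", map_location="cpu"
)
state_dict = shard.get("model", shard) if isinstance(shard, dict) else shard
for name, tensor in state_dict.items():
    if hasattr(tensor, "shape"):
        print(name, tuple(tensor.shape))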

In both cases the consolidation succeeds and produces the unified model.safetensors file.

But when loading the consolidated checkpoint, I get errors like:

	size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
	size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
	size mismatch for model.layers.0.self_attn.o_proj.weight: copying a param with shape torch.Size([2048, 256]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
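
The checkpoint shapes are exactly one eighth of the expected ones along the partitioned dimension (256 = 2048 / 8 for q_proj, 64 = 512 / 8 for k_proj and v_proj), so the consolidated file appears to hold a single TP rank's shard rather than the merged weights. For context, loading is a standard from_pretrained call along these lines (the exact call isn't shown above, so treat this as an assumed sketch):

from transformers import AutoModelForCausalLM

# Assumed loading code: output_dir is the directory holding the
# consolidated model.safetensors; trust_remote_code pulls in the
# custom LlamaForCausalLM subclass.
model = AutoModelForCausalLM.from_pretrained(
    training_args.output_dir,
    trust_remote_code=True,
)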

Expected behavior

The model loads successfully after consolidating the checkpoints.

michaelbenayoun (Member) commented:

Can you try on the main branch, please?
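
Trying main here means installing optimum-neuron from source, e.g. via the standard pip-from-git install:

pip install git+https://github.com/huggingface/optimum-neuron.git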
