
Size mismatch while loading consolidated checkpoints trained with tensor parallelism for custom Llama model #734

Open · unography opened this issue Nov 9, 2024 · 1 comment
Labels: bug (Something isn't working)

unography commented Nov 9, 2024

System Info

SageMaker Docker images:

763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.1-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04
763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-training-neuronx:2.1.2-transformers4.41.1-neuronx-py310-sdk2.19.1-ubuntu20.04

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I'm using a custom Llama model (unography/llama-with-feats); it's a subclass of the original LlamaForCausalLM. Training runs on SageMaker with pipeline parallelism of 1 and tensor parallelism of 8, using the run_clm.py script from this repo. A sketch of the kind of subclass involved is below.
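
For illustration, the subclass is roughly of this shape (the class name and the extra feats argument are hypothetical; the actual code lives in unography/llama-with-feats):

from transformers import LlamaForCausalLM

class LlamaWithFeats(LlamaForCausalLM):
    # Hypothetical extra-feature input; everything else defers to the
    # stock Llama forward, so the parameter layout is unchanged.
    def forward(self, input_ids=None, feats=None, **kwargs):
        return super().forward(input_ids=input_ids, **kwargs)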

Checkpoints are consolidated with:

from optimum.neuron.distributed.checkpointing import consolidate_model_parallel_checkpoints_to_unified_checkpoint

# Merge the per-rank tensor-parallel shards found in output_dir into a
# single unified checkpoint, written back to the same directory.
consolidate_model_parallel_checkpoints_to_unified_checkpoint(
    training_args.output_dir, training_args.output_dir
)

With the PyTorch 1.13 image, the tensor_parallel_shards directory contains files like:

tp_rank_00_pp_rank_00
tp_rank_00_pp_rank_00.tensors/
...
tp_rank_07_pp_rank_00
tp_rank_07_pp_rank_00.tensors/

With the PyTorch 2.1 image, the files are:

shards/
..model/
....dp_rank_00_tp_rank_00_pp_rank_00.pt
....dp_rank_00_tp_rank_00_pp_rank_00.pt.info.pt
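
A quick way to inspect one of the PyTorch 2 shards is a sketch like the following; it assumes the .pt file deserializes to a (possibly nested) state dict, which I haven't verified against the actual on-disk format:

import torch

# Load a single TP-rank shard on CPU and print the parameter shapes.
# The file path is taken from the listing above; the dict layout is assumed.
shard = torch.load(
    "shards/model/dp_rank_00_tp_rank_00_pp_rank_00.pt", map_location="cpu"
)
state_dict = shard.get("model", shard) if isinstance(shard, dict) else shard
for name, tensor in state_dict.items():
    if hasattr(tensor, "shape"):
        print(name, tuple(tensor.shape))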

In both cases the consolidation succeeds and produces the unified model.safetensors file.

But when loading the consolidated checkpoint, I get errors like:

	size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
	size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
	size mismatch for model.layers.0.self_attn.o_proj.weight: copying a param with shape torch.Size([2048, 256]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
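
The checkpoint shapes are exactly one eighth of the expected ones along the partitioned dimension (256 = 2048 / 8 for q_proj, 64 = 512 / 8 for k_proj and v_proj), so the consolidated file appears to hold a single TP rank's shard rather than the merged weights. For context, loading is a standard from_pretrained call along these lines (the exact call isn't shown above, so treat this as an assumed sketch):

from transformers import AutoModelForCausalLM

# Assumed loading code: output_dir is the directory holding the
# consolidated model.safetensors; trust_remote_code pulls in the
# custom LlamaForCausalLM subclass.
model = AutoModelForCausalLM.from_pretrained(
    training_args.output_dir,
    trust_remote_code=True,
)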

Expected behavior

The model loads successfully after consolidating the checkpoints.

michaelbenayoun (Member) commented:

Can you try on the main branch, please?
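
Trying main here means installing optimum-neuron from source, e.g. via the standard pip-from-git install:

pip install git+https://github.com/huggingface/optimum-neuron.git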
