An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
I'm using a custom Llama model, unography/llama-with-feats, which is a subclass of the original LlamaForCausalLM.
The training is run on SageMaker with pipeline parallelism of 1 and tensor parallelism of 8, using the run_clm.py script present in this repo.
Checkpoints are consolidated with -
from optimum.neuron.distributed.checkpointing import (
    consolidate_model_parallel_checkpoints_to_unified_checkpoint,
)

consolidate_model_parallel_checkpoints_to_unified_checkpoint(
    training_args.output_dir, training_args.output_dir
)
With the PyTorch 1 image, the tensor_parallel_shards directory has files like -
With the PyTorch 2 image, the files are -
In both cases it is able to create the consolidated model.safetensors file.
But while loading that file, I get errors like -
size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
size mismatch for model.layers.0.self_attn.o_proj.weight: copying a param with shape torch.Size([2048, 256]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
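A quick sanity check on those shapes, as a rough sketch rather than a diagnosis: each checkpoint tensor differs from the model tensor in exactly one dimension, and that dimension is the model's size divided by 8, which matches the tensor parallelism used for training. The shapes below are copied from the errors above; the helper name `sharded_dim` is made up for illustration.

```python
# Shapes copied from the error messages: name -> (checkpoint shape, model shape).
MISMATCHES = {
    "q_proj": ((256, 2048), (2048, 2048)),
    "k_proj": ((64, 2048), (512, 2048)),
    "v_proj": ((64, 2048), (512, 2048)),
    "o_proj": ((2048, 256), (2048, 2048)),
}
TP_SIZE = 8  # tensor parallelism used for training


def sharded_dim(ckpt_shape, model_shape, tp_size):
    """Return the index of the one dim where ckpt == model / tp_size, else None."""
    if len(ckpt_shape) != len(model_shape):
        return None
    hits = []
    for i, (c, m) in enumerate(zip(ckpt_shape, model_shape)):
        if c == m:
            continue  # this dimension already matches
        if c * tp_size == m:
            hits.append(i)  # looks like a 1/tp_size slice of the full dim
        else:
            return None  # mismatch that a TP shard would not explain
    return hits[0] if len(hits) == 1 else None


report = {name: sharded_dim(c, m, TP_SIZE) for name, (c, m) in MISMATCHES.items()}
for name, dim in report.items():
    print(f"{name}: consistent with a 1/{TP_SIZE} shard along dim {dim}")
```

Under this check q_proj, k_proj and v_proj look split along dim 0 and o_proj along dim 1. That would be consistent with the consolidated file holding a single rank's shard instead of the merged weights, but this is only an inference from the shapes.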
Expected behavior
Model loads after consolidating checkpoints
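To check the consolidated file before trying to load it, here is a minimal sketch that reads only the safetensors header (an 8-byte little-endian length followed by a JSON header) and compares tensor shapes against the ones the model expects. The function names and the path in the usage comment are hypothetical, and the expected shapes are the ones reported in the errors above.

```python
import json
import struct


def read_safetensors_shapes(path):
    """Return {tensor_name: shape} from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64 little-endian header size
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        name: tuple(meta["shape"])
        for name, meta in header.items()
        if name != "__metadata__"
    }


def find_shape_mismatches(actual, expected):
    """Return {name: (actual_shape, expected_shape)} for tensors whose shapes differ."""
    return {
        name: (actual[name], shape)
        for name, shape in expected.items()
        if name in actual and actual[name] != shape
    }


# Example usage (hypothetical path; expected shapes taken from the error messages):
# expected = {
#     "model.layers.0.self_attn.q_proj.weight": (2048, 2048),
#     "model.layers.0.self_attn.k_proj.weight": (512, 2048),
#     "model.layers.0.self_attn.v_proj.weight": (512, 2048),
#     "model.layers.0.self_attn.o_proj.weight": (2048, 2048),
# }
# shapes = read_safetensors_shapes("output_dir/model.safetensors")
# print(find_shape_mismatches(shapes, expected))
```

An empty result would mean the listed tensors already have the full, unsharded shapes; any entries returned point at tensors that were not merged as expected.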
System Info
Sagemaker Docker images:
Who can help?
@michaelbenayoun