Neuron consolidate leads to size mismatch. #368

Closed
philschmid opened this issue Dec 7, 2023 · 2 comments
Labels: bug (Something isn't working)
@philschmid (Member)

When trying to load a model after training that was consolidated with optimum-cli neuron consolidate dolly_llama/tensor_parallel_shards dolly_llama, you get a tensor size mismatch error.
I tried to fine-tune Llama on the Dolly dataset, and the training succeeds with TP=8. Afterwards I consolidated the weights and tried to load the model to make sure it learned the Dolly format. For this I tried to load the model with the NeuronModelForCausalLM and AutoModelForCausalLM classes, e.g.

from transformers import AutoTokenizer, AutoModelForCausalLM

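# load the tokenizer and the consolidated checkpoint from the output directory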
tokenizer = AutoTokenizer.from_pretrained("dolly_llama")
model = AutoModelForCausalLM.from_pretrained("dolly_llama")

This leads to the error:

Error(s) in loading state_dict for LlamaForCausalLM:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32000, 4096]) from checkpoint, the shape in current model is torch.Size([4000, 4096]).
	size mismatch for lm_head.weight: copying a param with shape torch.Size([32000, 4096]) from checkpoint, the shape in current model is torch.Size([4000, 4096]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

This suggests there is an error when consolidating the weights from the sharded ones, since 32000/8 = 4000.
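For reference, a minimal sketch to check the consolidated tensors directly (the pytorch_model.bin filename is an assumption; adjust it if the consolidation writes safetensors or sharded files instead):

import torch

# Inspect the consolidated checkpoint directly. The filename below is an
# assumption (a single pytorch_model.bin in the output directory).
state_dict = torch.load("dolly_llama/pytorch_model.bin", map_location="cpu")

# If consolidation worked, both tensors should cover the full vocabulary
# (32000 rows), not 32000 / tp_degree = 4000.
print(state_dict["model.embed_tokens.weight"].shape)
print(state_dict["lm_head.weight"].shape)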

philschmid added the bug label on Dec 7, 2023
@5cp (Contributor) commented Dec 7, 2023

Hi @philschmid

I was able to reproduce this. The vocab_size in the fine-tuned model's config.json is incorrectly set to 4000 and is not adjusted during consolidation. I manually changed this to 32000 and was able to load the checkpoint.
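
A minimal sketch of that manual workaround, assuming the consolidated output directory keeps the standard Hugging Face config.json layout (the dolly_llama path is taken from the example above):

import json

# Sketch of the manual fix: restore the full vocab_size in the consolidated
# model's config.json (consolidation leaves the per-shard value, 32000 / 8 = 4000).
config_path = "dolly_llama/config.json"

with open(config_path) as f:
    config = json.load(f)

config["vocab_size"] = 32000

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)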

@michaelbenayoun (Member)

Closing since I think this is solved in #378
