Neuron consolidate leads to size mismatch. #368

Closed
philschmid opened this issue Dec 7, 2023 · 2 comments
Labels: bug (Something isn't working)
@philschmid (Member)

When trying to load a model after training that was consolidated with optimum-cli neuron consolidate dolly_llama/tensor_parallel_shards dolly_llama, you get a tensor size mismatch error.
I tried to fine-tune Llama on the Dolly dataset, and the training succeeds with TP=8. Afterwards I consolidated the weights and tried to load the model to make sure it learned the Dolly format. For this I tried to load the model with the NeuronModelForCausalLM and AutoModelForCausalLM classes, e.g.

from transformers import AutoTokenizer, AutoModelForCausalLM

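# load the tokenizer and the consolidated checkpoint from the output directory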
tokenizer = AutoTokenizer.from_pretrained("dolly_llama")
model = AutoModelForCausalLM.from_pretrained("dolly_llama")

This leads to the error:

Error(s) in loading state_dict for LlamaForCausalLM:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32000, 4096]) from checkpoint, the shape in current model is torch.Size([4000, 4096]).
	size mismatch for lm_head.weight: copying a param with shape torch.Size([32000, 4096]) from checkpoint, the shape in current model is torch.Size([4000, 4096]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

This suggests there is an error when consolidating the weights from the sharded ones, since 32000/8 = 4000.
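For reference, a minimal sketch to check the consolidated tensors directly (the pytorch_model.bin filename is an assumption; adjust it if the consolidation writes safetensors or sharded files instead):

import torch

# Inspect the consolidated checkpoint directly. The filename below is an
# assumption (a single pytorch_model.bin in the output directory).
state_dict = torch.load("dolly_llama/pytorch_model.bin", map_location="cpu")

# If consolidation worked, both tensors should cover the full vocabulary
# (32000 rows), not 32000 / tp_degree = 4000.
print(state_dict["model.embed_tokens.weight"].shape)
print(state_dict["lm_head.weight"].shape)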

philschmid added the bug label on Dec 7, 2023
@5cp (Contributor) commented Dec 7, 2023

Hi @philschmid

I was able to reproduce this. The vocab_size in the fine-tuned model's config.json is incorrectly set to 4000 and is not adjusted during consolidation. I manually changed this to 32000 and was able to load the checkpoint.
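
A minimal sketch of that manual workaround, assuming the consolidated output directory keeps the standard Hugging Face config.json layout (the dolly_llama path is taken from the example above):

import json

# Sketch of the manual fix: restore the full vocab_size in the consolidated
# model's config.json (consolidation leaves the per-shard value, 32000 / 8 = 4000).
config_path = "dolly_llama/config.json"

with open(config_path) as f:
    config = json.load(f)

config["vocab_size"] = 32000

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)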

@michaelbenayoun (Member)

Closing since I think this is solved in #378
