Cannot convert to pytorch using huggingface #178

Open
designgrande opened this issue Mar 10, 2020 · 6 comments

Comments

@designgrande

Huggingface requires "/path/to/gpt2/pretrained/weights" and I just don't understand what path I should enter here. It's not like I haven't tried everything: the checkpoint folder, the folder inside it, and even the checkpoint file itself. But it just doesn't work.

!export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/gpt2/pretrained/weights

!transformers-cli convert --model_type gpt2 \
  --tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH \
  --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
  [--config OPENAI_GPT2_CONFIG] \
  [--finetuning_task_name OPENAI_GPT2_FINETUNED_TASK]
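
For reference, !export does not persist between Colab cells (each ! line runs in its own shell), so it's simpler to substitute the paths directly. A fully expanded command might look something like this, assuming gpt-2-simple's default checkpoint/run1 layout; the run name and the output folder name are placeholders:

!transformers-cli convert --model_type gpt2 \
  --tf_checkpoint checkpoint/run1 \
  --pytorch_dump_output gpt2_pytorch \
  --config checkpoint/run1/hparams.json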

Any help will be greatly appreciated.

Thanks!

@apeguero1

apeguero1 commented Mar 14, 2020

Not sure how it works in the CLI, but in Python you could do:

import transformers

config = transformers.GPT2Config.from_pretrained('gpt2')
model = transformers.GPT2Model.from_pretrained(index_path, from_tf=True, config=config)

where index_path is the TensorFlow checkpoint index file, so something like "checkpoint/run1/model-XXX.index".

Docs on loading pretrained models
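
Put together as a runnable cell, with the import and a save step, the same idea looks something like this; checkpoint/run1/model-XXX.index stands for your actual checkpoint, and gpt2_pytorch is just an assumed output folder name:

import transformers

config = transformers.GPT2Config.from_pretrained('gpt2')
model = transformers.GPT2Model.from_pretrained(
    'checkpoint/run1/model-XXX.index',  # placeholder: your actual .index file
    from_tf=True,
    config=config,
)

# Save in Hugging Face format so later loads are plain from_pretrained calls.
model.save_pretrained('gpt2_pytorch')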

@bablf

bablf commented Mar 17, 2021

I managed to load the model, thanks to your advice.
But I was struggling to load the Tokenizer.
This seems to do the trick:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer("checkpoint/run1/encoder.json", "checkpoint/run1/vocab.bpe")
print(tokenizer("Bxe3")['input_ids'])

but

tokenizer2 = GPT2Tokenizer.from_pretrained('gpt2')
print(tokenizer2("Bxe3")['input_ids'])

Both output the same input_ids, [33, 27705, 18], which made me suspect something was wrong. But I guess it's actually expected, since I have a finetuned model and finetuning does not change the tokenizer, as far as I am aware.

Anyways maybe this helps someone.
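
For what it's worth, the tokenizer can be saved next to the converted model so that everything loads from a single directory later; a small sketch, assuming the same checkpoint/run1 layout and the gpt2_pytorch output folder from above:

from transformers import GPT2Tokenizer

# gpt-2-simple copies the stock GPT-2 encoder.json/vocab.bpe into each run
# folder, which is why the ids match GPT2Tokenizer.from_pretrained('gpt2').
tokenizer = GPT2Tokenizer("checkpoint/run1/encoder.json", "checkpoint/run1/vocab.bpe")
tokenizer.save_pretrained('gpt2_pytorch')  # same directory as the converted model
print(tokenizer("Bxe3")['input_ids'])  # [33, 27705, 18]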

@enochlev

> Not sure how it works in the CLI, but in Python you could do:
>
> config = transformers.GPT2Config.from_pretrained('gpt2')
> model = transformers.GPT2Model.from_pretrained(index_path, from_tf=True, config=config)
>
> where index_path is the TensorFlow checkpoint index file, so something like "checkpoint/run1/model-XXX.index".
>
> Docs on loading pretrained models

For those of you who seem to get a tensor error, make sure you load the correct config:

import transformers

config = transformers.GPT2Config.from_pretrained('/content/checkpoint/run1/hparams.json')
model = transformers.GPT2Model.from_pretrained(index_path, from_tf=True, config=config)
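
A quick way to see whether the shapes will line up is to compare the run's hparams.json with the stock gpt2 config before converting; a sketch, assuming the /content/checkpoint/run1 layout from above:

import json
import transformers

# The finetuned run's sizes, as written by gpt-2-simple.
with open('/content/checkpoint/run1/hparams.json') as f:
    print(json.load(f))  # includes n_embd / n_head / n_layer

# The stock gpt2 config: 768 / 12 / 12 for the 124M model.
stock = transformers.GPT2Config.from_pretrained('gpt2')
print(stock.n_embd, stock.n_head, stock.n_layer)

If those numbers differ (e.g. you finetuned the 355M model), loading with the stock config is exactly what produces the tensor shape errors.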

@aenriquez27

Hello! Has this been figured out? I tried to follow the same approach from here with this code:

import transformers

config = transformers.GPT2Config.from_pretrained('/content/hparams.json')
tokenizer = transformers.GPT2Tokenizer("/content/encoder.json", "/content/vocab.bpe")
model = transformers.GPT2Model.from_pretrained('/content/model.index', from_tf=True, config=config)

But I'm seeing this error:

OpError: /content/model.data-00000-of-00001; No such file or directory

Has anybody seen this and know how to fix it?

Thanks!

@enochlev

enochlev commented Jan 29, 2024

Typically, GPT2Model.from_pretrained should point to a folder rather than a specific file. Try:

model = transformers.GPT2Model.from_pretrained('/content/model.index', from_tf=True, config=config)

If that doesn't work, print your file structure in your content directory so we can debug.

Windows:

Get-ChildItem -Directory -Recurse -Depth 2 | ForEach-Object {Write-Host $_.FullName}

Linux:

tree -L 2

Or share the code you used to download the model.
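
The "No such file or directory" error above is a hint here: a TF1 checkpoint is really three files that must sit together, and from_tf=True reads the .data file next to the .index you point at. A sketch for listing what's actually present, assuming everything was uploaded under /content:

import os

# Expect model-XXX.index, model-XXX.meta and model-XXX.data-00000-of-00001
# side by side, plus hparams.json, encoder.json and vocab.bpe from gpt-2-simple.
for root, _, files in os.walk('/content'):
    for name in files:
        path = os.path.join(root, name)
        print(path, os.path.getsize(path), 'bytes')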

@aenriquez27

> Typically, GPT2Model.from_pretrained should point to a folder rather than a specific file. Try:
>
> model = transformers.GPT2Model.from_pretrained('/content/model.index', from_tf=True, config=config)
>
> If that doesn't work, print your file structure in your content directory so we can debug.
>
> Windows: Get-ChildItem -Directory -Recurse -Depth 2 | ForEach-Object {Write-Host $_.FullName}
>
> Linux: tree -L 2
>
> Or share the code you used to download the model.

Doing the above solved the issue from before, but raised a new one: "IndexError: Read fewer bytes than requested".

I can't seem to find much information about this error, just a couple of Stack Overflow discussions that mention either insufficient memory or corrupted weight files. I didn't see Colab crash, and I'm unsure how to check for corrupted weight files. I could retrain and try with those files instead.

Have you seen that issue before?

As to how I downloaded the model, I just used what was provided in the gpt2-simple colab.

gpt2.copy_checkpoint_to_gdrive(run_name='run1')

Followed by downloading it to my local computer. So there is a chance something got corrupted along the way, but I figured I'd ask anyway, just in case.

Many thanks for all the help so far!
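
"Read fewer bytes than requested" usually means the .data file is truncated, which would fit a partial download. One way to check the local copy without retraining is to ask TensorFlow to read the checkpoint directly; if this sketch runs without raising, the weights are intact (the path assumes the usual checkpoint/run1 layout):

import tensorflow as tf

# load_checkpoint accepts a checkpoint directory or a model-XXX prefix.
reader = tf.train.load_checkpoint('/content/checkpoint/run1')
shapes = reader.get_variable_to_shape_map()
print(len(shapes), 'variables found')

Comparing the local file sizes against the copies still on Drive is another quick way to spot a truncated download.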
