After an epoch finishes training, an error is reported when the model is saved and the next epoch starts. #4

Open
hangzeli05 opened this issue Jan 3, 2024 · 1 comment

Comments

@hangzeli05

[Two images attached] This error occurs whenever I run run_stage1.sh or run_stage2.sh. It still occurs even after changing save_strategy to steps.
@waxnkw
Collaborator

waxnkw commented Jan 4, 2024

It seems to be a problem with torch and FSDP. I guess that changing your torch version will help. My torch version is "2.1.1+cu118". You could try torch==2.1.1.
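
To check whether your environment already matches the version mentioned above, something like the following may help. It is only a sketch: the helper names (`release_part`, `matches_suggested`, `installed_torch_version`) are made up for illustration and are not part of this repository, and it assumes torch was installed with pip so that its version can be read from the package metadata.

```python
from importlib import metadata

# Version reported as working in the comment above ("2.1.1+cu118").
SUGGESTED = "2.1.1"


def release_part(version: str) -> str:
    """Drop the local build tag, e.g. '2.1.1+cu118' -> '2.1.1'."""
    return version.split("+", 1)[0]


def matches_suggested(version: str, suggested: str = SUGGESTED) -> bool:
    """True if the release part of `version` equals the suggested release."""
    return release_part(version) == suggested


def installed_torch_version():
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_torch_version()
    if found is None:
        print("torch is not installed in this environment")
    elif matches_suggested(found):
        print(f"torch {found} matches the suggested {SUGGESTED}")
    else:
        print(f"torch {found} differs from the suggested {SUGGESTED}")
```

If the versions differ, `pip install torch==2.1.1` is the usual way to pin it; pick the CUDA build that fits your setup.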
