worker process died #25
Could you please specify where exactly the error occurred? For example, was it during the rollout, making experience, gradient descent, or broadcasting phase?
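If it is not obvious from the console output, a rough first check is to grep the Ray worker logs and the kernel log for OOM kills. This is just a sketch, assuming the workers are Ray actors and Ray is using its default log directory:

```python
import glob
import os
import subprocess

# Assumes Ray's default session directory; adjust if a custom --temp-dir is used.
LOG_DIR = "/tmp/ray/session_latest/logs"

# Scan worker/raylet logs for lines that mention a death, an error, or a kill.
for path in sorted(glob.glob(os.path.join(LOG_DIR, "*.err")) +
                   glob.glob(os.path.join(LOG_DIR, "raylet*.out"))):
    with open(path, errors="ignore") as f:
        hits = [line for line in f if "died" in line or "Error" in line or "Killed" in line]
    if hits:
        print(f"=== {path} ===")
        print("".join(hits[-10:]))

# OOM kills by the kernel show up in dmesg (may require root on some machines).
dmesg = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in dmesg.splitlines():
    if "Out of memory" in line or "oom-kill" in line:
        print(line)
```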
It happens during making experience, and it's pretty random: sometimes it happens after the 1st step, sometimes after 10 global steps (where I use vllm_engine=8 rather than 16).
And I run it on 32 A100 GPUs; I don't change too much of the script:
Me too.
I just rerun the experiment; it will start from the latest checkpoint.
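To avoid babysitting it, the relaunch can be wrapped in a small retry loop. A minimal sketch, assuming the trainer picks the latest checkpoint back up on restart (the launch script name below is a placeholder):

```python
import subprocess
import time

# Placeholder launch command; replace with the actual training script and flags.
TRAIN_CMD = ["bash", "train_ppo_ray.sh"]
MAX_RETRIES = 5

for attempt in range(1, MAX_RETRIES + 1):
    result = subprocess.run(TRAIN_CMD)
    if result.returncode == 0:
        break  # training finished cleanly
    # A dead worker kills the whole job; relaunching resumes from the latest checkpoint.
    print(f"Run exited with code {result.returncode}, retrying ({attempt}/{MAX_RETRIES})")
    time.sleep(60)  # give the cluster a moment to release GPUs before relaunching
```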
make_experience: 2%|▏ | 50/2048 [02:49<1:32:39, 2.78s/it]
Same problem. I run it on 4 nodes with 32 GPUs.
Pretty much the same issue. When I train Qwen-math-base on the MATH level 3-5 dataset with 5 nodes of H20s, it works well, but when I change the checkpoint (base+sft) or the dataset (Omni), it just randomly dies (sometimes at step 1, sometimes at step 10).
And this does not work.
Thanks for your cool work!
When trying to run experiments on 5 nodes with 40 A100s, I face this error after the first training epoch. What may be the potential reason for this?