How to retrieve a running job and cancel it #1766

Open
mzhang2FW opened this issue Jun 14, 2024 · 4 comments

Comments

@mzhang2FW

Since SLURM is not installed at our supercomputing center, I cannot use commands like scancel. I can use submitit to run a job and monitor its progress from the .out/.err files, but what command should I use to cancel a wrong job given its job id? Thanks.

@ApiaoSamaa

ApiaoSamaa commented Aug 26, 2024

Similar issue. I'm using LocalExecutor. When I start a submitit task I get a job id, but I'm not sure how to use it to stop the submitit job.

Since the jobs command doesn't print any information about running submitit jobs, my temporary workaround is to use nvitop to find the process ids run by submitit.core._submit, then kill them manually with kill -9 <PID>. I'm wondering whether there is a cleaner way to handle this.
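The manual lookup described above could be scripted roughly as follows. This is only a sketch: it assumes Linux (it reads /proc), and the helper name find_pids_by_cmdline is hypothetical, not part of submitit. The marker string comes from the comment above.

```python
import os


def find_pids_by_cmdline(marker: str) -> list[int]:
    """Return PIDs whose command line contains `marker` (Linux only: reads /proc)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                # Arguments are NUL-separated in /proc/<pid>/cmdline.
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # the process exited or is not readable
        if marker in cmdline:
            pids.append(int(entry))
    return sorted(pids)
```

For example, find_pids_by_cmdline("submitit.core._submit") should list the candidate processes without needing nvitop, provided their command lines actually contain that string.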

@baldassarreFe
Contributor

The local executor uses the process PID as the job ID. You should be able to send a termination signal with kill <JOB_ID>.

Look for os.getpid() and job_id in https://github.com/facebookincubator/submitit/blob/main/submitit/local/local.py

Two more things:

  • jobs is a shell command that prints the processes running in the background of that shell, e.g. if you do my_program & or if you type CTRL+Z in a foreground process. In other words, jobs has nothing to do with submitit.
  • kill -9 is pretty harsh, and you may get issues if the processes don't release GPU resources properly. If you don't specify -9, kill sends the default SIGTERM, which tells the process to terminate nicely if possible.
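The same kill <JOB_ID> step can be done from Python with the standard library. A minimal sketch, relying on the statement above that the local executor's job id is the process PID; request_stop is a hypothetical helper name, not a submitit function.

```python
import os
import signal


def request_stop(job_id: str) -> None:
    """Send SIGTERM to a local job, treating its job id as a PID."""
    # Assumption: job_id is the PID of the local job's process.
    os.kill(int(job_id), signal.SIGTERM)
```

os.kill raises ProcessLookupError if no process with that PID exists, which is a quick way to notice a stale job id.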

@ApiaoSamaa

Thank you so much for your reply! It's very helpful. There is one minor issue, though: when I use kill <PID>, the processes are not killed, because submitit bypasses SIGTERM. The output is below. I think this might be related to issue #1677:

[11:32:41.570438] Epoch: [0]  [  120/10009]  eta: 2:14:34  lr: 0.000000  loss: 2.0240 (2.1143)  time: 0.7982  data: 0.0001  max mem: 10676
submitit WARNING (2024-09-26 11:32:42,332) - Bypassing signal SIGTERM
submitit WARNING (2024-09-26 11:32:45,545) - Bypassing signal SIGTERM
[11:32:57.500890] Epoch: [0]  [  140/10009]  eta: 2:13:50  lr: 0.000000  loss: 1.9500 (2.0967)  time: 0.7965  data: 0.0001  max mem: 10676

@baldassarreFe
Contributor

Hello, if you are sure that you will always use the local executor (not SLURM), you can reset the SIGTERM handler to the default, replacing the bypass handler that submitit configures:

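import signal

# Restore the default SIGTERM action (terminate the process),
# replacing the bypass handler configured by submitit.
signal.signal(signal.SIGTERM, signal.SIG_DFL)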
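To check that the reset took effect, the call can be wrapped in a small helper. This is a sketch using only the standard library; bypass_handler is a stand-in that imitates the "Bypassing signal SIGTERM" log line above and is not submitit's actual code. Presumably the reset has to run inside the submitted function, after submitit has installed its handler, but that placement is an assumption the thread does not confirm.

```python
import signal


def bypass_handler(signum, frame):
    # Stand-in for a handler that logs and ignores the signal (not submitit's code).
    print(f"Bypassing signal {signal.Signals(signum).name}", flush=True)


def restore_default_sigterm():
    """Reset SIGTERM to its default action and return the handler it replaced."""
    # signal.signal() returns the previously installed handler.
    # It must be called from the main thread of the process.
    return signal.signal(signal.SIGTERM, signal.SIG_DFL)
```

Afterwards, signal.getsignal(signal.SIGTERM) returns signal.SIG_DFL, so a plain kill <PID> terminates the process again.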

Alternatively, you can mimic the signal sequence that SLURM would send, which is a SIGTERM followed by a SIGKILL after a small delay. In bash you can use:

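# Usage: my_scancel <PID>
function my_scancel {
    kill "$1"                  # polite request: SIGTERM
    sleep 10                   # grace period before escalating
    kill -9 "$1" 2>/dev/null   # SIGKILL; the error is silenced if the process already exited
}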
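The same SIGTERM-then-SIGKILL sequence can be sketched in Python for use from a script. The name terminate_then_kill and the 10-second default are illustrative assumptions that mirror the bash function above. Unlike the bash version, this one stops waiting as soon as the PID disappears.

```python
import os
import signal
import time


def terminate_then_kill(pid: int, grace_seconds: float = 10.0) -> bool:
    """Send SIGTERM, wait up to grace_seconds, then send SIGKILL if the PID still exists.

    Returns True if SIGKILL was sent, False if the PID vanished first.
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 sends nothing; it only checks that the PID exists
        except ProcessLookupError:
            return False
        time.sleep(0.1)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False
    return True
```

One caveat: a child of the calling process that has exited but has not been waited on still counts as existing, so in that case the function waits the full grace period and sends a harmless SIGKILL.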
