
Add missing "is_alive" checks before acquiring the GIL in destructors. #894

Open: wants to merge 4 commits into master

@smurfix (Contributor) commented Jan 30, 2025

cf. #891

Review comments on src/nb_ndarray.cpp (two threads) and docs/api_extra.rst were marked outdated and resolved.
@wjakob (Owner) commented Jan 30, 2025

Could you add a changelog as well?

@smurfix (Contributor, Author) commented Jan 30, 2025

Done.

We'll see whether that is sufficient …

@wjakob (Owner) commented Jan 30, 2025

After some more thought, I don't think that this really fixes the issue.

While nb::is_alive() might return true when the condition is checked, there is no guarantee that this is still the case at the next instruction. In other words, this is a race condition: the interpreter can be finalized between the check and the use. If you want to avoid undefined behavior, you will need to prevent this problem in another way.
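A minimal sketch of the check-then-use gap described above, assuming nothing about nanobind's internals: an atomic flag (`g_alive`) stands in for what `nb::is_alive()` reports, and `guarded_release`, `finalize_interpreter` and `run_scenario` are hypothetical names used only for illustration. The `window` callback simulates another thread being scheduled between the check and the use, which makes the race reproducible deterministically.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical model (not nanobind code): an atomic flag stands in for
// "the interpreter is alive", which nb::is_alive() reports in the real API.
std::atomic<bool> g_alive{true};

// Records whether the guarded "use" ran after finalization, i.e. the case
// that would be undefined behavior against a real interpreter.
bool g_used_after_finalize = false;

void finalize_interpreter() { g_alive.store(false); }

// Shape of a destructor guarded by an is_alive()-style check.
// `window` runs between the check and the use, simulating another thread
// (e.g. the one running interpreter shutdown) being scheduled there.
void guarded_release(const std::function<void()> &window) {
    if (!g_alive.load())
        return;  // check: skip cleanup if the interpreter is already gone
    window();    // nothing stops the state from changing here
    // use: real code would acquire the GIL and drop references at this
    // point; the model only records whether the interpreter was still there.
    if (!g_alive.load())
        g_used_after_finalize = true;
}

// Runs one guarded destructor, optionally finalizing inside the window.
// Returns true if the use happened after finalization.
bool run_scenario(bool finalize_in_window) {
    g_alive.store(true);
    g_used_after_finalize = false;
    guarded_release([&] {
        if (finalize_in_window)
            finalize_interpreter();
    });
    // Sanity check: the interpreter state matches the scenario.
    assert(g_alive.load() != finalize_in_window);
    return g_used_after_finalize;
}
```

`run_scenario(false)` returns false, while `run_scenario(true)` returns true: the check passed, yet the use still ran against a finalized interpreter. Adding more checks only narrows the window; it never closes it.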

@smurfix (Contributor, Author) commented Jan 30, 2025

That is true in general. But what happens when the Python program throws an uncaught exception, otherwise ends abnormally, or is simply signalled to end?

I'm not concerned with preventing a 0.1% chance of a hard core dump and whatnot due to a race condition in this case. I'm concerned with preventing a 100% chance of getting one.

@wjakob (Owner) commented Jan 30, 2025

By having a condition variable, mutex, or similar synchronization mechanism, you can guarantee that the right order is enforced during shutdown. If the application is killed, then the kernel will shut things down and none of this code will run at all.

A crash that happens in 0.1% of runs is super annoying because it is so hard to reproduce. I would rather have software fail spectacularly than accumulate lots of issues in the long tail.
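A minimal sketch of the mutex/condition-variable approach suggested above, under stated assumptions: `ShutdownGate`, `release_resource` and `releases_after_shutdown` are hypothetical names, not nanobind API, and a plain atomic flag stands in for `Py_Finalize()`. Destructors register themselves with `try_enter()` instead of doing a bare alive check, and `shutdown()` first refuses new entrants, then waits until every destructor that already passed the check has called `leave()`. Because the check and the registration happen under the same lock that shutdown takes, the window between them is closed.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper (not part of nanobind).
class ShutdownGate {
public:
    // Returns true if the caller may still touch the interpreter.
    // A true result must be paired with exactly one call to leave().
    bool try_enter() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!alive_)
            return false;
        ++in_flight_;
        return true;
    }

    void leave() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(in_flight_ > 0);
        if (--in_flight_ == 0)
            cv_.notify_all();
    }

    // Call once, before finalizing the interpreter: refuse new entrants,
    // then block until all current ones have left.
    void shutdown() {
        std::unique_lock<std::mutex> lock(mutex_);
        alive_ = false;
        cv_.wait(lock, [this] { return in_flight_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool alive_ = true;
    int in_flight_ = 0;
};

// Shape of a destructor that must drop Python references.
// Returns true if the cleanup ran, false if it was skipped.
bool release_resource(ShutdownGate &gate) {
    if (!gate.try_enter())
        return false;  // shutdown has begun: skip (i.e. leak) instead
    // ... real code would acquire the GIL and release references here ...
    gate.leave();
    return true;
}

// A worker thread keeps running guarded cleanups while the main thread
// shuts down. Returns how many cleanups touched the "interpreter" after
// it was finalized.
int releases_after_shutdown() {
    ShutdownGate gate;
    std::atomic<bool> finalized{false};
    std::atomic<int> late{0};
    std::thread worker([&] {
        for (int i = 0; i < 10000; ++i) {
            if (gate.try_enter()) {
                if (finalized.load())
                    ++late;  // would be a use-after-finalize
                gate.leave();
            }
        }
    });
    gate.shutdown();
    finalized.store(true);  // stands in for Py_Finalize()
    worker.join();
    return late.load();
}
```

`releases_after_shutdown()` returns 0 on every run regardless of thread scheduling: once `shutdown()` returns, no cleanup is in flight and none can start. One caveat for a real extension: a thread that calls `shutdown()` while holding the GIL would deadlock against a destructor blocked in `nb::gil_scoped_acquire`, so the wait would presumably need to happen with the GIL released (for example inside an `nb::gil_scoped_release` block).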

@smurfix (Contributor, Author) commented Jan 30, 2025

On the other hand, in a program that does have mutexes and whatnot, this issue rears its ugly head only when an abnormal situation happens. In that case it turns a reasonably clean stack trace and debug dump into an inconsistent, ugly mess. Been there, done that: with this patch applied, I can at least get out of my debugging session without crashing or having to resort to "kill -9".
