How to properly hold onto an object in embedded mode? #267

Open
residentsummer opened this issue Nov 10, 2024 · 19 comments

@residentsummer

Hello! First, I'd like to thank you for the amazing work on this library. It's a mind-bending level of inter-language interaction, IMO. 👏

I was experimenting with it in the embedded mode. Not sure if I was using it in the intended way, but... I took a considerably sized web-app and tried to add some kind of a debugging/introspection interface to it with clj & cljs.

The app uses gevent internally for non-blocking IO operations, so the setup was a bit involved, but in the end it worked out. I'll explain it to give a better picture:

  • app starts as a Python process
  • gevent is called at the earliest to do its monkey-patching of IO
  • python app continues its startup
  • separate python thread (real OS thread) is started, which does all JVM/clojure initialization (one more thread I guess) and then calls -main of the clojure part of the app
  • in the main thread of the python app a queue is created and worker greenlet (fake thread or coroutine) is started to watch it
  • when clojure world needs to call something from python world, that involves gevent, it puts a task into a queue for the worker to execute it on the main thread
  • if task needs to return something to clojure world, a one-off queue is created to transfer the return value

It worked fine. I've even made a little introspection tool that can peer into a running python app's vars, call functions, toggle stuff and so on.

Then, I've tried to implement a tap> like system (just putting python objects into an atom for later inspection), to send info from python to clojure... And was abruptly stopped by a SIGSEGV. :(

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000aaaac541ef74, pid=1685, tid=1869
#
# JRE version: OpenJDK Runtime Environment (17.0.12+7) (build 17.0.12+7-Ubuntu-1ubuntu222.04)
# Java VM: OpenJDK 64-Bit Server VM (17.0.12+7-Ubuntu-1ubuntu222.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  [python3+0xcef74]
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /var/www/hs_err_pid1685.log
#
# If you would like to submit a bug report, please visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-17
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

It seems that the objects I send through "tap" got GC'ed on the python side. I tried to find something about holding onto objects in the documentation, but nothing quite fit my case. Digging through the code, I stumbled upon track-pyobject and incref-and-track. Using the latter on the objects I send to the clojure world did not help though... I still got a SIGSEGV, but in a different internal python function. To validate my assumption about objects being GC'ed, I implemented a dumb "borrowing" mechanism on the python side (putting the objects to be sent to clojure in a dict), and it worked okay.
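A minimal sketch of the "borrowing" idea described above, in plain Python. The `BorrowRegistry` class and its method names are hypothetical (they are not part of libpython-clj or of the repro repo); the only point is that a strong reference held in a dict keeps the object alive until someone explicitly releases it.

```python
import itertools


class BorrowRegistry:
    """Keeps strong references to objects handed to another runtime."""

    def __init__(self):
        self._held = {}
        self._tickets = itertools.count(1)

    def borrow(self, obj):
        """Store obj and return a ticket that identifies this borrow."""
        ticket = next(self._tickets)
        self._held[ticket] = obj
        return ticket

    def release(self, ticket):
        """Drop the reference; the object may be freed after this call."""
        return self._held.pop(ticket, None)

    def __len__(self):
        return len(self._held)


registry = BorrowRegistry()
payload = {"test": [1, 2, 3, 4]}
ticket = registry.borrow(payload)    # keep alive while the other side uses it
held_count = len(registry)
released = registry.release(ticket)  # has to be called explicitly by someone
```

The drawback is visible in the last line: nothing releases the object unless the holder knows when the other side is done with it.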

I'd prefer not to do ->jvm on objects, because the idea is to use them again in a python app to replay some actions that it performs.

My question: Is there a way to hold onto python objects from clojure in embedded mode so that refcounts will be decreased when references to the pyobjs are GC'ed on the JVM?

Thank you for your time :)

Python 3.10.12
Clojure 1.12
openjdk 17.0.12 2024-07-16
OpenJDK Runtime Environment (build 17.0.12+7-Ubuntu-1ubuntu222.04)
OpenJDK 64-Bit Server VM (build 17.0.12+7-Ubuntu-1ubuntu222.04, mixed mode, sharing)

@jjtolton
Contributor

> Hello! First, I'd like to thank you for the amazing work on this library. It's a mind-bending level of inter-language interaction, IMO. 👏
>
> I was experimenting with it in the embedded mode. Not sure if I was using it in the intended way, but... I took a considerably sized web-app and tried to add some kind of a debugging/introspection interface to it with clj & cljs.
>
> The app uses gevent internally for non-blocking IO operations, so the setup was a bit involved, but in the end it worked out. I'll explain it to give a better picture:
>
>   • app starts as a Python process
>   • gevent is called at the earliest to do its monkey-patching of IO
>   • python app continues its startup
>   • separate python thread (real OS thread) is started, which does all JVM/clojure initialization (one more thread I guess) and then calls -main of the clojure part of the app
>   • in the main thread of the python app a queue is created and worker greenlet (fake thread or coroutine) is started to watch it
>   • when clojure world needs to call something from python world, that involves gevent, it puts a task into a queue for the worker to execute it on the main thread

Do you mean gevent handles it, or this technique is only employed when gevent information is required?

What specific type of queue are you using? collections.deque?

>   • if task needs to return something to clojure world, a one-off queue is created to transfer the return value

You can also do it continuation/callback style. Meaning, pass the consuming Clojure function as an argument to the Python function.
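A minimal sketch of the callback style mentioned above, using only the Python standard library rather than gevent or libpython-clj. The names `worker`, `on_done`, and the task tuple layout are illustrative assumptions. In the real setup the callback would be the Clojure function passed across the bridge.

```python
import queue
import threading

tasks = queue.Queue()


def worker():
    """Runs tasks on one thread and hands each outcome to its callback."""
    while True:
        item = tasks.get()
        if item is None:              # sentinel: stop the worker
            break
        fn, args, on_done = item
        try:
            outcome = (fn(*args), None)
        except Exception as exc:      # report failures through the callback too
            outcome = (None, exc)
        on_done(*outcome)


results = []
done = threading.Event()


def on_done(value, error):
    """Consumer side: receives either a value or an exception."""
    results.append((value, error))
    done.set()


thread = threading.Thread(target=worker)
thread.start()
tasks.put((sum, ([1, 2, 3],), on_done))
done.wait(timeout=5)                  # the caller still needs a way to wait
tasks.put(None)
thread.join()
```

Note that the callback only delivers the value. If the caller has to block until the value is ready, it still needs some synchronization primitive of its own, which is the `Event` here.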

> It worked fine, I've even made a little introspection tool, that can peer into a running python app vars and call functions, toggle stuff and so on.
>
> Then, I've tried to implement a tap> like system (just putting python objects into an atom for later inspection), to send info from python to clojure... And was abruptly stopped by a SIGSEGV. :(

Have you considered using the actual tap function?

i.e., (require-python '[__main__ :bind-ns true]) (py/set-attr! __main__ "pytap" tap>)

Not positive it will help, just trying to help out with some additional options-- you are in uncharted waters, so I want to give you as many tools as I can.


> It seems that objects I send through "tap" got GC'ed on python part. I've tried to find something about holding to objects in the documentation, but nothing was particularly fitting to my case. Digging through the code, I've stumbled upon track-pyobject and incref-and-track. Using the latter on the objects I send to clojure world did not help though... I still got a SIGSEGV, but in different internal python function. To validate my assumption about objects being GC'ed, I've implemented a dumb "borrowing" mechanics on python side (putting objects, to be sent to clojure, in a dict) and it worked okay.
>
> I'd prefer not to do ->jvm on objects, because the idea is to use them again in a python app to replay some actions that it performs.

I've written many production web applications and search engines with libpython-clj, and I treat the clojure/python "barrier" as if it were a serialized interface -- I am not shy about using ->jvm, ->py, etc. as needed. I also tend to stick to core data structures that transfer well, such as hashmaps/dicts and lists/vectors. Otherwise, you would need to implement some sort of memento/state pattern.

(I will also mention that if this is a commercial application where absolute performance is critical, you may want to contact TechAscent directly.)
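To make the memento idea concrete, here is a small Python-only sketch. The `QueryParams` class and the `to_memento`/`from_memento` helpers are hypothetical. It assumes the state to replay can be reduced to JSON-friendly data, which then crosses the boundary as a plain string instead of a live Python object.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QueryParams:
    """Hypothetical object whose state should be replayable later."""
    table: str
    limit: int


def to_memento(params):
    """Reduce the object to plain data that is safe to hand across."""
    return json.dumps(asdict(params))


def from_memento(blob):
    """Rebuild an equivalent object on the Python side for replay."""
    return QueryParams(**json.loads(blob))


original = QueryParams(table="locations", limit=10)
blob = to_memento(original)      # this string is what the JVM would keep
restored = from_memento(blob)    # later: recreate the object and replay
```

The restored object is equal to the original but is a fresh instance, so no Python reference has to stay alive while the snapshot sits on the JVM side.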

@residentsummer
Author

residentsummer commented Nov 11, 2024

Hi, @jjtolton ! Thanks for helping out :)

> Do you mean gevent handles it, or this technique is only employed when gevent information is required?

Not sure I understood the question. In my setup, the python app is running on the main thread, mostly unaware that clojure threads exist.

The python app uses gevent internally for the non-blocking IO. For example, there are functions, that pull various data from databases, and those use connection pools. Python code from the main thread can call them freely, but trying the same from other threads (python or not) will lead to errors due to gevent hub being bound to the main thread. So, when trying to call those functions from clojure, I had to do it through the special queue (see below).

> What specific type of queue are you using? collections.deque?

I had to use thread-safe queue.Queue.

> You can also do it continuation/callback style. Meaning, pass the consuming Clojure function as an argument to the Python function.

Returning a value from a task on a queue is one thing (that can be done via callback, indeed), another is waiting for the value to be ready on the other side. One-off queues solve both.

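import logging
import time
from queue import Queue
from gevent import spawn, get_hub

from . import clojure  # some cljbridge-inspired tooling

# Assumed setup: the snippet uses `logger` and `time` but does not show them.
logger = logging.getLogger(__name__)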

...

def spawn_real_thread(fn, *args, **kwargs):
    '''Spawns real OS thread, while loops are running on gevent'''
    pool = get_hub().threadpool
    return pool.spawn(fn, *args, **kwargs)


def start_clojure_thread(aliases=None, with_repl=True, post_init=None):
    '''Make sure repl deps are there, if you want to start it'''

    # Must be done on the main thread
    jvm_params = clojure.prepare_jvm_params(aliases)

    def clj_thread():
        clojure.init(jvm_params)
        if with_repl:
            clojure.start_repl("0.0.0.0", 50000)

        if post_init:
            post_init()

        while True:
            time.sleep(3600)

    spawn_real_thread(clj_thread)
    spawn(main_thread_worker)


__MAIN_THREAD_QUEUE = Queue()

def call_on_main_thread(fn, *args, **kwargs):
    return do_on_main_thread(True, fn, *args, **kwargs)


def do_on_main_thread(wait, fn, *args, **kwargs):
    retq = Queue() if wait else None
    # logger.info("[dmt] attempting put: %r", __MAIN_THREAD_QUEUE)
    __MAIN_THREAD_QUEUE.put((fn, None, args, kwargs, retq))
    # logger.info("[dmt] put done")

    if not wait:
        return None

    # logger.info("[dmt] waiting for result")
    res, exc = retq.get()
    retq.task_done()

    if exc is not None:
        raise exc

    return res


def main_thread_worker():
    while True:
        # logger.warning("[dmq] waiting on queue: %r", __MAIN_THREAD_QUEUE)
        fn, coro, args, kwargs, retq = __MAIN_THREAD_QUEUE.get()

        # logger.info("got task: %s", (fn or coro, args, kwargs, retq))
        if fn:
            _run_function(fn, args, kwargs, retq)
        else:
            assert not (args or kwargs), "Not expecting args/kwargs for coros"
            _run_coro(coro, retq)

        # logger.info("[dmq] spawned glet")
        __MAIN_THREAD_QUEUE.task_done()


def _run_function(fn, args, kwargs, retq):
    '''Call this from main thread'''
    def glet():
        try:
            res = fn(*args, **kwargs)
            # logger.warning("res: %s", res)
            if retq:
                retq.put((res, None))
        except Exception as e:
            if retq:
                retq.put((None, e))
            else:
                logger.exception("[glt] error: %s", e)

    spawn(glet)

And this is how I call it from clojure:

(defmacro call-on-main-thread [& forms]
  `(app.clojure_gevent_support/call_on_main_thread
     #(do ~@forms)))

(defn get-location-titles [loc-id]
  (-> (call-on-main-thread
        (geo/titles_by_location loc-id))  ;; geo is a python module
      ->jvm))

> Have you considered using the actual tap function?
> i.e., (require-python '[__main__ :bind-ns true]) (py/set-attr! __main__ "pytap" tap>)

I guess it won't be different, but I'll try. This is what I've done initially (save-query-params will be called from the main python app (on the main thread, no queue)):

(defonce qparams (atom []))

(defn save-query-params [new-query-params]
  (swap! qparams conj new-query-params))

Accessing content of the atom leads to segfault.

@jjtolton
Contributor

There are a lot of moving parts here (by design, of course), so it's hard to pinpoint the issue precisely. Would you please paste the segfault log, i.e., /var/www/hs_err_pid1685.log (or put it in a gist)? Without being able to reproduce it locally, there might not be very much I can assist with. I also want to point out we are in heavy Gettier territory here, since we are in particularly uncharted waters involving several systems whose interactions are not well understood.

@residentsummer
Author

I believe the problem lies not in the interaction of these moving parts, but in my lack of understanding of how objects should be shared between python world and jvm world when using libpython-clj.

I've cut out everything irrelevant (I only mentioned the queue earlier to give a full picture of what I'm trying to do) and made a repo to reproduce the issue: libpython-repro

Important files are just these two:

I've also added a log from one of the crashes:
https://github.com/residentsummer/libpython-repro/blob/master/hs_err_pid1685.log

@jjtolton
Contributor

jjtolton commented Nov 17, 2024

Thanks for that. I don't have a solution yet, but I was able to narrow the problem down considerably:

(ns eeel.repro
  (:require
   [libpython-clj2.python :as py :refer [py..]]
   [libpython-clj2.require :refer [require-python import-python]]
   ;;
   ))

(import-python)

(defonce taps (atom []))

(defn save-to-tap [obj]
  (swap! taps conj obj))

(comment
  
  (let [{{save-something "save_something"
          set-tap "set_tap"}
          :globals} (py/run-simple-string "
import json
from pprint import pformat

TAP = None

def set_tap(fn):
    global TAP
    TAP = fn

def save_something():
    '''To be called from clojure'''
    test_obj = json.loads('{\"test\": [1, 2, 3, 4]}')
    print(\"saved:\", pformat(test_obj))
    if TAP:
        TAP(test_obj)


")]
    (def save-something save-something)
    (def set-tap set-tap)
    )
  (set-tap save-to-tap)
  (save-something)
  @taps ;;=> [{'__str__': [{...}, [...], 3, 4]}]
  @taps ;; segfault
  )

@jjtolton
Contributor

I don't think this has anything to do with async, it looks like the string is being double-freed.

@jjtolton
Contributor

If I change the test to the following:

(ns eeel.repro
  (:require
   [libpython-clj2.python :as py :refer [py..]]
   [libpython-clj2.require :refer [require-python import-python]]
   ;;
   ))

(import-python)

(defonce taps (atom []))

(defn save-to-tap [obj]
  (swap! taps conj obj))

(comment

  (let [{{save-something "save_something"
          set-tap "set_tap"}
         :globals} (py/run-simple-string "
import json
from pprint import pformat

TAP = None

def set_tap(fn):
    global TAP
    TAP = fn

def save_something():
    '''To be called from clojure'''
    test_obj = json.loads('{\"test\": [1, 2, 3, 4]}')
    print(\"saved:\", test_obj)
    if TAP:
        TAP(test_obj)


")]
    (def save-something save-something)
    (def set-tap set-tap))
  (set-tap tap>)
  (add-tap println)
  (save-something))

repeatedly invoking (save-something) has interesting results:

saved: {'test': [1, 2, 3, 4]}
{'Py_Repr': [{...}, [...]]}
saved: {'test': [1, 2, 3, 4]}
saved: {'test': [1, 2, 3, 4]}
{'Py_Repr': [{...}, [...]]}
saved: {'test': [1, 2, 3, 4]}
{'Py_Repr': [{...}, [...]]}
saved: {'test': [1, 2, 3, 4]}
saved: {'test': [1, 2, 3, 4]}
{'Py_Repr': [{...}, [...]]}

@residentsummer
Author

Thanks for looking into it. Before crashing it usually prints some garbage, when inspecting contents of taps atom. This is, I believe, due to objects being GC'ed on python side - the pointers, stored in the atom point to now-free regions of memory:

starting clojure thread
starting clojure app
clojure -main called
saved: {'test': [1, 2, 3, 4]}
taps: [{'__str__': [{...}, [...], 3, 4]}]
saved: {}
taps: [{} {}]
saved: {'test': []}

What I do not understand is why test objects look corrupted before sending them to TAP... (see saved: lines)
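The lifetime problem described above can be observed in pure CPython, without any JVM involved. The sketch below is only an illustration of reference counting, not of libpython-clj internals. The `Payload` class is a hypothetical stand-in, needed because plain dicts cannot be weakly referenced. A weak reference plays the role of a raw pointer that does not keep its target alive.

```python
import gc
import weakref


class Payload(dict):
    """dict subclass, used only because plain dicts cannot be weak-referenced."""


def make_and_forget():
    obj = Payload(test=[1, 2, 3, 4])
    return weakref.ref(obj)       # does not keep obj alive after return


def make_and_hold(container):
    obj = Payload(test=[1, 2, 3, 4])
    container.append(obj)         # strong reference outlives the call
    return weakref.ref(obj)


forgotten = make_and_forget()
held = []
kept = make_and_hold(held)
gc.collect()
```

After this runs, `forgotten()` returns `None` because the object was freed when the function returned, while `kept()` still returns the object. A raw pointer gives no such safe `None`: it simply points at memory that may already have been reused, which would be consistent with the garbage output seen before the crash.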

@jjtolton
Contributor

Yeah I understand. Ideally that wouldn't happen -- but I'd like to offer a practical workaround. I tend to follow a "render unto Caesar that which is Caesar's" approach with mixed runtime programming, and follow the path of least resistance.

Perhaps consider the following:

import json
from pprint import pformat
from collections import deque

taps = deque()

TAP = lambda x: taps.append(x)


def save_something():
    '''To be called from clojure'''
    test_obj = json.loads('{"test": [1, 2, 3, 4]}')
    print("saved:", test_obj)
    if TAP:
        TAP(test_obj)

Full example:

(ns eeel.repro
  (:require
   [libpython-clj2.python :as py :refer [py..]]
   [libpython-clj2.require :refer [require-python import-python]]
   ;;
   ))

(import-python)

(defonce taps (atom []))

(defn save-to-tap [obj]
  (swap! taps conj obj))

(comment

  (let [{{save-something "save_something"
          set-tap "set_tap"}
         :globals} (py/run-simple-string "
import json
from pprint import pformat
from collections import deque

taps = deque()

TAP = lambda x: taps.append(x)


def save_something():
    '''To be called from clojure'''
    test_obj = json.loads('{\"test\": [1, 2, 3, 4]}')
    print(\"saved:\", test_obj)
    if TAP:
        TAP(test_obj)


")]
    (def save-something save-something)
    (def set-tap set-tap))
  
  (save-something)
  (save-something)
  (save-something)
  (require-python '[__main__ :bind-ns true])
  __main__/taps ;; => deque([{'test': [1, 2, 3, 4]}, {'test': [1, 2, 3, 4]}, {'test': [1, 2, 3, 4]}])
  )

We'd have to dig into the serialization FFI to find out why the string is not being marshalled correctly, but this will probably get you where you are going a little better since you are using Python's machinery instead of going through the data marshalling code.

@residentsummer
Author

I think there is no crash with real tap> because object got printed while being called from python's save_something. It's still running and holds a reference to the object.

@jjtolton
Contributor

The real tap> crashes too sometimes. Not super clear why.

@jjtolton
Contributor

> Thanks for looking into it. Before crashing it usually prints some garbage, when inspecting contents of taps atom. This is, I believe, due to objects being GC'ed on python side - the pointers, stored in the atom point to now-free regions of memory:

I'm still not sure of the root cause, but you seem to be correct that when the Python reference is lost, we end up dereferencing a dangling pointer. When I hang on to the references, the @taps works fine:

(ns eeel.repro
  (:require
   [libpython-clj2.python :as py :refer [py..]]
   [libpython-clj2.require :refer [require-python import-python]]
   ;;
   ))

(import-python)

(defonce taps (atom []))

(defn save-to-tap [obj]
  (swap! taps conj obj))

(comment

  (let [{{save-something "save_something"
          set-tap "set_tap"}
         :globals} (py/run-simple-string "
import json
from pprint import pformat
from collections import deque

taps = deque()

_TAP = None
TAP = lambda x: (taps.append(x), _TAP(x))

def set_tap(fn):
    global _TAP
    _TAP = fn

def save_something():
    '''To be called from clojure'''
    test_obj = json.loads('{\"test\": [1, 2, 3, 4]}')
    print(\"saved:\", test_obj)
    if TAP:
        TAP(test_obj)


")]
    (def save-something save-something)
    (def set-tap set-tap))
  
  (set-tap save-to-tap)
  (save-something)
  (save-something)
  (save-something)
  (require-python '[__main__ :bind-ns true])
  __main__/taps ;; => deque([{'test': [1, 2, 3, 4]}, {'test': [1, 2, 3, 4]}, {'test': [1, 2, 3, 4]}])
  @taps ;; [{'test': [1, 2, 3, 4]} {'test': [1, 2, 3, 4]} {'test': [1, 2, 3, 4]}]
  )

@residentsummer
Author

Yeah, there is a workaround (similar to your deque approach) in the repo - save-to-tap-with-borrow. But it means manual tracking of objects, and I'd like to avoid that. Ideally :)

I've also tried doing ffi/incref before putting object into an atom. There is no segfault with it, but I'm sure it means memory leak. Surprisingly, ffi/incref-and-track still leads to segfault. Probably because "track" part has something to do with stack-context feature of libpython-clj and object got released instantly...
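Why a bare incref amounts to a leak can be shown from Python itself. The sketch below calls CPython's `Py_IncRef` and `Py_DecRef` through `ctypes.pythonapi`. It is CPython-specific and is not the same code path as libpython-clj's `ffi/incref`, only an illustration of the same reference-counting rule.

```python
import ctypes
import sys

incref = ctypes.pythonapi.Py_IncRef
decref = ctypes.pythonapi.Py_DecRef
for fn in (incref, decref):
    fn.argtypes = [ctypes.py_object]
    fn.restype = None

obj = ["kept", "alive"]
before = sys.getrefcount(obj)

incref(obj)                       # extra reference that no Python name owns
after_incref = sys.getrefcount(obj)

decref(obj)                       # the matching release that must happen somewhere
after_decref = sys.getrefcount(obj)
```

The count goes up by one after `incref` and only returns to its starting value after the matching `decref`. If that second call never happens, the object can never be freed, even after every ordinary reference to it is gone.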

@jjtolton
Contributor

jjtolton commented Nov 17, 2024

Hmmm something tells me you are doing more than "tap for visual inspection" with this workflow, then, because otherwise manually clearing the deque at the REPL wouldn't be a lot of work, and you could write context-managers to do the work as well, seeing as you'd have to manually clear the @taps anyway.
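A minimal sketch of the context-manager idea, in plain Python. The `tap_session` name is hypothetical. It assumes the tapped objects are only needed while the block is active and can be dropped as soon as it exits.

```python
from collections import deque
from contextlib import contextmanager

taps = deque()


@contextmanager
def tap_session():
    """Hold tapped objects for the duration of the block, then drop them."""
    try:
        yield taps
    finally:
        taps.clear()              # releases the references even on error


with tap_session() as held:
    held.append({"test": [1, 2, 3, 4]})
    size_inside = len(held)

size_after = len(taps)
```

This only helps when the inspection has a clear beginning and end. It does not cover the case where some other tool keeps the references for an unknown amount of time.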

@residentsummer
Author

The issue is that we might not have control over the container that holds the objects. In my simple example we do, but imagine we would like to send tapped objects to portal.

If we're "borrowing" them (or putting them into a deque), either on the python side or in clojure with incref, all is fine until the moment we clear the values from portal. After that the objects are, essentially, "leaked", because there is no way for us to know when values are released from the UI.

@jjtolton
Contributor

Ahhh, I see. You need them to be in Clojure because they are hooked up to Clojure tools you don't control, such as portal, and you don't know when the reference will be released.

I am curious why you would want to send a raw python object to something like Portal? I don't think it was designed to inspect those. You'd most likely be better off converting to a JVM object anyway.

Are you using Portal for monitoring in production, or something? It's hard for me to imagine that in a dev environment you could accumulate so many references that you'd eat up all the system memory holding onto references, and you could use flags to disable the behavior in production.

@jjtolton
Contributor

I'll be honest this is one area where I'm not exactly clear what the correct behavior is. Python is right to GC those strings, Clojure is right to hold a reference to them. I suppose the ideal would be to inform Python's garbage collector that Clojure is still holding references, then hook into the JVM garbage collector to notify Python's garbage collector when it GCs a JVM-managed python reference -- but I'm not sure that the "juice is worth the squeeze" there.
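The "release when the other side's collector lets go" idea can be sketched as a pure-Python analogy using `weakref.finalize`. Everything here is hypothetical: `Handle` stands in for whatever wrapper the JVM would hold, and `held` stands in for the Python-side strong references. It shows the shape of the mechanism, not how libpython-clj does or should implement it.

```python
import gc
import weakref

held = {}  # Python-side strong references, keyed by object id


class Handle:
    """Stand-in for the wrapper object the other runtime would hold."""

    def __init__(self, key):
        self.key = key


def share(obj):
    """Keep obj alive until the returned handle is garbage collected."""
    key = id(obj)
    held[key] = obj
    handle = Handle(key)
    # When the handle is collected, drop the strong reference automatically.
    weakref.finalize(handle, held.pop, key, None)
    return handle


handle = share({"test": [1, 2, 3, 4]})
count_while_alive = len(held)
del handle                        # the "other side" lets go of its wrapper
gc.collect()
count_after = len(held)
```

The object stays in `held` for exactly as long as the handle is reachable. A cross-runtime version would need the same callback to be triggered by the JVM's collector and then run while holding the GIL, which is where the extra complexity comes from.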

@jjtolton
Contributor

I thought I remembered there being a somewhat simple trick to get around this 🤔

Something like setting a variable in Python, getting the data in Clojure, then clearing the variable in Python.

@jjtolton
Contributor

Well, given the options, this still seems like the best bet:

(ns eeel.repro
  (:require
   [libpython-clj2.python :as py :refer [py..]]
   [libpython-clj2.require :refer [require-python import-python]]
   ;;
   ))

(import-python)

(defonce taps (atom []))
(defonce taps1 (atom []))

(defn save-to-tap [obj]
  (swap! taps conj obj))

(comment

  (let [{{save-something "save_something"
          set-tap "set_tap"}
         :globals} (py/run-simple-string "
import json
from pprint import pformat
from collections import deque

_TAP = None
to_jvm = None

def TAP(x):
    _TAP(to_jvm(x))
  
def set_tap(fn):
    global _TAP
    _TAP = fn

def save_something():
    '''To be called from clojure'''
    test_obj = json.loads('{\"test\": [1, 2, 3, 4]}')
    print(\"saved:\", test_obj)
    if TAP:
        TAP(test_obj)
")]
    (def save-something save-something)
    (def set-tap set-tap))
  (require-python '[__main__ :bind-ns true])
  (python/setattr __main__ "to_jvm" py/->jvm)
  
  (set-tap save-to-tap)
  (save-something)
  (save-something)
  (save-something)
  @taps ;; => [{"test" [1 2 3 4]} {"test" [1 2 3 4]} {"test" [1 2 3 4]}]
  )

If you need something more complex than what ->jvm can do, check out the pydafy interface for extensible datafication.
