Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested signal handling leads to excessive slowdown of sig_occurred() #215

Open
user202729 opened this issue Dec 30, 2024 · 13 comments
Open

Comments

@user202729
Copy link
Contributor

From sagemath/sage#39224

SageMath session:

sage: %%cython
....: from cysignals.signals cimport sig_on, sig_off, sig_occurred
....: from libc.stdlib cimport abort
....: def f():
....:     try:
....:         sig_on()
....:         abort()
....:         sig_off()
....:     except:
....:         raise ValueError
....: def g():
....:     for i in range(50):
....:         ignore = sig_occurred()
sage: %time g()  # fast
CPU times: user 4 μs, sys: 0 ns, total: 4 μs
Wall time: 6.44 μs
sage: f()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File /tmp/tmpgqxu9lji/spyx/_tmp_tmpfkjflhcs_tmp_f70vs9fj_pyx/_tmp_tmpfkjflhcs_tmp_f70vs9fj_pyx_0.pyx:5, in _tmp_tmpfkjflhcs_tmp_f70vs9fj_pyx_0.f()
      4 try:
----> 5     sig_on()
      6     abort()

RuntimeError: Aborted

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 f()

File /tmp/tmpgqxu9lji/spyx/_tmp_tmpfkjflhcs_tmp_f70vs9fj_pyx/_tmp_tmpfkjflhcs_tmp_f70vs9fj_pyx_0.pyx:9, in _tmp_tmpfkjflhcs_tmp_f70vs9fj_pyx_0.f()
      7         sig_off()
      8     except:
----> 9         raise ValueError
     10 def g():
     11     for i in range(50):

ValueError: 
sage: %time g()  # slow
CPU times: user 3.84 s, sys: 133 μs, total: 3.84 s
Wall time: 3.85 s
sage: raise RuntimeError
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 1
----> 1 raise RuntimeError

RuntimeError: 
sage: %time g()  # fast again
CPU times: user 79.9 ms, sys: 0 ns, total: 79.9 ms
Wall time: 79.9 ms

Note the invocation of g() that is marked # slow.

@user202729
Copy link
Contributor Author

user202729 commented Dec 30, 2024

I guess recursive .__context__ can be accessed. (.__cause__ is the same as .__context__ in the case of raise ValueError from e, but in nested exception then .__cause__ is None but .__context__ is not.)

That said, is there any cleaner way than the current (fragile) method of using reference counting and run the garbage collector? It looks like it will only be a matter of time until someone comes across a case where the garbage collector is run too many times consecutively which leads to unexplained slowdown.

@user202729
Copy link
Contributor Author

user202729 commented Dec 30, 2024

Actually nested signal handling is not needed, you might as well do something more innocuous such as

%%cython
from libc.stdlib cimport abort
from cysignals.signals cimport sig_on
from cysignals.tests import print_sig_occurred
from sage.all import Integer

def f():
    a = [Integer(i) for i in range(100)]
    try:
        sig_on()
        abort()
    except RuntimeError:
        del a

But sig_occurred is introduced for a single purpose in SageMath sagemath/sage#24986 .

@user202729
Copy link
Contributor Author

How about we throttle the garbage collector to run at most e.g. 10% of the (wall) time? That way, worst case something heavy is done in a place where sig_occurred() must honestly return nonnull and a lot of integers are being destroyed, there's ("only") 10% slowdown. And maybe a user warning.

@tornaria
Copy link
Contributor

Here's a way to reproduce something what appears to be the root cause of this in python, using cysignals 1.11.4:

$ python
Python 3.12.7 (main, Oct  3 2024, 14:05:50) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from cysignals.tests import test_abort, print_sig_occurred
>>> try:
...     test_abort()
... except:
...     raise ValueError
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "src/cysignals/tests.pyx", line 643, in cysignals.tests.test_abort
RuntimeError: Aborted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
ValueError
>>> print_sig_occurred()
RuntimeError: Aborted

Clearly, by the time I typed print_sig_occurred(), the exception had long been handled and gone.

This also seems to be the cause of sagemath/sage#38749 (comment)

Using python 3.13.1 gives the same.

@tornaria
Copy link
Contributor

No nesting of exceptions is necessary:

$ sage
┌────────────────────────────────────────────────────────────────────┐
│ SageMath version 10.5, Release Date: 2024-12-04                    │
│ Using Python 3.13.1. Type "help()" for help.                       │
└────────────────────────────────────────────────────────────────────┘
sage: from cysignals.tests import test_abort, print_sig_occurred
sage: test_abort()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 1
----> 1 test_abort()

File tests.pyx:634, in cysignals.tests.test_abort()

RuntimeError: Aborted
sage: raise ValueError
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 raise ValueError

ValueError: 
sage: %time for _ in range(50): print_sig_occurred()
RuntimeError: Aborted
RuntimeError: Aborted
[...]
RuntimeError: Aborted
RuntimeError: Aborted
CPU times: user 2.04 s, sys: 617 μs, total: 2.04 s
Wall time: 2.05 s

Indeed, test_abort() raises the exception and leaves it at cysigs.exc_value. Doing raise ValueError before calling sig_occurred() hides it (the refcount is 2 but it's not in sys.last_value).

Does not happen with pure python, but it does happen with ipython:

$ ipython
Python 3.13.1 (main, Dec 14 2024, 12:44:03) [GCC 13.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.31.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from cysignals.tests import test_abort, print_sig_occurred

In [2]: test_abort()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 1
----> 1 test_abort()

File tests.pyx:634, in cysignals.tests.test_abort()

RuntimeError: Aborted

In [3]: raise ValueError
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 raise ValueError

ValueError: 

In [4]: %time for _ in range(50): print_sig_occurred()
RuntimeError: Aborted
RuntimeError: Aborted
[...]
RuntimeError: Aborted
RuntimeError: Aborted
CPU times: user 672 ms, sys: 1.12 ms, total: 673 ms
Wall time: 676 ms

I suppose sage / ipython is storing the exception somewhere, which keeps the refcount from reaching 1.

@tornaria
Copy link
Contributor

Here is a PoC fix:

--- a/src/cysignals/signals.pyx
+++ b/src/cysignals/signals.pyx
@@ -28,6 +28,7 @@ from libc.stdio cimport freopen, stdin
 from cpython.ref cimport Py_XINCREF, Py_CLEAR
 from cpython.exc cimport (PyErr_Occurred, PyErr_NormalizeException,
         PyErr_Fetch, PyErr_Restore)
+from cpython.exc cimport PyErr_GetRaisedException, PyErr_SetRaisedException
 from cpython.version cimport PY_MAJOR_VERSION
 
 cimport cython
@@ -360,6 +361,15 @@ cdef void verify_exc_value() noexcept:
     Check that ``cysigs.exc_value`` is still the exception being raised.
     Clear ``cysigs.exc_value`` if not.
     """
+    cdef PyObject * raised = PyErr_GetRaisedException()
+
+    if cysigs.exc_value is not raised:
+        Py_CLEAR(cysigs.exc_value)
+
+    PyErr_SetRaisedException(raised)
+
+    return
+
     if cysigs.exc_value.ob_refcnt == 1:
         # No other references => exception is certainly gone
         Py_CLEAR(cysigs.exc_value)

(note: as is, this requires python 3.12+ and a few tests fail)

BEFORE:

$ sage
┌────────────────────────────────────────────────────────────────────┐
│ SageMath version 10.5, Release Date: 2024-12-04                    │
│ Using Python 3.13.1. Type "help()" for help.                       │
└────────────────────────────────────────────────────────────────────┘
sage: R.<x> = Zmod(4)[] ; f = x^2 + 3*x + 1
sage: %time str(f)
CPU times: user 838 μs, sys: 0 ns, total: 838 μs
Wall time: 847 μs
'x^2 + 3*x + 1'
sage: from cysignals.tests import test_abort
sage: test_abort()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 1
----> 1 test_abort()

File tests.pyx:634, in cysignals.tests.test_abort()

RuntimeError: Aborted
sage: raise ValueError
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 raise ValueError

ValueError: 
sage: %time str(f)
CPU times: user 711 ms, sys: 0 ns, total: 711 ms
Wall time: 713 ms
'x^2 + 3*x + 1'

AFTER:

$ sage
┌────────────────────────────────────────────────────────────────────┐
│ SageMath version 10.5, Release Date: 2024-12-04                    │
│ Using Python 3.13.1. Type "help()" for help.                       │
└────────────────────────────────────────────────────────────────────┘
sage: R.<x> = Zmod(4)[] ; f = x^2 + 3*x + 1
sage: %time str(f)
CPU times: user 841 μs, sys: 0 ns, total: 841 μs
Wall time: 847 μs
'x^2 + 3*x + 1'
sage: from cysignals.tests import test_abort
sage: test_abort()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 1
----> 1 test_abort()

File tests.pyx:634, in cysignals.tests.test_abort()

RuntimeError: Aborted
sage: raise ValueError
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 raise ValueError

ValueError: 
sage: %time str(f)
CPU times: user 78 μs, sys: 13 μs, total: 91 μs
Wall time: 93.5 μs
'x^2 + 3*x + 1'

@user202729
Copy link
Contributor Author

user202729 commented Jan 23, 2025

The problem with your fix is it negates the whole point of sig_occurred.

The reason why it's introduced is the following: (test with an environment with ipython setuptools cython)

%load_ext cython
##
%%cython
from cysignals.signals cimport sig_on, sig_off, sig_occurred
from cysignals.tests import print_sig_occurred
from libc.stdlib cimport abort

cdef class Integer:
    def __dealloc__(self):
        print_sig_occurred()
        print("deleted")

#class Integer:
#   def __del__(self):
#       print_sig_occurred()
#       print("deleted")

def mutate(a):
    abort()

def f1():
    a = Integer()
    sig_on()
    mutate(a)
    sig_off()

def g1():
    print("before call f1")
    f1()
    print("after call f1")
    # CANNOT print "no current exception" here

g1()

in SageMath, if mutate(a) is interrupted with e.g. a ctrl-C, a might be left in an inconsistent state, so fast_tp_dealloc checks for whether a signal occurred to avoid destroying a.

Of course, the solution implemented in SageMath is suboptimal in that the optimal way is to do for example change f1 to for example

def f1():
    a = Integer()
    try:
        sig_on()
        mutate(a.value)
        sig_off()
    except:
        a.value = None  # avoid calling mpz_clear(a.value) when a is deleted
        raise

but since there's a lot of usages of Integer in SageMath code base this may be more difficult than it sounds.

Of course in the long term we prefer this solution… but it is really hard to make sure it is bug-free while doing the migration.


Alternatively (potentially faster than the above solution because it avoids exception handler, and it's also shorter):

def f1():
    a = Integer()
    sig_on()
    cdef mpz a1 = a.value
    a.value = None  # if interrupted here, a.value is None so destructor is harmless
    mutate(a1)
    a.value = a1
    sig_off()

or put into helper function (then there's only 2 extra lines):

cdef take_mpz(a: Integer):
    """
    This function must be used before calling a mpz function that might be interrupted
    """
    cdef mpz a1 = a.value
    a.value = None
    return a1

def f1():
    a = Integer()
    sig_on()
    cdef mpz a1 = take_mpz(a)
    mutate(a1)
    a.value = a1
    sig_off()

@tornaria
Copy link
Contributor

I understand what you mean. But how are we sure that every integer that was used since sig_on() will be deallocated before the exception is handled? This is what I meant by "scope" before: when should sig_occurred() stop reporting an exception has happened?

def f1():
    a = Integer()
    sig_on()
    mutate(a)
    sig_off()

In this case, I think cython will ensure a is deallocated before the function ends. But what if a came from outside? For example:

def f2(a):
    sig_on()
    mutate(a)
    sig_off()

def f1():
    a = Integer()
    try:
        f2(a)
    except:
        pass

@user202729
Copy link
Contributor Author

  • See the original issue and how difficult it was to debug it Inconsistent mpz_t state after interrupted sig_realloc() sage#24986
  • SageMath Integer objects are supposed to be immutable. Any function that mutates Integer objects must be an internal function (and the Integer object likely something it has just created anew), so your case is excessively unlikely.
  • Yes indeed, in your case the sig_occurred() fails; but according to the original issue above, the sig_occurred() is at most an ugly patch, and they just try their best effort there.

tl;dr… I guess we can just go with implementing the proper solution if you want. But:

@user202729
Copy link
Contributor Author

The first bottleneck is there are so many occurrences of it. This is a conservative estimate (though in some cases the function call does not actually mutate the first argument):

[src]$ rg '\bmpz\w*\(\w*\.value' | wc -l
748

@user202729
Copy link
Contributor Author

user202729 commented Jan 24, 2025

I'm trying to find ways to reproduce the issue by e.g. adding delay at random places in the memory allocation function.

I was thinking if it's possible to at least reproduce the issues on other systems, it is more likely that an incorrect implementation will raise an error.

proof of concept:

a.zip

apply the patch, then run sage -python b.py. On my machine it causes the following issue

0
0
0
0
0
0
0
0
0
1 0x60c6cc5fa5b0
free(): double free detected in tcache 2
Traceback (most recent call last):
  File "/tmp/b.py", line 14, in <module>
    Integer(2)**(10**8+randint(1,1000))
    ~~~~~~~~~~^^~~~~~~~~~~~~~~~~~~~~~~~
  File "integer.pyx", line 2199, in sage.rings.integer.Integer.__pow__
  File "coerce.pyx", line 1230, in sage.structure.coerce.CoercionModel.bin_op
  File "coerce_actions.pyx", line 925, in sage.structure.coerce_actions.IntegerPowAction._act_
  File "integer.pyx", line 2293, in sage.rings.integer.Integer._pow_long
RuntimeError: Aborted
free(): double free detected in tcache 2
------------------------------------------------------------------------
/home/user202729/miniforge3/envs/sage-dev/lib/python3.11/site-packages/cysignals/signals.cpython-311-x86_64-linux-gnu.so(+0x795d) [0x762428ee395d]
/home/user202729/miniforge3/envs/sage-dev/lib/python3.11/site-packages/cysignals/signals.cpython-311-x86_64-linux-gnu.so(+0x7a1b) [0x762428ee3a1b]
/home/user202729/miniforge3/envs/sage-dev/lib/python3.11/site-packages/cysignals/signals.cpython-311-x86_64-linux-gnu.so(+0x9f84) [0x762428ee5f84]
/usr/lib/libc.so.6(+0x3d1d0) [0x7624295921d0]
/usr/lib/libc.so.6(+0x963f4) [0x7624295eb3f4]
/usr/lib/libc.so.6(gsignal+0x20) [0x762429592120]
/usr/lib/libc.so.6(abort+0xdf) [0x7624295794c3]
/usr/lib/libc.so.6(+0x25354) [0x76242957a354]
/usr/lib/libc.so.6(+0xa0765) [0x7624295f5765]
/usr/lib/libc.so.6(+0xa2e1f) [0x7624295f7e1f]
/usr/lib/libc.so.6(__libc_free+0x7e) [0x7624295fa5ce]
/home/user202729/sage/build/cp311/src/sage/ext/memory.cpython-311-x86_64-linux-gnu.so(+0x4dda) [0x762428ef9dda]
/home/user202729/sage/build/cp311/src/sage/rings/integer.cpython-311-x86_64-linux-gnu.so(+0x16cd9) [0x762428841cd9]
python3(+0x1db6de) [0x60c6aa5ea6de]
python3(+0x2c6223) [0x60c6aa6d5223]
python3(Py_FinalizeEx+0x51) [0x60c6aa6c6191]
python3(Py_RunMain+0x18d) [0x60c6aa6d203d]
python3(Py_BytesMain+0x37) [0x60c6aa699617]
/usr/lib/libc.so.6(+0x25e08) [0x76242957ae08]
/usr/lib/libc.so.6(__libc_start_main+0x8c) [0x76242957aecc]
python3(+0x28a4ca) [0x60c6aa6994ca]
------------------------------------------------------------------------
Alarm clock

strangely enough neither export MALLOC_CHECK_=0 nor

cdef extern from "<malloc.h>":
    int mallopt(int param, int value) nogil
    const int M_CHECK_ACTION

mallopt(M_CHECK_ACTION, 0)

disables the check. But I guess it's good either way.

@user202729
Copy link
Contributor Author

Looks like something goes wrong in Python 3.12?

2025-01-22T07:43:04.4747065Z **********************************************************************
2025-01-22T07:43:04.4747589Z File "src/sage/structure/coerce_actions.pyx", line 812, in sage.structure.coerce_actions.IntegerMulAction._act_
2025-01-22T07:43:04.4747702Z Failed example:
2025-01-22T07:43:04.4747829Z     print_sig_occurred()
2025-01-22T07:43:04.4747935Z Expected:
2025-01-22T07:43:04.4748058Z     No current exception
2025-01-22T07:43:04.4748162Z Got:
2025-01-22T07:43:04.4748278Z     AlarmInterrupt
2025-01-22T07:43:04.4748414Z **********************************************************************

( sagemath/sage#39142 )

Or it could be issue with only that pull request, I don't know. Anyway it is evident that something has changed between Python 3.11 and Python 3.12 related to the reference counting.

@user202729
Copy link
Contributor Author

user202729 commented Jan 24, 2025

It's worse than I thought…

sys.last_type
sys.last_value
sys.last_traceback
These three variables are deprecated; use sys.last_exc instead. They hold the legacy representation of sys.last_exc, as returned from exc_info() above.

https://docs.python.org/3/library/sys.html

sys.last_exc
This variable is not always defined; it is set to the exception instance when an exception is not handled and the interpreter prints an error message and a stack traceback. Its intended use is to allow an interactive user to import a debugger module and engage in post-mortem debugging without having to re-execute the command that caused the error. (Typical use is import pdb; pdb.pm() to enter the post-mortem debugger; see pdb module for more information.)

https://docs.python.org/3/library/sys.html#sys.last_exc

so if you catch the exception (without ever re-raising it) then that mechanism fails.

Looks like it somehow gets worse in Python 3.12 sagemath/sage#39142 (coincidentally sys.last_exc is added at exactly this version)

I wonder if ipython/ipython#14684 is the cause.


I wonder why flint functions seem to be immune to this.

##
%%cython
from libc.stdint cimport uintptr_t
from libc.stdio cimport printf
from posix.unistd cimport usleep, useconds_t

cdef extern from "flint/flint.h":
    void __flint_set_memory_functions (
        void *(*alloc_func) (size_t),
        void *(*calloc_func) (size_t, size_t),
        void *(*realloc_func) (void *, size_t),
        void (*free_func) (void *)
    )

    void __flint_get_memory_functions (
        void *(**alloc_func) (size_t),
        void *(**calloc_func) (size_t, size_t),
        void *(**realloc_func) (void *, size_t),
        void (**free_func) (void *)
    )

cdef class FlintMemoryFunctions:
    cdef void *(*alloc_func) (size_t) noexcept
    cdef void *(*calloc_func) (size_t, size_t) noexcept
    cdef void *(*realloc_func) (void *, size_t) noexcept
    cdef void (*free_func) (void *) noexcept

    def set_from_flint_current(self):
        __flint_get_memory_functions(&self.alloc_func, &self.calloc_func, &self.realloc_func, &self.free_func)

    def set_to_flint(self):
        __flint_set_memory_functions(self.alloc_func, self.calloc_func, self.realloc_func, self.free_func)

    def __str__(self):
        return f"<FlintMemoryFunctions: alloc_func={<uintptr_t>self.alloc_func:#x}, calloc_func={<uintptr_t>self.calloc_func:#x}, realloc_func={<uintptr_t>self.realloc_func:#x}, free_func={<uintptr_t>self.free_func:#x}>"

    def __repr__(self):
        return str(self)

# ======== wrapper alloc functions to print debug ========

cdef FlintMemoryFunctions backup = None
cdef useconds_t alloc_sleep = 0, calloc_sleep = 0, realloc_sleep = 0, free_sleep = 0

cdef void *new_alloc_func(size_t size) noexcept:
    printf("alloc %d\n", <int>size)
    usleep(alloc_sleep)
    return backup.alloc_func(size)

cdef void *new_calloc_func(size_t nmemb, size_t size) noexcept:
    printf("calloc %d %d\n", <int>nmemb, <int>size)
    usleep(calloc_sleep)
    return backup.calloc_func(nmemb, size)

cdef void *new_realloc_func(void *ptr, size_t size) noexcept:
    printf("realloc %p %d\n", ptr, <int>size)
    usleep(realloc_sleep)
    return backup.realloc_func(ptr, size)

cdef void new_free_func(void *ptr) noexcept:
    printf("free %p\n", ptr)
    usleep(free_sleep)
    backup.free_func(ptr)

def enable_debug():
    global backup
    assert backup is None
    backup = FlintMemoryFunctions()
    backup.set_from_flint_current()
    __flint_set_memory_functions(new_alloc_func, new_calloc_func, new_realloc_func, new_free_func)

def set_sleep(useconds_t alloc=0, useconds_t calloc=0, useconds_t realloc=0, useconds_t free=0):
    global alloc_sleep, calloc_sleep, realloc_sleep, free_sleep
    alloc_sleep = alloc
    calloc_sleep = calloc
    realloc_sleep = realloc
    free_sleep = free

def disable_debug():
    global backup
    assert backup is not None
    backup.set_to_flint()
    backup = None

##

enable_debug()
disable_debug()

##

set_sleep(calloc=1000000)
set_sleep()

##

R.<x> = QQ[]

a = R([1, 2, 3])
a = R([1, 2, 3, 4, 5])
del a

##

def f():
    set_sleep()
    a = matrix.zero(ZZ, 100)
    b = matrix.zero(ZZ, 100)
    set_sleep(calloc=1000000)
    alarm(0.5)
    c = a + b
    cancel_alarm()

f()

import gc
gc.collect()

## ======== test ========

from sage.libs.flint.flint cimport flint_malloc, flint_realloc, flint_calloc, flint_free
from libc.stdio cimport printf
printf("%p %p %p %p\n", flint_malloc, flint_calloc, flint_realloc, flint_free)


printf("%p %p %p %p\n", alloc_func, calloc_func, realloc_func, free_func)

from libc.stdlib cimport malloc, calloc, realloc, free
printf("%p %p %p %p\n", malloc, calloc, realloc, free)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants