Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LRUCache.__copy__ is unsound #3852

Closed
asottile-sentry opened this issue Dec 4, 2024 · 18 comments · Fixed by #3883
Closed

LRUCache.__copy__ is unsound #3852

asottile-sentry opened this issue Dec 4, 2024 · 18 comments · Fixed by #3883
Assignees
Labels
Component: SDK Core Dealing with the core of the SDK

Comments

@asottile-sentry
Copy link
Member

How do you use Sentry?

Sentry Saas (sentry.io)

Version

latest git revision

Steps to Reproduce

unfortunately I haven't completely narrowed down a "minimal" reproduction but I originally found this from reading the implementation and tracing the output -- attached is a script and data which produces a crash due to a particular ordering of copys and sets.

the root cause of the problem is that __copy__ is incorrect as written -- it shallow copies the internal datastructures which incorrectly pollutes newly created caches

in the ~best case this produces incorrect values poisoning the parent copy:

>>> cache1 = LRUCache(max_size=2)
>>> cache1.set(1, True)
>>> cache2 = cache1.__copy__()
>>> cache2.set(1, False)
>>> cache1.get(1)
False

at worst case it causes crashes (see attached data)

from sentry_sdk._lru_cache import LRUCache


def main() -> int:
    caches = {}

    with open('lru_cache.log') as f:
        for line in f:
            line = line.replace(': ', ' ')

            match line.split():
                case [cid, '__init__']:
                    caches[cid] = LRUCache(max_size=100)
                case [cid, 'set', k, *v]:
                    caches[cid].set(k, ' '.join(v))
                case [cid, '__copy__', '->', cid2]:
                    caches[cid2] = caches[cid].__copy__()
                case unreachable:
                    raise AssertionError(unreachable)


if __name__ == '__main__':
    raise SystemExit(main())

lru_cache.log

cc @cmanallen -- introduced in dd1117d

Expected Result

should not crash, should not have internal data integrity problems

Actual Result

the script eventually crashes with:

$ ./venvnew/bin/python3 t3.py 
Traceback (most recent call last):
  File "/Users/asottile/workspace/sentry/t3.py", line 23, in <module>
    raise SystemExit(main())
                     ^^^^^^
  File "/Users/asottile/workspace/sentry/t3.py", line 15, in main
    caches[cid].set(k, ' '.join(v))
  File "/Users/asottile/workspace/sentry/venvnew/lib/python3.12/site-packages/sentry_sdk/_lru_cache.py", line 132, in set
    del self.cache[old_key]
        ~~~~~~~~~~^^^^^^^^^
KeyError: 'organizations:continuous-profiling-beta'
@asottile-sentry
Copy link
Member Author

fwiw the original trace was generated via this diff:

$ diff -u {venvnew,.venv}/lib/python3.12/site-packages/sentry_sdk/_lru_cache.py 
--- venvnew/lib/python3.12/site-packages/sentry_sdk/_lru_cache.py	2024-12-04 17:17:44
+++ .venv/lib/python3.12/site-packages/sentry_sdk/_lru_cache.py	2024-12-04 17:16:18
@@ -62,6 +62,7 @@
 
 """
 
+from uuid import uuid4
 from copy import copy
 
 SENTINEL = object()
@@ -74,8 +75,16 @@
 VALUE = 3
 
 
+
 class LRUCache:
-    def __init__(self, max_size):
+    def _log(self, m):
+        with open('lru_cache.log', 'a+') as f:
+            f.write(f'{self._u}: {m}\n')
+
+    def __init__(self, max_size, _log=True):
+        self._u = uuid4()
+        if _log:
+            self._log('__init__')
         assert max_size > 0
 
         self.max_size = max_size
@@ -92,13 +101,15 @@
         self.hits = self.misses = 0
 
     def __copy__(self):
-        cache = LRUCache(self.max_size)
+        cache = LRUCache(self.max_size, _log=False)
+        self._log(f'__copy__ -> {cache._u}')
         cache.full = self.full
         cache.cache = copy(self.cache)
         cache.root = copy(self.root)
         return cache
 
     def set(self, key, value):
+        self._log(f'set {key} {value}')
         link = self.cache.get(key, SENTINEL)
 
         if link is not SENTINEL:
@@ -129,7 +140,11 @@
 
             self.root[KEY] = self.root[VALUE] = None
 
-            del self.cache[old_key]
+            try:
+                del self.cache[old_key]
+            except KeyError:
+                self._log('crash!!!')
+                raise
 
             self.cache[key] = old_root
 
@@ -141,6 +156,7 @@
             self.full = len(self.cache) >= self.max_size
 
     def get(self, key, default=None):
+        self._log(f'get {key}')
         link = self.cache.get(key, SENTINEL)
 
         if link is SENTINEL:

and then running the first 504 tests of shard 7 in the getsentry/sentry testsuite

@asottile-sentry
Copy link
Member Author

asottile-sentry commented Dec 4, 2024

assuming python3.6+ only you may be able to utilize builtin dict as the lru type:

_SENTINEL = object()

class LRUCache:
    def __init__(self, max_size: int):
        if max_size <= 0:
            raise AssertionError(f'invalid max_size: {max_size}')
        self.max_size = max_size
        self._data = {}
        self.hits = self.misses = 0
        self.full = False

    def __copy__(self):
        new = LRUCache(max_size=self.max_size)
        new.hits = self.hits
        new.misses = self.misses
        new.full = self.full
        new._data = self._data.copy()
        return new

    def set(self, key, value):
        current = self._data.pop(key, _SENTINEL)
        if current is not _SENTINEL:
            self._data[key] = value
        elif self.full:
            self._data.pop(next(iter(self._data)))
            self._data[key] = value
        else:
            self._data[key] = value
        self.full = len(self._data) >= self.max_size

    def get(self, key, default=None):
        try:
            ret = self._data.pop(key)
        except KeyError:
            self.misses += 1
            ret = default
        else:
            self.hits += 1
            self._data[key] = ret

        return ret

    def get_all(self):
        return list(self._data.items())

(this passes the current set of tests -- though I'd argue that the AssertionError should actually be ValueError)

@antonpirker antonpirker added Type: Bug Component: SDK Core Dealing with the core of the SDK labels Dec 5, 2024
@sl0thentr0py
Copy link
Member

cc @Zylphrex @cmanallen

@antonpirker
Copy link
Member

This is related to this: #3862

I will make a patch release with the other fix because this feature is going open beta today.

Then we can see the suggested solution above to maybe implement it as the long term fix of lru cache.

@cmanallen
Copy link
Member

Thanks for finding this @asottile-sentry. Sorry I didn't see it sooner. It was lost in my GitHub notifications.

@cmanallen
Copy link
Member

Fixed by: #3861

@asottile-sentry
Copy link
Member Author

I think there's probably still another bug of a similar flavor since self.cache is shallow copied but the backing data (self.root) is deep copied (meaning the cache has references to the original object's data)

@asottile-sentry
Copy link
Member Author

yeah in fact the reproduction in this issue still happens

@cmanallen
Copy link
Member

@asottile-sentry Ah, sorry I was on a work trip last week and didn't see this. I've removed all the bogus copy dunder methods and I'm just deepcopying the buffer on scope fork. Really can't remember why I used the dunder methods in the first place. Thanks for re-opening and testing this. I've included your reproduction steps as part of the test coverage so this should hopefully be resolved. Of course let me know if you find anything else and I will do my best to fix it.

In regards to your new LRU proposal I think it looks great. I'm in favor of the change. But I'm not the maintainer of this repo and, considering my track record on this bug, I'm probably not the right person for the job.

@ffelixg
Copy link
Contributor

ffelixg commented Dec 18, 2024

I don't think deepcopying the cache/root is the way to go here, since we don't really want to copy the keys/values for example.

Copying is also what caused #3862 because the root attribute and root object that is referenced by the circular list is no longer the same object after copying.

Instead I propose to just walk the list and create new root / cache structures along the way, see my PR. This also makes the previously introduced cap on iterations inside get_all unnecessary.

The issue inside the sentry CI also seems to go away:
before https://github.com/ffelixg/sentry/actions/runs/12401448531/job/34620865765
after https://github.com/ffelixg/sentry/actions/runs/12402829105/job/34625165848

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Dec 18, 2024
@antonpirker antonpirker self-assigned this Dec 19, 2024
@antonpirker
Copy link
Member

Hey @ffelixg thanks for that PR. I have added some tests and also changed the get_all() back, so we can have a PR that includes the whole fix.

@antonpirker
Copy link
Member

antonpirker commented Dec 19, 2024

@asottile-sentry and @cmanallen the fix from @ffelixg looks sound to me. As this is tricky code, I would love to have your reviews as well, please have a look at this: #3883

@asottile-sentry
Copy link
Member Author

why are we so resistant to using a much simpler and likely much more performant lru implementation

@cmanallen
Copy link
Member

@asottile-sentry PRs welcome.

@antonpirker @ffelixg Thank you Felix for the contribution! I've left some review comments. tl;dr I think its the wrong approach but the changes I suggested are small. I have a bias to my PR but I believe I was objective in my review.

@cmanallen
Copy link
Member

@asottile-sentry "PRs welcome" might come across as snarky 🙈 my bad. I support your approach. If you decide to open a PR I'll review it for you!

@ffelixg
Copy link
Contributor

ffelixg commented Dec 19, 2024

@asottile-sentry I also think your implementation is better. I just found the issue in the original implementation interesting after watching your youtube video. 🙂

I'm not sure if this lru_cache is used externally or just used internally. If it's just internal and internal usage is fine with a deep instead of a shallow copy then my PR is a bit pointless, since deepcopy seems to work out of the box.

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Dec 19, 2024
@antonpirker
Copy link
Member

why are we so resistant to using a much simpler and likely much more performant lru implementation

Mostly because it is a comment and not a PR 😅

@antonpirker
Copy link
Member

But you are right, your solution is way simpler. I have plugged it into @ffelixg PR and will merge it.

Thanks everyone for your time.

antonpirker added a commit that referenced this issue Dec 20, 2024
A simpler and better LRU Cache implementation that prevents data leaking between copied caches.

Fixes #3852

---------

Co-authored-by: Anton Pirker <[email protected]>
antonpirker added a commit that referenced this issue Dec 23, 2024
- Removes manual overrides of copy behavior and leaves it up to the caller.
    - E.g. a future use case may require a non-deepcopy. If we override copy they would have to remove the dunder copy, update every implementation which relies copy, before finally creating their own copy implementation.
- Deepcopies the flag buffer.
    - Though we do not cache mutable references yet we may soon and so this foot gun should be removed from possibility.
- Removes "copy" test coverage from `test_lru_cache.py`. We're no longer assuming copy usage and leave it up to the caller.
    - The existing test in `tests/test_scope.py` covers the cache pollution case [originally mentioned here](#3852).
    - The mutable cache pollution case is not covered because we do not currently cache mutable objects.

In general a generic class should assume as few implementation details as possible.  If we leave the existing copy method someone may assume copy semantics and rely on it in a way that is inappropriate.

Closes: #3886

Co-authored-by: Anton Pirker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: SDK Core Dealing with the core of the SDK
Projects
Archived in project
5 participants