-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix join being denied after being invited over federation #18075
base: develop
Are you sure you want to change the base?
Fix join being denied after being invited over federation #18075
Conversation
Reproduction test for element-hq/synapse#18075
Reproduction test for element-hq/synapse#18075
# You can only join the room if you are invited or are already in the room. | ||
if not (caller_in_room or caller_invited): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is equivalent to before but I found this easier to grok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the same via De Morgan's theorem.
This reverts commit a32c1ba.
This is just normal for how someone finds out about an invite over federation See #18075 (comment)
…ready participating in Regression tests for element-hq/synapse#18075
@@ -566,6 +566,7 @@ def _is_membership_change_allowed( | |||
logger.debug( | |||
"_is_membership_change_allowed: %s", | |||
{ | |||
"caller_membership": caller.membership if caller else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just makes it easier to debug
conditional_membership_string = "" | ||
if self.get("type") == EventTypes.Member: | ||
conditional_membership_string = f"membership={self.membership}, " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this little extra piece to our string representation of membership events to get more context when debugging
def is_mine_id(self, string: str) -> bool: | ||
"""Determines whether a user ID or room alias originates from this homeserver. | ||
|
||
Returns: | ||
`True` if the hostname part of the user ID or room alias matches this | ||
homeserver. | ||
`False` otherwise, or if the user ID or room alias is malformed. | ||
""" | ||
localpart_hostname = string.split(":", 1) | ||
if len(localpart_hostname) < 2: | ||
return False | ||
return localpart_hostname[1] == self._hostname |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, we would access a hs.is_mind_id(...)
but we don't have easy access to hs
here. Better way than this?
# After persistence, we always need to notify replication there may be new | ||
# data (backfilled or not) because TODO. | ||
self._notifier.notify_replication() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For non-backfilled events, we already call _notify_persisted_event
(just below) -> on_new_room_events
-> notify_new_room_events
-> notify_replication
Essentially, I want to fill in the context here: We never notify clients about backfilled events but it's important to let all the workers know about any new event (backfilled or not) because TODO
logger = logging.getLogger(__name__) | ||
|
||
|
||
class DeviceListResyncTestCase(unittest.HomeserverTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved these tests from tests/test_federation.py
) | ||
|
||
|
||
class MessageAcceptTests(unittest.FederatingHomeserverTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved these tests from tests/test_federation.py
and cleaned them up to use FederatingHomeserverTestCase
"type": "m.room.server_acls", | ||
"sender": "@a:b", | ||
"content": content, | ||
class StripUnsignedFromEventsTestCase(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved these tests from tests/test_federation.py
FIXME: This implementation is not robust against other code tight-looping and | ||
preventing the signals propagating and timing out the test. You may need to add | ||
`time.sleep(0.1)` to your code in order to allow this timeout to work correctly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the code was tight-looping I was seeing these problems:
- The
TestTimeout
exception would sometimes be masked by an error within block. This happened when everything was torn down but the request was still in-flight and came back with a 500 which was shown instead of the root cause timeout that occurred. - With heavy workloads, it's possible that it doesn't let Python check for signals and the code never times out (experienced this with my
while True:
usage).
The code for this implementation seems to be from https://stackoverflow.com/questions/34743448/how-to-specify-test-timeout-for-python-unittest/49567288#49567288
I tried playing around with an alternative implementation that uses threads but I couldn't make it work completely as it kept the test process from exiting.
Alternative test_timeout
implementation using threads
class TestTimeout(Exception):
pass
class test_timeout:
def __init__(self, seconds: float, error_message: Optional[str] = None):
self.seconds = seconds
self.error_message = error_message = (
f"Test timed out after {seconds}s: {error_message}"
)
self._timer_thread = None
self._timeout_occurred = False
def _timeout_handler(self):
self._timeout_occurred = True
raise TestTimeout(self.error_message)
def __enter__(self) -> None:
self._timeout_occurred = False
self._timer_thread = threading.Timer(self.seconds, self._timeout_handler)
# Make sure the timer thread doesn't prevent the process from exiting
self._timer_thread.daemon = True
self._timer_thread.start()
return self
def __exit__(
self,
exc_type: Optional[Type[BaseException]],
exc_val: Optional[BaseException],
exc_tb: Optional[TracebackType],
):
self._timer_thread.cancel()
if exc_type is TestTimeout:
# Propagate the timeout exception
return False
# Suppress all other exceptions
return not self._timeout_occurred
</details>
But seems plausible
state_map: StateMap[EventBase] | ||
|
||
|
||
class OutOfBandMembershipTests(unittest.FederatingHomeserverTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the main tests that stress the scenarios that this PR is trying to fix. I decided to add these trial
tests (in addition to the Complement tests) because we don't seem to have any real coverage for the remote invite scenarios within the Synapse codebase.
These trial
tests are also much faster to run and allow us to exactly control how the remote server is interacting with us (and timing).
We're now better at rejecting this
verify_key_id: {"key": encode_verify_key_base64(verify_key)} | ||
} | ||
}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing this to use something more real as it was failing before because hostname
didn't match the actual hs.hostname
(mismatch in is_mine_id
)
from synapse.util.retryutils import NotRetryingDestination | ||
|
||
from tests import unittest | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved all of these tests to something under tests/federation/...
Fix join being denied after being invited over federation. This also happens for rejecting an invite. Basically, any out-of-band membership transition where we first get the membership as an
outlier
and then rely on federation filling us in to de-outlier it.This PR mainly addresses automated test flakiness, bots/scripts, and options within Synapse like
auto_accept_invites
that are able to react quickly (before federation is able to push us events), but also helps in generic scenarios where federation is lagging.I initially thought this might be a Synapse consistency issue (see issues labeled with
Z-Read-After-Write
) but it seems to be an event auth logic problem. Workers probably do increase the number of possible race condition scenarios that make this visible though (replication and cache invalidation lag).Probably fixes matrix-org/synapse#15012 (#15012)
Fix #15012
Problems:
event_auth
logic even though we expose them in/sync
.What happened before?
I wrote some Complement test that stresses this exact scenario and reproduces the problem: matrix-org/complement#757
We have
hs1
andhs2
running in monolith mode (no workers):@charlie1:hs2
is invited and joins the room:hs1
invites@charlie1:hs2
to a room which we receive onhs2
asPUT /_matrix/federation/v1/invite/{roomId}/{eventId}
(on_invite_request(...)
) and the invite membership is persisted as an outlier. Theroom_memberships
andlocal_current_membership
database tables are also updated which means they are visible down/sync
at this point.@charlie1:hs2
decides to join because it saw the invite down/sync
. Becausehs2
is not yet in the room, this happens as a remote joinmake_join
/send_join
which comes back with all of the auth events needed to auth successfully and now@charlie1:hs2
is successfully joined to the room.@charlie2:hs2
is invited and and tries to join the room:hs1
invites@charlie2:hs2
to the room which we receive onhs2
asPUT /_matrix/federation/v1/invite/{roomId}/{eventId}
(on_invite_request(...)
) and the invite membership is persisted as an outlier. Theroom_memberships
andlocal_current_membership
database tables are also updated which means they are visible down/sync
at this point.hs2
is already participating in the room, we also see the invite come over federation in a transaction and we start processing it (not done yet, see below)@charlie2:hs2
decides to join because it saw the invite down/sync
. Becausehs2
, is already in the room, this happens as a local join but we deny the event because ourevent_auth
logic thinks that we have no membership in the room ❌ (expected to be able to join because we saw the invite down `/sync)@charlie2:hs2
invite event from and de-outlier it.Logs for
hs2
:Dev notes
Other unrelated but semi-related races:
send_join
races with local users sending messages #17720Running tests
Pull Request Checklist
EventStore
toEventWorkerStore
.".code blocks
.(run the linters)