Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix join being denied after being invited over federation #18075

Open
wants to merge 40 commits into
base: develop
Choose a base branch
from

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Jan 10, 2025

Fix join being denied after being invited over federation. This also happens for rejecting an invite. Basically, any out-of-band membership transition where we first get the membership as an outlier and then rely on federation filling us in to de-outlier it.

This PR mainly addresses automated test flakiness, bots/scripts, and options within Synapse like auto_accept_invites that are able to react quickly (before federation is able to push us events), but also helps in generic scenarios where federation is lagging.

I initially thought this might be a Synapse consistency issue (see issues labeled with Z-Read-After-Write) but it seems to be an event auth logic problem. Workers probably do increase the number of possible race condition scenarios that make this visible though (replication and cache invalidation lag).

Probably fixes matrix-org/synapse#15012 (#15012)
Fix #15012

Problems:

  1. We don't consider out-of-band membership (outliers) in our event_auth logic even though we expose them in /sync.
  2. (This PR doesn't address this point) Perhaps we should consider authing events in the persistence queue as events already in the queue could allow subsequent events to be allowed (events come through many channels: federation transaction, remote invite, remote join, local send). But this doesn't save us in the case where the event is more delayed over federation.

What happened before?

I wrote some Complement test that stresses this exact scenario and reproduces the problem: matrix-org/complement#757

COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestSynapseConsistency

We have hs1 and hs2 running in monolith mode (no workers):

  1. @charlie1:hs2 is invited and joins the room:
    1. hs1 invites @charlie1:hs2 to a room which we receive on hs2 as PUT /_matrix/federation/v1/invite/{roomId}/{eventId} (on_invite_request(...)) and the invite membership is persisted as an outlier. The room_memberships and local_current_membership database tables are also updated which means they are visible down /sync at this point.
    2. @charlie1:hs2 decides to join because it saw the invite down /sync. Because hs2 is not yet in the room, this happens as a remote join make_join/send_join which comes back with all of the auth events needed to auth successfully and now @charlie1:hs2 is successfully joined to the room.
  2. @charlie2:hs2 is invited and and tries to join the room:
    1. hs1 invites @charlie2:hs2 to the room which we receive on hs2 as PUT /_matrix/federation/v1/invite/{roomId}/{eventId} (on_invite_request(...)) and the invite membership is persisted as an outlier. The room_memberships and local_current_membership database tables are also updated which means they are visible down /sync at this point.
    2. Because hs2 is already participating in the room, we also see the invite come over federation in a transaction and we start processing it (not done yet, see below)
    3. @charlie2:hs2 decides to join because it saw the invite down /sync. Because hs2, is already in the room, this happens as a local join but we deny the event because our event_auth logic thinks that we have no membership in the room ❌ (expected to be able to join because we saw the invite down `/sync)
    4. We finally finish processing the @charlie2:hs2 invite event from and de-outlier it.
      • If this finished before we tried to join we would have been fine but this is the race condition that makes this situation visible.

Logs for hs2:

🗳️ on_invite_request: handling event <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=False>
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
🔦 _store_room_members_txn update local_current_membership: <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
📨 Notifying about new event <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
✅ on_invite_request: handled event <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
🧲 do_invite_join for @user-2-charlie1:hs2 in !sfZVBdLUezpPWetrol:hs1
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$bwv8LxFnqfpsw_rhR7OrTjtz09gaJ23MqstKOcs7ygA, type=m.room.member, state_key=@user-1-alice:hs1, membership=join, outlier=True>
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$oju1ts3G3pz5O62IesrxX5is4LxAwU3WPr4xvid5ijI, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=join, outlier=False>
📨 Notifying about new event <FrozenEventV3 event_id=$oju1ts3G3pz5O62IesrxX5is4LxAwU3WPr4xvid5ijI, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=join, outlier=False>

...

🗳️ on_invite_request: handling event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
🔦 _store_room_members_txn update local_current_membership: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
📨 Notifying about new event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
✅ on_invite_request: handled event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
📬 handling received PDU in room !sfZVBdLUezpPWetrol:hs1: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
📮 handle_new_client_event: handling <FrozenEventV3 event_id=$WNVDTQrxy5tCdPQHMyHyIn7tE4NWqKsZ8Bn8R4WbBSA, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=join, outlier=False>
❌ Denying new event <FrozenEventV3 event_id=$WNVDTQrxy5tCdPQHMyHyIn7tE4NWqKsZ8Bn8R4WbBSA, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=join, outlier=False> because 403: You are not invited to this room.
synapse.http.server - 130 - INFO - POST-16 - <SynapseRequest at 0x7f460c91fbf0 method='POST' uri='/_matrix/client/v3/join/%21sfZVBdLUezpPWetrol:hs1?server_name=hs1' clientproto='HTTP/1.0' site='8080'> SynapseError: 403 - You are not invited to this room.
📨 Notifying about new event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
✅ handled received PDU in room !sfZVBdLUezpPWetrol:hs1: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>

Dev notes

Other unrelated but semi-related races:


handle_new_client_event
_persist_events
persist_and_notify_client_events
	persist_events
	notifier.on_new_room_events(
_process_incoming_pdus_in_room_inner
on_receive_pdu
_process_received_pdu
_run_push_actions_and_persist_event
persist_events_and_notify
	_send_events
		ReplicationFederationSendEventsRestServlet._handle_request
		persist_events_and_notify
	persist_events
		enqueue
		_persist_event_batch
		_persist_events_and_state_updates
		_persist_events_txn
		_update_metadata_tables_txn
		_store_room_members_txn
	_notify_persisted_event
		_notifier.on_new_room_events(
on_invite_request
persist_events_and_notify
update_membership_locked
	_remote_join
		do_invite_join
		process_remote_join
		persist_events_and_notify
	_local_membership_update
_notifier.on_new_room_events(
notify_new_room_events
	_notify_pending_new_room_events
		on_new_event
	notify_replication

handle_new_client_event
check_auth_rules_from_context
check_state_dependent_auth_rules
_is_membership_change_allowed

Running tests

SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.federation.test_federation_out_of_band_membership.OutOfBandMembershipTests
COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestFederationRoomsInvite

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct
    (run the linters)

MadLittleMods added a commit to MadLittleMods/fork-complement that referenced this pull request Jan 10, 2025
MadLittleMods added a commit to MadLittleMods/fork-complement that referenced this pull request Jan 10, 2025
@MadLittleMods MadLittleMods changed the title Fix Synapse consistency issue with join being denied after being invited over federation Fix join being denied after being invited over federation Jan 10, 2025
Comment on lines +681 to +682
# You can only join the room if you are invited or are already in the room.
if not (caller_in_room or caller_invited):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is equivalent to before but I found this easier to grok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the same via De Morgan's theorem.

@@ -566,6 +566,7 @@ def _is_membership_change_allowed(
logger.debug(
"_is_membership_change_allowed: %s",
{
"caller_membership": caller.membership if caller else None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just makes it easier to debug

Comment on lines +328 to +330
conditional_membership_string = ""
if self.get("type") == EventTypes.Member:
conditional_membership_string = f"membership={self.membership}, "
Copy link
Contributor Author

@MadLittleMods MadLittleMods Jan 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this little extra piece to our string representation of membership events to get more context when debugging

Comment on lines +112 to +123
def is_mine_id(self, string: str) -> bool:
"""Determines whether a user ID or room alias originates from this homeserver.

Returns:
`True` if the hostname part of the user ID or room alias matches this
homeserver.
`False` otherwise, or if the user ID or room alias is malformed.
"""
localpart_hostname = string.split(":", 1)
if len(localpart_hostname) < 2:
return False
return localpart_hostname[1] == self._hostname
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, we would access a hs.is_mind_id(...) but we don't have easy access to hs here. Better way than this?

Comment on lines +2275 to 2277
# After persistence, we always need to notify replication there may be new
# data (backfilled or not) because TODO.
self._notifier.notify_replication()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non-backfilled events, we already call _notify_persisted_event (just below) -> on_new_room_events -> notify_new_room_events -> notify_replication

Essentially, I want to fill in the context here: We never notify clients about backfilled events but it's important to let all the workers know about any new event (backfilled or not) because TODO

logger = logging.getLogger(__name__)


class DeviceListResyncTestCase(unittest.HomeserverTestCase):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved these tests from tests/test_federation.py

)


class MessageAcceptTests(unittest.FederatingHomeserverTestCase):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved these tests from tests/test_federation.py and cleaned them up to use FederatingHomeserverTestCase

"type": "m.room.server_acls",
"sender": "@a:b",
"content": content,
class StripUnsignedFromEventsTestCase(unittest.TestCase):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved these tests from tests/test_federation.py

Comment on lines +403 to +405
FIXME: This implementation is not robust against other code tight-looping and
preventing the signals propagating and timing out the test. You may need to add
`time.sleep(0.1)` to your code in order to allow this timeout to work correctly.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the code was tight-looping I was seeing these problems:

  • The TestTimeout exception would sometimes be masked by an error within block. This happened when everything was torn down but the request was still in-flight and came back with a 500 which was shown instead of the root cause timeout that occurred.
  • With heavy workloads, it's possible that it doesn't let Python check for signals and the code never times out (experienced this with my while True: usage).

The code for this implementation seems to be from https://stackoverflow.com/questions/34743448/how-to-specify-test-timeout-for-python-unittest/49567288#49567288

I tried playing around with an alternative implementation that uses threads but I couldn't make it work completely as it kept the test process from exiting.

Alternative test_timeoutimplementation using threads

Based on https://stackoverflow.com/questions/34743448/how-to-specify-test-timeout-for-python-unittest/79251951#79251951

class TestTimeout(Exception):
    pass


class test_timeout:
    def __init__(self, seconds: float, error_message: Optional[str] = None):
        self.seconds = seconds
        self.error_message = error_message = (
            f"Test timed out after {seconds}s: {error_message}"
        )

        self._timer_thread = None
        self._timeout_occurred = False

    def _timeout_handler(self):
        self._timeout_occurred = True
        raise TestTimeout(self.error_message)

    def __enter__(self) -> None:
        self._timeout_occurred = False
        self._timer_thread = threading.Timer(self.seconds, self._timeout_handler)
        # Make sure the timer thread doesn't prevent the process from exiting
        self._timer_thread.daemon = True
        self._timer_thread.start()
        return self

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ):
        self._timer_thread.cancel()
        if exc_type is TestTimeout:
            # Propagate the timeout exception
            return False
        # Suppress all other exceptions
        return not self._timeout_occurred
</details>

state_map: StateMap[EventBase]


class OutOfBandMembershipTests(unittest.FederatingHomeserverTestCase):
Copy link
Contributor Author

@MadLittleMods MadLittleMods Jan 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the main tests that stress the scenarios that this PR is trying to fix. I decided to add these trial tests (in addition to the Complement tests) because we don't seem to have any real coverage for the remote invite scenarios within the Synapse codebase.

These trial tests are also much faster to run and allow us to exactly control how the remote server is interacting with us (and timing).

verify_key_id: {"key": encode_verify_key_base64(verify_key)}
}
},
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing this to use something more real as it was failing before because hostname didn't match the actual hs.hostname (mismatch in is_mine_id)

from synapse.util.retryutils import NotRetryingDestination

from tests import unittest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved all of these tests to something under tests/federation/...

@MadLittleMods MadLittleMods marked this pull request as ready for review January 18, 2025 02:31
@MadLittleMods MadLittleMods requested a review from a team as a code owner January 18, 2025 02:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants