Mitigate a race where /make_join could 403 for restricted rooms #15080
Previously, when creating a join event in /make_join, we would decide whether to include additional fields to satisfy restricted room checks based on the current state of the room. Then, when building the event, we would capture the forward extremities of the room to use as prev events.

This is subject to race conditions. For example, when leaving and rejoining a room, the following sequence of events leads to a misleading 403 response:

1. /make_join reads the current state of the room and sees that the user is still in the room. It decides to omit the field required for restricted room joins.
2. The leave event is persisted and the room's forward extremities are updated.
3. /make_join builds the event, using the post-leave forward extremities. The event then fails the restricted room checks.

To mitigate the race, we move the read of the forward extremities closer to the read of the current state. Ideally, we would compute the state based off the chosen prev events, but that can involve state resolution, which is expensive.

Signed-off-by: Sean Quah <[email protected]>
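The ordering problem described above can be illustrated with a small, deterministic toy model. This is only a sketch: `ToyRoom`, `make_join_old`, `make_join_new` and the `interleave` hook are hypothetical names invented for illustration and are not part of Synapse. The hook stands in for a leave event being persisted concurrently, at the point where the old code had not yet read the forward extremities.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ToyRoom:
    """Hypothetical in-memory stand-in for a room (not Synapse's storage layer)."""

    membership: str = "join"
    forward_extremities: List[str] = field(default_factory=lambda: ["$join"])

    def persist_leave(self) -> None:
        # Persisting the leave updates both the current state and the extremities.
        self.membership = "leave"
        self.forward_extremities = ["$leave"]


def make_join_old(
    room: ToyRoom, interleave: Optional[Callable[[], None]] = None
) -> Tuple[bool, List[str]]:
    """Old ordering: read the current state, then (much later) the prev events."""
    needs_authorisation = room.membership != "join"
    if interleave is not None:
        interleave()  # a concurrent leave lands between the two reads
    prev_events = list(room.forward_extremities)
    return needs_authorisation, prev_events


def make_join_new(
    room: ToyRoom, interleave: Optional[Callable[[], None]] = None
) -> Tuple[bool, List[str]]:
    """New ordering: read the prev events immediately before the current state."""
    prev_events = list(room.forward_extremities)
    needs_authorisation = room.membership != "join"
    if interleave is not None:
        interleave()  # the same concurrent leave now lands after both reads
    return needs_authorisation, prev_events


def is_consistent(needs_authorisation: bool, prev_events: List[str]) -> bool:
    """An event built on top of the leave must carry the restricted-join field."""
    return needs_authorisation or "$leave" not in prev_events
```

With `room = ToyRoom()`, calling `make_join_old(room, room.persist_leave)` yields a join event that omits the restricted-join field but sits on top of the leave, which is the inconsistent combination that produces the 403. `make_join_new` with the same interleaving yields a consistent pair. The toy model cannot show the remaining window between the two adjacent reads in the new ordering; the real change narrows the race rather than removing it.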
state_ids = await self._state_storage_controller.get_current_state_ids(
    room_id
)
I wanted to calculate the state off the prev events, but that looked like it would involve state resolution which is slow.
Is there any harm in always including the field required for restricted room joins? (i.e. would that help mitigate the race too?)
I think this is a good idea, though I would be interested to know what you think of #15080 (comment).
# To reduce the likelihood of this race, we capture the forward extremities
# of the room (prev_event_ids) just before fetching the current state, and
Pedantry: there can be at most 20 prev events in a given event, so in the worst case the prev events are a subset of the forward extremities of the room. I don't think we need to say that here (but I couldn't resist pointing it out now).
That's annoying, since the prev events we feed in won't always be representative of the room state the event is evaluated against.
Though I note that right now we hit an assert if there are more than 10 prev events.
synapse/synapse/handlers/message.py
Lines 1164 to 1169 in 5febf88
if prev_event_ids is not None:
    assert (
        len(prev_event_ids) <= 10
    ), "Attempting to create an event with %i prev_events" % (
        len(prev_event_ids),
    )
Good to know. FWIW The spec says
That's annoying, since the prev events we feed in won't always be representative of the room state the event is evaluated against.
It turns out that get_prev_events_for_room only returns up to 10 forward extremities, so the prev events we feed in to the event builder will be the ones used. I hadn't quite realised that get_prev_events_for_room was subtly different to fetching the forward extremities.
synapse/synapse/storage/databases/main/event_federation.py
Lines 1087 to 1099 in d0c713c
def _get_prev_events_for_room_txn(
    self, txn: LoggingTransaction, room_id: str
) -> List[str]:
    # we just use the 10 newest events. Older events will become
    # prev_events of future events.
    sql = """
        SELECT e.event_id FROM event_forward_extremities AS f
        INNER JOIN events AS e USING (event_id)
        WHERE f.room_id = ?
        ORDER BY e.depth DESC
        LIMIT 10
    """
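As a rough illustration of what the query above selects, here is an in-memory equivalent, assuming the forward extremities are available as a mapping from event ID to depth. `pick_prev_events` and `MAX_PREV_EVENTS` are hypothetical names used only for this sketch; Synapse performs the selection in SQL as quoted above.

```python
from typing import Dict, List

# Mirrors the LIMIT 10 in the quoted query (assumed constant name).
MAX_PREV_EVENTS = 10


def pick_prev_events(
    extremity_depths: Dict[str, int], limit: int = MAX_PREV_EVENTS
) -> List[str]:
    """Return up to `limit` forward extremities, deepest (newest) first."""
    ordered = sorted(
        extremity_depths,
        key=lambda event_id: extremity_depths[event_id],
        reverse=True,  # equivalent to ORDER BY depth DESC
    )
    return ordered[:limit]
```

Given 15 extremities with depths 0 to 14, this returns the 10 deepest and leaves the 5 shallowest to become prev events of later events. That is why the list handed to the event builder stays within the 10-event assertion mentioned earlier in the thread.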
No idea. @clokep might have some thoughts on this?
I think you cannot do that (or at least I wouldn't trust implementations to treat that reasonably).
Sounds like this is good to merge then?
Yep. I'm a little sad that there is still a race window, even if it is much smaller than before.
Synapse 1.78.0 (2023-02-28)
===========================

Bugfixes
--------

- Fix a bug introduced in Synapse 1.76 where 5s delays would occasionally occur in deployments using workers. ([\#15150](matrix-org/synapse#15150))

Synapse 1.78.0rc1 (2023-02-21)
==============================

Features
--------

- Implement the experimental `exact_event_match` push rule condition from [MSC3758](matrix-org/matrix-spec-proposals#3758). ([\#14964](matrix-org/synapse#14964))
- Add account data to the command line [user data export tool](https://matrix-org.github.io/synapse/v1.78/usage/administration/admin_faq.html#how-can-i-export-user-data). ([\#14969](matrix-org/synapse#14969))
- Implement [MSC3873](matrix-org/matrix-spec-proposals#3873) to disambiguate push rule keys with dots in them. ([\#15004](matrix-org/synapse#15004))
- Allow Synapse to use a specific Redis [logical database](https://redis.io/commands/select/) in worker-mode deployments. ([\#15034](matrix-org/synapse#15034))
- Tag opentracing spans for federation requests with the name of the worker serving the request. ([\#15042](matrix-org/synapse#15042))
- Implement the experimental `exact_event_property_contains` push rule condition from [MSC3966](matrix-org/matrix-spec-proposals#3966). ([\#15045](matrix-org/synapse#15045))
- Remove spurious `dont_notify` action from the defaults for the `.m.rule.reaction` pushrule. ([\#15073](matrix-org/synapse#15073))
- Update the error code returned when user sends a duplicate annotation. ([\#15075](matrix-org/synapse#15075))

Bugfixes
--------

- Prevent clients from reporting nonexistent events. ([\#13779](matrix-org/synapse#13779))
- Return spec-compliant JSON errors when unknown endpoints are requested. ([\#14605](matrix-org/synapse#14605))
- Fix a long-standing bug where the room aliases returned could be corrupted. ([\#15038](matrix-org/synapse#15038))
- Fix a bug introduced in Synapse 1.76.0 where partially-joined rooms could not be deleted using the [purge room API](https://matrix-org.github.io/synapse/latest/admin_api/rooms.html#delete-room-api). ([\#15068](matrix-org/synapse#15068))
- Fix a long-standing bug where federated joins would fail if the first server in the list of servers to try is not in the room. ([\#15074](matrix-org/synapse#15074))
- Fix a bug introduced in Synapse v1.74.0 where searching with colons when using ICU for search term tokenisation would fail with an error. ([\#15079](matrix-org/synapse#15079))
- Reduce the likelihood of a rare race condition where rejoining a restricted room over federation would fail. ([\#15080](matrix-org/synapse#15080))
- Fix a bug introduced in Synapse 1.76 where workers would fail to start if the `health` listener was configured. ([\#15096](matrix-org/synapse#15096))
- Fix a bug introduced in Synapse 1.75 where the [portdb script](https://matrix-org.github.io/synapse/release-v1.78/postgres.html#porting-from-sqlite) would fail to run after a room had been faster-joined. ([\#15108](matrix-org/synapse#15108))

Improved Documentation
----------------------

- Document how to start Synapse with Poetry. Contributed by @thezaidbintariq. ([\#14892](matrix-org/synapse#14892), [\#15022](matrix-org/synapse#15022))
- Update delegation documentation to clarify that SRV DNS delegation does not eliminate all needs to serve files from .well-known locations. Contributed by @williamkray. ([\#14959](matrix-org/synapse#14959))
- Fix a mistake in registration_shared_secret_path docs. ([\#15078](matrix-org/synapse#15078))
- Refer to a more recent blog post on the [Database Maintenance Tools](https://matrix-org.github.io/synapse/latest/usage/administration/database_maintenance_tools.html) page. Contributed by @jahway603. ([\#15083](matrix-org/synapse#15083))

Internal Changes
----------------

- Re-type hint some collections as read-only. ([\#13755](matrix-org/synapse#13755))
- Faster joins: don't stall when another user joins during a partial-state room resync. ([\#14606](matrix-org/synapse#14606))
- Add a class `UnpersistedEventContext` to allow for the batching up of storing state groups. ([\#14675](matrix-org/synapse#14675))
- Add a check to ensure that locked dependencies have source distributions available. ([\#14742](matrix-org/synapse#14742))
- Tweak comment on `_is_local_room_accessible` as part of room visibility in `/hierarchy` to clarify the condition for a room being visible. ([\#14834](matrix-org/synapse#14834))
- Prevent `WARNING: there is already a transaction in progress` lines appearing in PostgreSQL's logs on some occasions. ([\#14840](matrix-org/synapse#14840))
- Use `StrCollection` to avoid potential bugs with `Collection[str]`. ([\#14929](matrix-org/synapse#14929))
- Improve performance of `/sync` in a few situations. ([\#14973](matrix-org/synapse#14973))
- Limit concurrent event creation for a room to avoid state resolution when sending bursts of events to a local room. ([\#14977](matrix-org/synapse#14977))
- Skip calculating unread push actions in /sync when enable_push is false. ([\#14980](matrix-org/synapse#14980))
- Add a schema dump symlinks inside `contrib`, to make it easier for IDEs to interrogate Synapse's database schema. ([\#14982](matrix-org/synapse#14982))
- Improve type hints. ([\#15008](matrix-org/synapse#15008), [\#15026](matrix-org/synapse#15026), [\#15027](matrix-org/synapse#15027), [\#15028](matrix-org/synapse#15028), [\#15031](matrix-org/synapse#15031), [\#15035](matrix-org/synapse#15035), [\#15052](matrix-org/synapse#15052), [\#15072](matrix-org/synapse#15072), [\#15084](matrix-org/synapse#15084))
- Update [MSC3952](matrix-org/matrix-spec-proposals#3952) support based on changes to the MSC. ([\#15037](matrix-org/synapse#15037))
- Avoid mutating a cached value in `get_user_devices_from_cache`. ([\#15040](matrix-org/synapse#15040))
- Fix a rare exception in logs on start up. ([\#15041](matrix-org/synapse#15041))
- Update pyo3-log to v0.8.1. ([\#15043](matrix-org/synapse#15043))
- Avoid mutating cached values in `_generate_sync_entry_for_account_data`. ([\#15047](matrix-org/synapse#15047))
- Refactor arguments of `try_unbind_threepid` and `_try_unbind_threepid_with_id_server` to not use dictionaries. ([\#15053](matrix-org/synapse#15053))
- Merge debug logging from the hotfixes branch. ([\#15054](matrix-org/synapse#15054))
- Faster joins: omit device list updates originating from partial state rooms in /sync responses without lazy loading of members enabled. ([\#15069](matrix-org/synapse#15069))
- Fix clashing database transaction name. ([\#15070](matrix-org/synapse#15070))
- Upper-bound frozendict dependency. This works around us being unable to test installing our wheels against Python 3.11 in CI. ([\#15114](matrix-org/synapse#15114))
- Tweak logging for when a worker waits for its view of a replication stream to catch up. ([\#15120](matrix-org/synapse#15120))

<details><summary>Locked dependency updates</summary>

- Bump bleach from 5.0.1 to 6.0.0. ([\#15059](matrix-org/synapse#15059))
- Bump cryptography from 38.0.4 to 39.0.1. ([\#15020](matrix-org/synapse#15020))
- Bump ruff version from 0.0.230 to 0.0.237. ([\#15033](matrix-org/synapse#15033))
- Bump dtolnay/rust-toolchain from 9cd00a88a73addc8617065438eff914dd08d0955 to 25dc93b901a87e864900a8aec6c12e9aa794c0c3. ([\#15060](matrix-org/synapse#15060))
- Bump systemd-python from 234 to 235. ([\#15061](matrix-org/synapse#15061))
- Bump serde_json from 1.0.92 to 1.0.93. ([\#15062](matrix-org/synapse#15062))
- Bump types-requests from 2.28.11.8 to 2.28.11.12. ([\#15063](matrix-org/synapse#15063))
- Bump types-pillow from 9.4.0.5 to 9.4.0.10. ([\#15064](matrix-org/synapse#15064))
- Bump sentry-sdk from 1.13.0 to 1.15.0. ([\#15065](matrix-org/synapse#15065))
- Bump types-jsonschema from 4.17.0.3 to 4.17.0.5. ([\#15099](matrix-org/synapse#15099))
- Bump types-bleach from 5.0.3.1 to 6.0.0.0. ([\#15100](matrix-org/synapse#15100))
- Bump dtolnay/rust-toolchain from 25dc93b901a87e864900a8aec6c12e9aa794c0c3 to e12eda571dc9a5ee5d58eecf4738ec291c66f295. ([\#15101](matrix-org/synapse#15101))
- Bump dawidd6/action-download-artifact from 2.24.3 to 2.25.0. ([\#15102](matrix-org/synapse#15102))
- Bump types-pillow from 9.4.0.10 to 9.4.0.13. ([\#15104](matrix-org/synapse#15104))
- Bump types-setuptools from 67.1.0.0 to 67.3.0.1. ([\#15105](matrix-org/synapse#15105))

</details>
This race occurs in the flakes seen in #14986.
matrix-org/complement#614 fixes the flakes by waiting for the remote homeserver to process the leave event before rejoining.