
Various changes needed to make network stack reset more reliable. #369

Open · wants to merge 2 commits into base: main
Conversation

@hlef (Contributor) commented Dec 6, 2024

To perform a network stack reset, we must go through all locks and futexes used by all threads in the network stack and set them for destruction. This means waking up all waiters, making them leave, and making it impossible for other threads to block on these locks and futexes again.

Some of these locks are dynamically allocated.

To be able to find them at reset time, we keep track of them in a linked list; during reset we walk the list and set all locks and futexes for destruction.

One current limitation of the network stack is that, if this list gets corrupted, resetting into a clean state will be impossible. This is a relevant attack vector as it allows a permanent DoS on the network stack.

In this PR, I introduce a mechanism to address this limitation.

  • Blocking in "steps": instead of attempting to acquire a lock with the full timeout in one chunk, try to acquire it in a loop of steps of duration step, until the timeout expires. This guarantees that the thread becomes live again within a bounded amount of time (step) and retries the acquisition: we can thus free the lock through heap-free-all, and have the "lost" thread crash when re-trying to acquire it (see the sketch below).
  • Blocking with "conditions": it is not always possible to free the lock through heap-free-all, e.g., if the lock was allocated with a caller's quota. In this case, we can pass the lock a condition that checks, at each "step", whether a reset is ongoing (e.g., very simply by checking a bit in the state machine word) to prevent the "lost" thread from blocking again.

This PR implements this in lock guards and event groups.
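For illustration, a minimal sketch of the stepped, condition-checked acquisition loop follows. It assumes a TryLockable lock exposing `try_lock(Timeout *)` and CHERIoT's `Timeout` with `remaining`/`elapsed` fields and an `elapse()` method; the helper name and structure are illustrative, not this PR's exact code.

#include <algorithm>

// Hypothetical helper: try to acquire `lock` in steps of at most `step`
// ticks until `timeout` expires or `condition` returns true.
template<typename Lock>
bool tryLockInSteps(Lock &lock, Timeout *timeout, Ticks step, auto condition)
    requires(TryLockable<Lock>)
{
    while (timeout->remaining > 0)
    {
        // Checked at every step: lets a "lost" thread notice that a
        // reset is ongoing and refuse to block again.
        if (condition())
        {
            return false;
        }
        // Block for at most one `step`, not the full timeout, so the
        // thread becomes live again within a bounded amount of time.
        Timeout stepTimeout{std::min<Ticks>(step, timeout->remaining)};
        bool    acquired = lock.try_lock(&stepTimeout);
        // Propagate the time spent in this step to the caller's timeout.
        timeout->elapse(stepTimeout.elapsed);
        if (acquired)
        {
            return true;
        }
    }
    return false;
}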

hlef added 2 commits December 5, 2024 17:02
Add two new constructors to the `LockGuard` class.

- Make it possible to specify a `step` argument: instead of attempting
  to acquire the lock with the timeout in one chunk, try to acquire the
  lock in a loop of steps of duration `step`, until the timeout expires.

- Make it possible to specify a `condition`: stop attempting to acquire
  the lock if the condition becomes `true`, checked at every `step`.

Both are necessary to improve the reliability of compartment reset for
the TCP/IP stack.

Specifically, steps allow us to handle the situation where a crash makes
us lose the reference to a lock on which a thread is blocked.  With
steps we are guaranteed that the thread will become live again in a
bounded amount of time (`step`), and try to re-acquire the lock: we can
thus free the lock through heap-free-all, and have the "lost" thread
crash when re-trying to acquire the lock.

However, it is not always possible to free the lock through
heap-free-all, e.g., if the lock was allocated with a caller's quota.
In this case, we can pass the lock a condition that checks if a reset
is ongoing (very simply, by checking a bit in the state machine word)
to prevent the "lost" thread from blocking again.

Signed-off-by: Hugo Lefeuvre <[email protected]>
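As a usage sketch of the condition-taking constructor (the state word and reset bit below are invented for illustration; whether LockGuard exposes a boolean conversion to report ownership is an assumption here, as is the step duration):

#include <atomic>
#include <cstdint>

std::atomic<uint32_t> stateWord;                      // hypothetical state machine word
constexpr uint32_t    ResetInProgressBit = 1U << 31;  // hypothetical reset flag

void withLock(FlagLock &lock, Timeout *timeout)
{
    // Re-check the reset bit at every step instead of sleeping out the
    // full timeout; give up on the lock as soon as a reset starts.
    LockGuard guard{lock, timeout, MS_TO_TICKS(500), []() {
        return (stateWord.load() & ResetInProgressBit) != 0;
    }};
    if (guard) // assumed boolean conversion: did we acquire the lock?
    {
        // ...critical section...
    }
}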
Event groups are used by the TCP/IP stack. Unfortunately, if a thread
somehow corrupts the socket list, we lose the ability to retrieve
references to event groups, ultimately preventing us from unblocking
threads blocked on event group locks and futexes.

Fortunately, the TCP/IP stack can heap-free-all event groups.  Thus we
need any wait done on event group locks and futexes to be performed in
steps (see recent additions to the `LockGuard` class) to ensure that
blocking threads will crash within a bounded amount of time when a reset
is triggered.

Signed-off-by: Hugo Lefeuvre <[email protected]>
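For illustration, a stepped wait on an event group futex might look like the sketch below. It assumes CHERIoT's `futex_timed_wait(Timeout *, const uint32_t *, uint32_t)` and the same `Timeout` fields as above; the helper itself is hypothetical.

#include <algorithm>
#include <errno.h>

// Hypothetical helper: wait on a futex word in bounded steps so that, if
// the event group backing `address` is freed by heap-free-all during a
// reset, the waiter faults within one `step` rather than sleeping out
// the full timeout.
int futexWaitInSteps(Timeout        *timeout,
                     const uint32_t *address,
                     uint32_t        expected,
                     Ticks           step)
{
    while (*address == expected && timeout->remaining > 0)
    {
        Timeout stepTimeout{std::min<Ticks>(step, timeout->remaining)};
        int     ret = futex_timed_wait(&stepTimeout, address, expected);
        timeout->elapse(stepTimeout.elapsed);
        // A step that merely timed out is not an error; keep looping.
        if ((ret != 0) && (ret != -ETIMEDOUT))
        {
            return ret;
        }
    }
    return 0;
}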
@hlef (Contributor, Author) commented Dec 6, 2024

Note that the related network stack PR is CHERIoT-Platform/network-stack#54.

Comment on lines +285 to +288
/// Constructor, attempts to acquire the lock with a timeout, in steps
/// of a given duration, checking the `condition` function at every
/// iteration. If `condition` returns `true`, stop trying to acquire
/// the lock.
Collaborator
Please use /** for doc comments that are longer than a single line.
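For instance, the quoted comment as a block doc comment:

/**
 * Constructor, attempts to acquire the lock with a timeout, in steps
 * of a given duration, checking the `condition` function at every
 * iteration. If `condition` returns `true`, stop trying to acquire
 * the lock.
 */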

[[nodiscard]] explicit LockGuard(Lock    &lock,
                                 Timeout *timeout,
                                 Ticks    step,
                                 auto     condition) requires(TryLockable<Lock>)
Collaborator

Suggested change
- auto condition) requires(TryLockable<Lock>)
+ auto condition = []() { return false; }) requires(TryLockable<Lock>)

Should eliminate the need for the other constructor overload.
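A sketch of the merged constructor under this suggestion (the isOwned member and the tryLockInSteps helper from the description above are assumptions, not this PR's code):

[[nodiscard]] explicit LockGuard(Lock    &lock,
                                 Timeout *timeout,
                                 Ticks    step,
                                 auto     condition = []() { return false; })
    requires(TryLockable<Lock>)
{
    // The default `condition` never fires, so callers that omit it get
    // plain stepped acquisition from this single overload.
    isOwned = tryLockInSteps(lock, timeout, step, condition);
}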

* will become live again within a bounded amount of time, and crash as it is
* trying to re-acquire the lock if we free it through heap-free-all.
*/
static constexpr const Ticks EventGroupLockTimeoutStep = MS_TO_TICKS(500);
Collaborator

It would be nice to make this configurable somehow. Waking up every 500ms will really hurt power efficiency for a load of things that are mostly asleep.

Maybe make it a parameter and, in the FreeRTOS wrapper, expose a define for configuring it? That way the network stack can set it to a smallish value.
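One possible shape for that knob (the macro name below is invented for illustration):

// In the FreeRTOS compatibility wrapper: let the step be configured at
// build time, defaulting to a power-friendly value; the network stack
// can override it with something smaller.
#ifndef CHERIOT_EVENT_GROUP_TIMEOUT_STEP_MS
#define CHERIOT_EVENT_GROUP_TIMEOUT_STEP_MS 500
#endif

static constexpr const Ticks EventGroupLockTimeoutStep =
    MS_TO_TICKS(CHERIOT_EVENT_GROUP_TIMEOUT_STEP_MS);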

@hlef (Contributor, Author)

Sounds good, agree.
