
Various changes needed to make network stack reset more reliable. #369

Open · wants to merge 2 commits into base: main
Conversation

@hlef (Contributor) commented Dec 6, 2024

To perform a network stack reset, we must go through all locks and futexes used by all threads in the network stack and set them for destruction. This means waking up all waiters, making them leave, and making it impossible for other threads to block on these locks and futexes again.

Some of these locks are dynamically allocated.

To be able to find them at reset time, we keep track of them in a linked list; during reset we walk the list and set all locks and futexes for destruction.

One current limitation of the network stack is that, if this list gets corrupted, resetting into a clean state will be impossible. This is a relevant attack vector as it allows a permanent DoS on the network stack.

In this PR, I introduce a mechanism to address this limitation.

  • Blocking in "steps": instead of attempting to acquire a lock with the full timeout in one chunk, try to acquire it in a loop of steps of duration step, until the timeout expires. This guarantees that the thread becomes live again within a bounded amount of time (step) and retries the acquisition: we can thus free the lock through heap-free-all, and have the "lost" thread crash when re-trying to acquire it (see the sketch below).
  • Blocking with "conditions": it is not always possible to free the lock through heap-free-all, e.g., if the lock was allocated with a caller's quota. In this case, we can pass the lock a condition that checks, at each "step", whether a reset is ongoing (e.g., very simply by checking a bit in the state machine word) to prevent the "lost" thread from blocking again.

This PR implements this in lock guards and event groups.
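For illustration, a minimal sketch of the stepped, condition-checked acquisition loop follows. It assumes a TryLockable lock exposing `try_lock(Timeout *)` and CHERIoT's `Timeout` with `remaining`/`elapsed` fields and an `elapse()` method; the helper name and structure are illustrative, not this PR's exact code.

#include <algorithm>

// Hypothetical helper: try to acquire `lock` in steps of at most `step`
// ticks until `timeout` expires or `condition` returns true.
template<typename Lock>
bool tryLockInSteps(Lock &lock, Timeout *timeout, Ticks step, auto condition)
    requires(TryLockable<Lock>)
{
    while (timeout->remaining > 0)
    {
        // Checked at every step: lets a "lost" thread notice that a
        // reset is ongoing and refuse to block again.
        if (condition())
        {
            return false;
        }
        // Block for at most one `step`, not the full timeout, so the
        // thread becomes live again within a bounded amount of time.
        Timeout stepTimeout{std::min<Ticks>(step, timeout->remaining)};
        bool    acquired = lock.try_lock(&stepTimeout);
        // Propagate the time spent in this step to the caller's timeout.
        timeout->elapse(stepTimeout.elapsed);
        if (acquired)
        {
            return true;
        }
    }
    return false;
}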

hlef added 2 commits December 5, 2024 17:02
Add two new constructors to the `LockGuard` class.

- Make it possible to specify a `step` argument: instead of attempting
  to acquire the lock with the timeout in one chunk, try to acquire the
  lock in a loop of steps of duration `step`, until the timeout expires.

- Make it possible to specify a `condition`: stop attempting to acquire
  the lock if the condition becomes `true`, checked at every `step`.

Both are necessary to improve the reliability of compartment reset for
the TCP/IP stack.

Specifically, steps allow us to handle the situation where a crash makes
us lose the reference to a lock on which a thread is blocked.  With
steps we are guaranteed that the thread will become live again in a
bounded amount of time (`step`), and try to re-acquire the lock: we can
thus free the lock through heap-free-all, and have the "lost" thread
crash when re-trying to acquire the lock.

However, it is not always possible to free the lock through
heap-free-all, e.g., if the lock was allocated with a caller's quota.
In this case, we can pass the lock a condition that checks if a reset
is ongoing (very simply, by checking a bit in the state machine word)
to prevent the "lost" thread from blocking again.

Signed-off-by: Hugo Lefeuvre <[email protected]>
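As a usage sketch of the condition-taking constructor (the state word and reset bit below are invented for illustration; whether LockGuard exposes a boolean conversion to report ownership is an assumption here, as is the step duration):

#include <atomic>
#include <cstdint>

std::atomic<uint32_t> stateWord;                      // hypothetical state machine word
constexpr uint32_t    ResetInProgressBit = 1U << 31;  // hypothetical reset flag

void withLock(FlagLock &lock, Timeout *timeout)
{
    // Re-check the reset bit at every step instead of sleeping out the
    // full timeout; give up on the lock as soon as a reset starts.
    LockGuard guard{lock, timeout, MS_TO_TICKS(500), []() {
        return (stateWord.load() & ResetInProgressBit) != 0;
    }};
    if (guard) // assumed boolean conversion: did we acquire the lock?
    {
        // ...critical section...
    }
}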
Event groups are used by the TCP/IP stack. Unfortunately, if a thread
somehow corrupts the socket list, we lose the ability to retrieve
references to event groups, ultimately preventing us from unblocking
threads blocked on event group locks and futexes.

Fortunately, the TCP/IP stack can heap-free-all event groups.  Thus we
need any wait done on event group locks and futexes to be performed in
steps (see recent additions to the `LockGuard` class) to ensure that
blocking threads will crash within a bounded amount of time when a reset
is triggered.

Signed-off-by: Hugo Lefeuvre <[email protected]>
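For illustration, a stepped wait on an event group futex might look like the sketch below. It assumes CHERIoT's `futex_timed_wait(Timeout *, const uint32_t *, uint32_t)` and the same `Timeout` fields as above; the helper itself is hypothetical.

#include <algorithm>
#include <errno.h>

// Hypothetical helper: wait on a futex word in bounded steps so that, if
// the event group backing `address` is freed by heap-free-all during a
// reset, the waiter faults within one `step` rather than sleeping out
// the full timeout.
int futexWaitInSteps(Timeout        *timeout,
                     const uint32_t *address,
                     uint32_t        expected,
                     Ticks           step)
{
    while (*address == expected && timeout->remaining > 0)
    {
        Timeout stepTimeout{std::min<Ticks>(step, timeout->remaining)};
        int     ret = futex_timed_wait(&stepTimeout, address, expected);
        timeout->elapse(stepTimeout.elapsed);
        // A step that merely timed out is not an error; keep looping.
        if ((ret != 0) && (ret != -ETIMEDOUT))
        {
            return ret;
        }
    }
    return 0;
}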
@hlef (Contributor, Author) commented Dec 6, 2024

Note that the related network stack PR is CHERIoT-Platform/network-stack#54.

Comment on lines +285 to +288
/// Constructor, attempts to acquire the lock with a timeout, in steps
/// of a given duration, checking the `condition` function at every
/// iteration. If `condition` returns `true`, stop trying to acquire
/// the lock.
Collaborator
Please use /** for doc comments that are longer than a single line.
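For instance, the quoted comment as a block doc comment:

/**
 * Constructor, attempts to acquire the lock with a timeout, in steps
 * of a given duration, checking the `condition` function at every
 * iteration. If `condition` returns `true`, stop trying to acquire
 * the lock.
 */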

[[nodiscard]] explicit LockGuard(Lock    &lock,
                                 Timeout *timeout,
                                 Ticks    step,
                                 auto     condition) requires(TryLockable<Lock>)
Collaborator

Suggested change
- auto condition) requires(TryLockable<Lock>)
+ auto condition = []() { return false; }) requires(TryLockable<Lock>)

Should eliminate the need for the other constructor overload.
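A sketch of the merged constructor under this suggestion (the isOwned member and the tryLockInSteps helper from the description above are assumptions, not this PR's code):

[[nodiscard]] explicit LockGuard(Lock    &lock,
                                 Timeout *timeout,
                                 Ticks    step,
                                 auto     condition = []() { return false; })
    requires(TryLockable<Lock>)
{
    // The default `condition` never fires, so callers that omit it get
    // plain stepped acquisition from this single overload.
    isOwned = tryLockInSteps(lock, timeout, step, condition);
}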

* will become live again within a bounded amount of time, and crash as it is
* trying to re-acquire the lock if we free it through heap-free-all.
*/
static constexpr const Ticks EventGroupLockTimeoutStep = MS_TO_TICKS(500);
Collaborator

It would be nice to make this configurable somehow. Waking up every 500ms will really hurt power efficiency for a load of things that are mostly asleep.

Maybe make it a parameter and, in the FreeRTOS wrapper, expose a define for configuring it? That way the network stack can set it to a smallish value.
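One possible shape for that knob (the macro name below is invented for illustration):

// In the FreeRTOS compatibility wrapper: let the step be configured at
// build time, defaulting to a power-friendly value; the network stack
// can override it with something smaller.
#ifndef CHERIOT_EVENT_GROUP_TIMEOUT_STEP_MS
#define CHERIOT_EVENT_GROUP_TIMEOUT_STEP_MS 500
#endif

static constexpr const Ticks EventGroupLockTimeoutStep =
    MS_TO_TICKS(CHERIOT_EVENT_GROUP_TIMEOUT_STEP_MS);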

@hlef (Contributor, Author)

Sounds good, agree.
