Trigger CI #8831

Open: wants to merge 6 commits into bpf-next_base from jrife/socket-iterators-udp
Conversation

@jrife (Contributor) commented Apr 23, 2025

No description provided.

@jrife force-pushed the jrife/socket-iterators-udp branch 2 times, most recently from 8702a77 to 31c00a5, on April 23, 2025 16:40
@kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 3 times, most recently from 8b59228 to 9c93586, on April 23, 2025 18:32
@jrife force-pushed the jrife/socket-iterators-udp branch from 31c00a5 to f7c0501 on April 23, 2025 23:30
@kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 4 times, most recently from c7849f3 to a041a61, on April 25, 2025 16:28
@jrife force-pushed the jrife/socket-iterators-udp branch from f7c0501 to ad6c848 on April 26, 2025 00:03
jrife added 3 commits April 25, 2025 17:29
Prepare for the next patch which needs to be able to choose either
GFP_USER or GFP_ATOMIC for calls to bpf_iter_udp_realloc_batch.

Signed-off-by: Jordan Rife <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Require that iter->batch always contains a full bucket snapshot. This
invariant is important to avoid skipping or repeating sockets during
iteration when combined with the next few patches. Before, there were
two cases where a call to bpf_iter_udp_batch may only capture part of a
bucket:

1. When bpf_iter_udp_realloc_batch() returns -ENOMEM [1].
2. When more sockets are added to the bucket while calling
   bpf_iter_udp_realloc_batch(), making the updated batch size
   insufficient [2].

In cases where the batch size only covers part of a bucket, it is
possible to forget which sockets were already visited, especially if we
have to process a bucket in more than two batches. This forces us to
choose between repeating or skipping sockets, so don't allow this:

1. Stop iteration and propagate -ENOMEM up to userspace if reallocation
   fails instead of continuing with a partial batch.
2. Try bpf_iter_udp_realloc_batch() with GFP_USER just as before, but if
   we still aren't able to capture the full bucket, call
   bpf_iter_udp_realloc_batch() again while holding the bucket lock to
   guarantee the bucket does not change. On the second attempt use
   GFP_NOWAIT since we hold onto the spin lock.
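The two-attempt flow above can be sketched in userspace C. This is an illustrative simplification, not the kernel code: the names (fill_batch, grow) and the racing_adds parameter are invented here to model the race. Attempt 1 grows the array "outside the lock", so the bucket may gain sockets before the size is rechecked; attempt 2 models the grow done "under the bucket lock", after which the bucket cannot change, so a successful grow guarantees a full snapshot.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct batch {
	int *sk;	/* captured sockets (stand-in for struct sock *) */
	size_t cap;	/* current capacity */
	size_t cnt;	/* sockets captured */
};

static int grow(struct batch *b, size_t want)
{
	int *p = realloc(b->sk, want * sizeof(*p));

	if (!p)
		return -ENOMEM;	/* propagate to userspace, don't continue */
	b->sk = p;
	b->cap = want;
	return 0;
}

/* Try to snapshot a whole bucket that currently holds bucket_sz sockets;
 * racing_adds models sockets added while the lock was dropped. */
static int fill_batch(struct batch *b, size_t bucket_sz, size_t racing_adds)
{
	if (b->cap < bucket_sz) {
		/* Attempt 1: "GFP_USER" grow with the lock dropped. */
		if (grow(b, bucket_sz))
			return -ENOMEM;
		bucket_sz += racing_adds;	/* bucket changed meanwhile */
	}
	if (b->cap < bucket_sz) {
		/* Attempt 2: "GFP_NOWAIT" grow with the lock held; the
		 * bucket is now stable, so success means a full snapshot. */
		if (grow(b, bucket_sz))
			return -ENOMEM;
	}
	b->cnt = bucket_sz;	/* full-bucket invariant holds */
	return 0;
}
```

Either grow failing aborts the batch with -ENOMEM rather than leaving a partial snapshot, mirroring remedy 1 above.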

Introduce the udp_portaddr_for_each_entry_from macro and use it instead
of udp_portaddr_for_each_entry to make it possible to continue iteration
from an arbitrary socket. This is required both for this patch, where
the GFP_NOWAIT case fills the rest of a batch starting from the middle
of a bucket, and for a later patch that skips sockets that were already
seen.
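The difference between the two macro shapes can be shown with a minimal userspace analogue (the list type and macro names here are illustrative, not the kernel's hlist-based definitions): the "_from" variant takes an already-positioned cursor instead of a list head, which is what lets batching resume mid-bucket.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int cookie;
	struct node *next;
};

/* Plain variant: always walks from the head of the list. */
#define for_each_entry(pos, head) \
	for ((pos) = (head); (pos); (pos) = (pos)->next)

/* "_from" variant: pos is already positioned; walk from there on. */
#define for_each_entry_from(pos) \
	for (; (pos); (pos) = (pos)->next)

/* Count the sockets remaining from an arbitrary starting node. */
static int count_from(struct node *start)
{
	struct node *pos = start;
	int n = 0;

	for_each_entry_from(pos)
		n++;
	return n;
}
```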

Testing all scenarios directly is a bit difficult, but I did some manual
testing to exercise the code paths where GFP_NOWAIT is used and where
ERR_PTR(err) is returned. I used the realloc test case included later
in this series to trigger a scenario where a realloc happens inside
bpf_iter_udp_batch and made a small code tweak to force the first
realloc attempt to allocate a too-small batch, thus requiring
another attempt with GFP_NOWAIT. Some printks showed both reallocs with
the tests passing:

Apr 25 23:16:24 crow kernel: go again GFP_USER
Apr 25 23:16:24 crow kernel: go again GFP_NOWAIT

With this setup, I also forced each of the bpf_iter_udp_realloc_batch
calls to return -ENOMEM to ensure that iteration ends and that the
read() in userspace fails.

[1]: https://lore.kernel.org/bpf/CABi4-ogUtMrH8-NVB6W8Xg_F_KDLq=yy-yu-tKr2udXE2Mu1Lg@mail.gmail.com/
[2]: https://lore.kernel.org/bpf/[email protected]/

Signed-off-by: Jordan Rife <[email protected]>
Prepare for the next patch that tracks cookies between iterations by
converting struct sock **batch to union bpf_udp_iter_batch_item *batch
inside struct bpf_udp_iter_state.

Signed-off-by: Jordan Rife <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
@jrife force-pushed the jrife/socket-iterators-udp branch 4 times, most recently from c35da03 to a29135b, on April 26, 2025 16:31
jrife added 3 commits April 26, 2025 09:39
Replace the offset-based approach for tracking progress through a bucket
in the UDP table with one based on socket cookies. Remember the cookies
of unprocessed sockets from the last batch and use this list to
pick up where we left off or, in the case that the next socket
disappears between reads, find the first socket after that point that
still exists in the bucket and resume from there.

This approach guarantees that all sockets that existed when iteration
began and continue to exist throughout will be visited exactly once.
Sockets that are added to the table during iteration may or may not be
seen, but if they are they will be seen exactly once.
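The resume step can be sketched as a small userspace function. This is a simplification under stated assumptions: find_resume() and the plain cookie arrays are illustrative stand-ins for the kernel's batch and bucket structures, but the rule is the one described above: scan the remembered (unvisited) cookies in order and resume at the first one still present in the bucket.

```c
#include <assert.h>
#include <stddef.h>

/* Return the bucket index to resume from, or -1 if every remembered
 * socket is gone; in that case anything left in the bucket is either
 * newly added or already visited, so it is safe to move to the next
 * bucket. */
static int find_resume(const unsigned long *remembered, size_t n_rem,
		       const unsigned long *bucket, size_t n_bkt)
{
	for (size_t i = 0; i < n_rem; i++)
		for (size_t j = 0; j < n_bkt; j++)
			if (bucket[j] == remembered[i])
				return (int)j;
	return -1;
}
```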

Initialize iter->st_bucket_done to true and iter->state.bucket to -1 to
ensure that on the first call to bpf_iter_udp_batch, the resume_bucket
case is not hit. It's not strictly accurate that we are resuming from
bucket zero when we create the first batch, and this avoids adding
special case logic for just that bucket.

Explanation
-----------
(1) New sockets can only be appended or prepended to a bucket's list.
(2) bpf_iter_udp_batch always captures a "complete" snapshot of a bucket
    from some socket until the end of the bucket's list at some point
    in time; iter->batch[iter->end_sk-1] always represents the last
    socket that existed in the bucket at that point in time.

If iter->bucket == resume_bucket, bpf_iter_udp_resume will find the
first (scanning in iteration order) unvisited socket in resume_bucket's
list that existed last time. If it finds nothing (returns NULL), all
sockets that we captured in resume_bucket last time are gone. From (1)
and (2) we can be sure that anything left is either newly added or
already visited. In either case, our guarantee is met and it's safe to
move on to the next bucket. Similarly, if it finds a socket, from (1) we
can be sure that any socket before that socket in the list is already
visited or is newly added; continuing from that socket will capture
anything that we haven't yet visited and maybe some sockets that were
just added. Again, our guarantee is met.

(3) Over multiple iterations of the same bucket, sockets from the
    original snapshot carry forward until they have all been visited.

If iter->bucket != resume_bucket, we've never visited this bucket or its
sockets before, so we can just capture everything to create our initial
snapshot.

One special case is reallocation, particularly after the GFP_USER case
where we release the bucket lock and go back to the beginning of the
loop. The corner cases here are subtle. Prior to reallocation, (3) is
still true; iter->batch[iter->cur_sk] to iter->batch[iter->end_sk-1]
contains all sockets we haven't visited yet from the original snapshot
or possibly sockets that were added after our initial snapshot. After
the GFP_USER realloc:

If state->bucket == resume_bucket, bpf_iter_udp_resume runs again and
finds our starting point. The list might have changed in the short time
that we released resume_bucket's lock, but this is fine. It may seem
that, since we only captured part of resume_bucket, there may be some
sockets we would skip if bpf_iter_udp_resume returns NULL and we proceed
to the next bucket; however, from (1), all unvisited sockets from the
original snapshot of resume_bucket must currently be represented in
iter->batch. It's safe to move on and skip the other sockets which must
have been added afterwards.

If state->bucket != resume_bucket, in which case this is the first time
we're capturing this bucket, we just need to capture the whole bucket,
so we realloc until we can grab all the sockets. This is the simple
case.

Signed-off-by: Jordan Rife <[email protected]>
Extend the iter_udp_soreuse and iter_tcp_soreuse programs to write the
cookie of the current socket, so that we can track the identity of the
sockets that the iterator has seen so far. Update the existing do_test
function to account for this change to the iterator program output. At
the same time, teach both programs to work with AF_INET as well.

Signed-off-by: Jordan Rife <[email protected]>
Introduce a set of tests that exercise various bucket resume scenarios:

* remove_seen resumes iteration after removing a socket from the bucket
  that we've already processed. Before, with the offset-based approach,
  this test would have skipped an unseen socket after resuming
  iteration. With the cookie-based approach, we now see all sockets
  exactly once.
* remove_unseen exercises the condition where the next socket that we
  would have seen is removed from the bucket before we resume iteration.
  This tests the scenario where we need to scan past the first cookie in
  our remembered cookies list to find the socket from which to resume
  iteration.
* remove_all exercises the condition where all sockets we remembered
  were removed from the bucket to make sure iteration terminates and
  returns no more results.
* add_some exercises the condition where a few sockets, not enough to
  trigger a realloc, are added to the head of the current bucket
  between reads. Before, with the offset-based approach, this test would
  have repeated sockets we've already seen. With the cookie-based
  approach, we now see all sockets exactly once.
* force_realloc exercises the condition that we need to realloc the
  batch on a subsequent read, since more sockets than can be held in the
  current batch array were added to the current bucket. This exercises
  the logic inside bpf_iter_udp_realloc_batch that copies cookies into
  the new batch to make sure nothing is skipped or repeated.

Signed-off-by: Jordan Rife <[email protected]>
@jrife force-pushed the jrife/socket-iterators-udp branch from a29135b to bfe44fa on April 26, 2025 16:41