@kernel-patches-daemon-bpf-rc

Pull request for series with
subject: bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated.
version: 10
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1004438

@kernel-patches-daemon-bpf-rc

Upstream branch: 2dfd8b8
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1004438
version: 10

@kernel-patches-daemon-bpf-rc

Upstream branch: 55d5a51
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1004438
version: 10

@kernel-patches-daemon-bpf-rc

Upstream branch: 5e3fee3
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1004438
version: 10

q2ven added 2 commits October 3, 2025 19:59
If a socket has sk->sk_memcg with SK_MEMCG_EXCLUSIVE, it is decoupled
from the global protocol memory accounting.

This is controlled by net.core.memcg_exclusive sysctl, but it lacks
flexibility.

Let's support flagging (and clearing) SK_MEMCG_EXCLUSIVE via
bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook.

  u32 flags = SK_BPF_MEMCG_EXCLUSIVE;

  bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                 &flags, sizeof(flags));

As with net.core.memcg_exclusive, the flag is inherited by child sockets,
and BPF always takes precedence over sysctl at socket(2) and accept(2).
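
The precedence rule above can be pictured with a small userspace model.
This is only a sketch of the stated behaviour, not the kernel code: the
enum, the struct, and all function names are hypothetical, and it assumes
that a child simply reuses the listener's BPF choice at accept(2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: what a BPF program did at the sock_create hook. */
enum bpf_choice {
	CHOICE_UNSET,	/* no program attached, or it did not touch the flag */
	CHOICE_CLEAR,	/* program cleared SK_BPF_MEMCG_EXCLUSIVE */
	CHOICE_SET,	/* program set SK_BPF_MEMCG_EXCLUSIVE */
};

struct sock_model {
	enum bpf_choice bpf;	/* remembered so that children can inherit it */
	bool exclusive;		/* effective SK_MEMCG_EXCLUSIVE state */
};

/* BPF wins whenever it expressed a choice; otherwise fall back to sysctl. */
static bool resolve_exclusive(enum bpf_choice bpf, bool sysctl_exclusive)
{
	if (bpf == CHOICE_UNSET)
		return sysctl_exclusive;
	return bpf == CHOICE_SET;
}

/* Models socket(2). */
static struct sock_model model_socket(enum bpf_choice bpf,
				      bool sysctl_exclusive)
{
	struct sock_model sk = {
		.bpf = bpf,
		.exclusive = resolve_exclusive(bpf, sysctl_exclusive),
	};
	return sk;
}

/* Models accept(2): the child inherits the listener's BPF choice (assumed). */
static struct sock_model model_accept(const struct sock_model *listener,
				      bool sysctl_exclusive)
{
	assert(listener != NULL);
	return model_socket(listener->bpf, sysctl_exclusive);
}
```

In this model, a listener flagged by BPF yields exclusive children even
when the sysctl is 0, and a listener cleared by BPF yields non-exclusive
children even when the sysctl is 1.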

SK_BPF_MEMCG_FLAGS is supported only at BPF_CGROUP_INET_SOCK_CREATE
and not on other hooks, for the following reasons:

  1. UDP charges memory under sk->sk_receive_queue.lock instead
     of lock_sock()

  2. For TCP child sockets, memory accounting is adjusted only in
     __inet_accept(), to which sk->sk_memcg allocation is deferred

  3. Modifying the flag after skb is charged to sk requires such
     adjustment during bpf_setsockopt() and complicates the logic
     unnecessarily

We can support other hooks later if a real use case justifies that.

Most changes are inline and hard to trace, but a microbenchmark on
__sk_mem_raise_allocated() during neper/tcp_stream showed that more
samples completed faster with SK_MEMCG_EXCLUSIVE.  This will be more
visible under tcp_mem pressure.

  # bpftrace -e 'kprobe:__sk_mem_raise_allocated { @start[tid] = nsecs; }
    kretprobe:__sk_mem_raise_allocated /@start[tid]/
    { @end[tid] = nsecs - @start[tid]; @times = hist(@end[tid]); delete(@start[tid]); }'
  # tcp_stream -6 -F 1000 -N -T 256

Without bpf prog:

  [128, 256)          3846 |                                                    |
  [256, 512)       1505326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)        1371006 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
  [1K, 2K)          198207 |@@@@@@                                              |
  [2K, 4K)           31199 |@                                                   |

With bpf prog in the next patch:
  (must be attached before tcp_stream)
  # bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg type cgroup/sock_create
  # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/sk_memcg

  [128, 256)          6413 |                                                    |
  [256, 512)       1868425 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)        1101697 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
  [1K, 2K)          117031 |@@@@                                                |
  [2K, 4K)           11773 |                                                    |
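
To put a number on "more samples completed faster", the two histograms
above can be reduced to the share of samples that finished in under
512 ns.  The sketch below does that arithmetic; the bucket counts are
copied from the output above, while the struct and function names are
only illustrative.

```c
#include <assert.h>
#include <stddef.h>

struct bucket {
	unsigned long lo_ns;	/* inclusive lower bound */
	unsigned long hi_ns;	/* exclusive upper bound (1K = 1024) */
	unsigned long count;
};

static const struct bucket without_prog[] = {
	{  128,  256,    3846 },
	{  256,  512, 1505326 },
	{  512, 1024, 1371006 },
	{ 1024, 2048,  198207 },
	{ 2048, 4096,   31199 },
};

static const struct bucket with_prog[] = {
	{  128,  256,    6413 },
	{  256,  512, 1868425 },
	{  512, 1024, 1101697 },
	{ 1024, 2048,  117031 },
	{ 2048, 4096,   11773 },
};

/* Fraction of samples whose whole bucket lies below limit_ns. */
static double share_below(const struct bucket *hist, size_t n,
			  unsigned long limit_ns)
{
	unsigned long total = 0, fast = 0;

	for (size_t i = 0; i < n; i++) {
		total += hist[i].count;
		if (hist[i].hi_ns <= limit_ns)
			fast += hist[i].count;
	}
	assert(total > 0);
	return (double)fast / (double)total;
}
```

Calling share_below() with a 512 ns limit gives roughly 48.5% without
the bpf prog and roughly 60.4% with it.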

Signed-off-by: Kuniyuki Iwashima <[email protected]>

The test does the following for IPv4/IPv6 x TCP/UDP sockets
with/without SK_MEMCG_EXCLUSIVE, which can be turned on by
net.core.memcg_exclusive or bpf_setsockopt(SK_BPF_MEMCG_EXCLUSIVE).

  1. Create socket pairs
  2. Send NR_PAGES (32) of data (TCP consumes around 35 pages,
     and UDP consumes 66 pages due to skb overhead)
  3. Read memory_allocated from sk->sk_prot->memory_allocated and
     sk->sk_prot->memory_per_cpu_fw_alloc
  4. Check if unread data is charged to memory_allocated

If SK_MEMCG_EXCLUSIVE is set, memory_allocated should not be
changed, but we allow a small error (up to 10 pages) in case
other processes on the host use some amounts of TCP/UDP memory.
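
The check in step 4 can be sketched as a single predicate.  This is a
userspace illustration, not the selftest itself: NR_PAGES and the
10-page tolerance come from the text above, whereas the function names
and the non-exclusive threshold (an increase of at least NR_PAGES) are
assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PAGES	32	/* pages of data sent per socket pair */
#define MAX_NOISE_PAGES	10	/* tolerated change caused by other processes */

/*
 * before/after: memory_allocated sampled before and after sending data.
 * Returns true if the change matches what the socket's mode predicts.
 */
static bool charge_matches(long before, long after, bool exclusive)
{
	long delta = after - before;

	if (exclusive)
		return delta <= MAX_NOISE_PAGES;	/* not charged globally */

	return delta >= NR_PAGES;	/* assumed lower bound for a real charge */
}

/* Replays a few before/after pairs taken from the trace output. */
static void check_trace_samples(void)
{
	assert(charge_matches(0, 35, false));	/* TCP non-exclusive */
	assert(charge_matches(0, 0, true));	/* TCP exclusive */
	assert(charge_matches(0, 66, false));	/* UDP non-exclusive */
	assert(charge_matches(2, 2, true));	/* UDP exclusive, 2 pages leftover */
}
```

A one-sided bound is used for the exclusive case so that unrelated
sockets releasing memory during the test do not cause a failure.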

The number of allocated pages is buffered in the per-cpu variable
{tcp,udp}_memory_per_cpu_fw_alloc, up to +/- net.core.mem_pcpu_rsv,
before being reported to {tcp,udp}_memory_allocated.

In step 3, memory_allocated is calculated from these two variables at
the fentry of the socket create function.
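
The buffering is why both variables have to be read.  The following
userspace model is a simplified sketch of that behaviour, not the
kernel implementation: the struct, the function names, and the fixed
CPU count are hypothetical, and the flush condition reflects my reading
of the description above.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4	/* illustrative only */

struct proto_model {
	long memory_allocated;			/* global counter */
	long per_cpu_fw_alloc[MODEL_NR_CPUS];	/* per-cpu buffered deltas */
	long pcpu_rsv;				/* models net.core.mem_pcpu_rsv */
};

/* Charge (pages > 0) or uncharge (pages < 0) on a given CPU. */
static void model_charge(struct proto_model *p, int cpu, long pages)
{
	long local;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	p->per_cpu_fw_alloc[cpu] += pages;
	local = p->per_cpu_fw_alloc[cpu];

	/* Flush to the global counter only once the buffer hits the reserve. */
	if (local >= p->pcpu_rsv || local <= -p->pcpu_rsv) {
		p->memory_allocated += local;
		p->per_cpu_fw_alloc[cpu] = 0;
	}
}

/* What the test needs: the global counter plus every per-cpu buffer. */
static long model_total_allocated(const struct proto_model *p)
{
	long sum = p->memory_allocated;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += p->per_cpu_fw_alloc[cpu];

	return sum;
}
```

With a reserve of 256 pages (an illustrative value), charging 35 pages
on one CPU leaves the global counter at 0 while the summed total is 35,
so reading memory_allocated alone would miss the charge.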

We drain the receive queue only for UDP before close() because the UDP
receive queue is destroyed after an RCU grace period.  When I printed
memory_allocated, the UDP exclusive cases sometimes saw the non-exclusive
case's leftover, but it stayed within the small error range (<10 pages).

  bpf_trace_printk: memory_allocated: 0   <-- TCP non-exclusive
  bpf_trace_printk: memory_allocated: 35
  bpf_trace_printk: memory_allocated: 0   <-- TCP w/ sysctl
  bpf_trace_printk: memory_allocated: 0
  bpf_trace_printk: memory_allocated: 0   <-- TCP w/ bpf
  bpf_trace_printk: memory_allocated: 0
  bpf_trace_printk: memory_allocated: 0   <-- UDP non-exclusive
  bpf_trace_printk: memory_allocated: 66
  bpf_trace_printk: memory_allocated: 2   <-- UDP w/ sysctl (2 pages leftover)
  bpf_trace_printk: memory_allocated: 2
  bpf_trace_printk: memory_allocated: 2   <-- UDP w/ bpf (2 pages leftover)
  bpf_trace_printk: memory_allocated: 2

We prefer finishing the test quickly over sleeping long enough for
call_rcu() + sk_destruct() to complete.

The test completes within 2s on QEMU (64 CPUs) w/ KVM.

  # time ./test_progs -t sk_memcg
  #370/1   sk_memcg/TCP  :OK
  #370/2   sk_memcg/UDP  :OK
  #370/3   sk_memcg/TCPv6:OK
  #370/4   sk_memcg/UDPv6:OK
  #370     sk_memcg:OK
  Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

  real	0m1.609s
  user	0m0.167s
  sys	0m0.461s

Signed-off-by: Kuniyuki Iwashima <[email protected]>

@kernel-patches-daemon-bpf-rc

Upstream branch: cbf33b8
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1004438
version: 10

@kernel-patches-daemon-bpf-rc kernel-patches-daemon-bpf-rc bot force-pushed the bpf-net_base branch 3 times, most recently from 70944b4 to c891e41 Compare October 9, 2025 22:17
@kernel-patches-daemon-bpf-rc kernel-patches-daemon-bpf-rc bot force-pushed the bpf-net_base branch 8 times, most recently from 72af3c5 to c612c02 Compare October 17, 2025 23:22
@kernel-patches-daemon-bpf-rc kernel-patches-daemon-bpf-rc bot force-pushed the bpf-net_base branch 6 times, most recently from a5ba290 to 5c24747 Compare October 23, 2025 22:25
@kernel-patches-daemon-bpf-rc kernel-patches-daemon-bpf-rc bot force-pushed the bpf-net_base branch 4 times, most recently from 28726e6 to 68d8840 Compare November 11, 2025 19:50
@kernel-patches-daemon-bpf-rc kernel-patches-daemon-bpf-rc bot force-pushed the bpf-net_base branch 2 times, most recently from 043bbd6 to 45cfb7c Compare November 19, 2025 16:28