Commit ce3c49d

Author: Alexei Starovoitov (committed)
Merge branch 'bpf-fix-the-release-of-inner-map'
Hou Tao says:

====================
bpf: Fix the release of inner map

From: Hou Tao <[email protected]>

Hi,

This patchset fixes the release of inner maps in a map array or map htab. The release of an inner map differs from that of a normal map. A normal map is released after the bpf program which uses it is destroyed, because the program tracks the maps it uses. A bpf program cannot track its used inner maps, however, because inner maps may be updated or deleted dynamically. For now the ref-counter of an inner map is decreased as soon as the inner map is removed from the outer map, so the inner map may be freed before the bpf program which is accessing it exits, and there will be a use-after-free problem, as demonstrated by patch #6.

The patchset fixes the problem by deferring the release of the inner map. The freeing of the inner map is deferred according to the sleepable attributes of the bpf programs which own the outer map.

Patch #1 fixes the warning when running the newly-added selftest under interpreter mode.
Patch #2 adds more parameters to .map_fd_put_ptr() to prepare for the fix.
Patch #3 fixes the incorrect value of need_defer when freeing the fd array.
Patch #4 fixes the potential use-after-free problem by using call_rcu_tasks_trace() and call_rcu() to wait for one tasks trace RCU GP and one RCU GP unconditionally.
Patch #5 optimizes the freeing of the inner map by removing the unnecessary RCU GP waiting.
Patch #6 adds a selftest to demonstrate the potential use-after-free problem.
Patch #7 updates a selftest to update the outer map in a syscall bpf program.

Please see the individual patches for more details. Comments are always welcome.

Change Log:

v5:
* patch #3: rename fd_array_map_delete_elem_with_deferred_free() to __fd_array_map_delete_elem() (Alexei)
* patch #5: use atomic64_t instead of atomic_t to prevent potential overflow (Alexei)
* patch #7: use ptr_to_u64() helper instead of force casting to initialize pointers in bpf_attr (Alexei)

v4: https://lore.kernel.org/bpf/[email protected]
* patch #2: don't use "deferred", use "need_defer" uniformly
* patch #3: newly added; fix the incorrect value of need_defer during fd array free
* patch #4: don't consider the case in which the bpf map is not used by any bpf program, and only use sleepable_refcnt to remove the unnecessary tasks trace RCU GP (Alexei)
* patch #4: remove memory barriers that were added out of caution (Alexei)

v3: https://lore.kernel.org/bpf/[email protected]
* multiple variable renamings (Martin)
* define BPF_MAP_RCU_GP/BPF_MAP_RCU_TT_GP as bits (Martin)
* use call_rcu() and its variants instead of synchronize_rcu() (Martin)
* remove unnecessary mask in bpf_map_free_deferred() (Martin)
* place atomic_or() and the related smp_mb() together (Martin)
* add patch #6 to demonstrate that updating the outer map in a syscall program is dead-lock free (Alexei)
* update comments about the memory barrier in bpf_map_fd_put_ptr()
* update commit messages for patches #3 and #4 to describe more details

v2: https://lore.kernel.org/bpf/[email protected]
* defer the invocation of ops->map_free() instead of bpf_map_put() (Martin)
* update the selftest to make it reproducible under JIT mode (Martin)
* remove unnecessary preparatory patches

v1: https://lore.kernel.org/bpf/[email protected]
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
2 parents: 153de60 + e3dd408
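To make the race concrete, here is a minimal sketch (not part of the series; the map layout and attach point are illustrative) of a sleepable program reading an inner map through an outer map. Userspace deleting or replacing slot 0 while this program runs is exactly the case patch #4 covers: before this series the inner map could be freed after a single normal RCU grace period, while the sleepable program is protected only by RCU tasks trace.

/* Illustrative BPF-side sketch of the use-after-free window. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct inner_map {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} inner_map1 SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1);
	__type(key, __u32);
	__array(values, struct inner_map);
} outer_map SEC(".maps") = {
	.values = { &inner_map1 },
};

SEC("fentry.s/bpf_fentry_test1")	/* ".s": sleepable, tasks trace RCU */
int read_inner(void *ctx)
{
	__u32 key = 0;
	void *inner;

	inner = bpf_map_lookup_elem(&outer_map, &key);
	if (inner)
		/* the inner map must stay alive until this program exits */
		bpf_map_lookup_elem(inner, &key);
	return 0;
}

char _license[] SEC("license") = "GPL";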

File tree: 13 files changed, +453 −41 lines


include/linux/bpf.h

Lines changed: 13 additions & 2 deletions

@@ -106,7 +106,11 @@ struct bpf_map_ops {
 	/* funcs called by prog_array and perf_event_array map */
 	void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
 				int fd);
-	void (*map_fd_put_ptr)(void *ptr);
+	/* If need_defer is true, the implementation should guarantee that
+	 * the to-be-put element is still alive before the bpf program, which
+	 * may manipulate it, exits.
+	 */
+	void (*map_fd_put_ptr)(struct bpf_map *map, void *ptr, bool need_defer);
 	int (*map_gen_lookup)(struct bpf_map *map, struct bpf_insn *insn_buf);
 	u32 (*map_fd_sys_lookup_elem)(void *ptr);
 	void (*map_seq_show_elem)(struct bpf_map *map, void *key,
@@ -272,7 +276,11 @@ struct bpf_map {
 	 */
 	atomic64_t refcnt ____cacheline_aligned;
 	atomic64_t usercnt;
-	struct work_struct work;
+	/* rcu is used before freeing and work is only used during freeing */
+	union {
+		struct work_struct work;
+		struct rcu_head rcu;
+	};
 	struct mutex freeze_mutex;
 	atomic64_t writecnt;
 	/* 'Ownership' of program-containing map is claimed by the first program
@@ -288,6 +296,9 @@ struct bpf_map {
 	} owner;
 	bool bypass_spec_v1;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
+	bool free_after_mult_rcu_gp;
+	bool free_after_rcu_gp;
+	atomic64_t sleepable_refcnt;
 	s64 __percpu *elem_count;
 };
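The new union relies on its two members never being live at the same time: map->rcu is in use only until the RCU callback runs, and that callback then reinitializes the same storage as map->work. A condensed sketch of the hand-off (it mirrors bpf_map_free_rcu_gp() plus bpf_map_free_in_work() from kernel/bpf/syscall.c later in this merge; the function name here is illustrative):

/* Sketch of the work/rcu union hand-off. */
static void map_free_rcu_cb(struct rcu_head *rcu)
{
	struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);

	/* The rcu_head is no longer used past this point, so its storage
	 * can be re-initialized as the work_struct union member.
	 */
	INIT_WORK(&map->work, bpf_map_free_deferred);
	queue_work(system_unbound_wq, &map->work);
}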

kernel/bpf/arraymap.c

Lines changed: 20 additions & 13 deletions

@@ -867,11 +867,11 @@ int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file,
 	}

 	if (old_ptr)
-		map->ops->map_fd_put_ptr(old_ptr);
+		map->ops->map_fd_put_ptr(map, old_ptr, true);
 	return 0;
 }

-static long fd_array_map_delete_elem(struct bpf_map *map, void *key)
+static long __fd_array_map_delete_elem(struct bpf_map *map, void *key, bool need_defer)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	void *old_ptr;
@@ -890,13 +890,18 @@ static long fd_array_map_delete_elem(struct bpf_map *map, void *key)
 	}

 	if (old_ptr) {
-		map->ops->map_fd_put_ptr(old_ptr);
+		map->ops->map_fd_put_ptr(map, old_ptr, need_defer);
 		return 0;
 	} else {
 		return -ENOENT;
 	}
 }

+static long fd_array_map_delete_elem(struct bpf_map *map, void *key)
+{
+	return __fd_array_map_delete_elem(map, key, true);
+}
+
 static void *prog_fd_array_get_ptr(struct bpf_map *map,
 				   struct file *map_file, int fd)
 {
@@ -913,8 +918,9 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
 	return prog;
 }

-static void prog_fd_array_put_ptr(void *ptr)
+static void prog_fd_array_put_ptr(struct bpf_map *map, void *ptr, bool need_defer)
 {
+	/* bpf_prog is freed after one RCU or tasks trace grace period */
 	bpf_prog_put(ptr);
 }

@@ -924,13 +930,13 @@ static u32 prog_fd_array_sys_lookup_elem(void *ptr)
 }

 /* decrement refcnt of all bpf_progs that are stored in this map */
-static void bpf_fd_array_map_clear(struct bpf_map *map)
+static void bpf_fd_array_map_clear(struct bpf_map *map, bool need_defer)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;

 	for (i = 0; i < array->map.max_entries; i++)
-		fd_array_map_delete_elem(map, &i);
+		__fd_array_map_delete_elem(map, &i, need_defer);
 }

 static void prog_array_map_seq_show_elem(struct bpf_map *map, void *key,
@@ -1109,7 +1115,7 @@ static void prog_array_map_clear_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_array_aux,
 					   work)->map;
-	bpf_fd_array_map_clear(map);
+	bpf_fd_array_map_clear(map, true);
 	bpf_map_put(map);
 }

@@ -1239,8 +1245,9 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map,
 	return ee;
 }

-static void perf_event_fd_array_put_ptr(void *ptr)
+static void perf_event_fd_array_put_ptr(struct bpf_map *map, void *ptr, bool need_defer)
 {
+	/* bpf_perf_event is freed after one RCU grace period */
 	bpf_event_entry_free_rcu(ptr);
 }

@@ -1258,15 +1265,15 @@ static void perf_event_fd_array_release(struct bpf_map *map,
 	for (i = 0; i < array->map.max_entries; i++) {
 		ee = READ_ONCE(array->ptrs[i]);
 		if (ee && ee->map_file == map_file)
-			fd_array_map_delete_elem(map, &i);
+			__fd_array_map_delete_elem(map, &i, true);
 	}
 	rcu_read_unlock();
 }

 static void perf_event_fd_array_map_free(struct bpf_map *map)
 {
 	if (map->map_flags & BPF_F_PRESERVE_ELEMS)
-		bpf_fd_array_map_clear(map);
+		bpf_fd_array_map_clear(map, false);
 	fd_array_map_free(map);
 }

@@ -1294,15 +1301,15 @@ static void *cgroup_fd_array_get_ptr(struct bpf_map *map,
 	return cgroup_get_from_fd(fd);
 }

-static void cgroup_fd_array_put_ptr(void *ptr)
+static void cgroup_fd_array_put_ptr(struct bpf_map *map, void *ptr, bool need_defer)
 {
 	/* cgroup_put free cgrp after a rcu grace period */
 	cgroup_put(ptr);
 }

 static void cgroup_fd_array_free(struct bpf_map *map)
 {
-	bpf_fd_array_map_clear(map);
+	bpf_fd_array_map_clear(map, false);
 	fd_array_map_free(map);
 }

@@ -1347,7 +1354,7 @@ static void array_of_map_free(struct bpf_map *map)
 	 * is protected by fdget/fdput.
 	 */
 	bpf_map_meta_free(map->inner_map_meta);
-	bpf_fd_array_map_clear(map);
+	bpf_fd_array_map_clear(map, false);
 	fd_array_map_free(map);
 }
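The pattern in the call sites above: paths that can run while programs may still access the map (element update or delete, the deferred prog_array clearing) pass need_defer == true, whereas paths that only run once the map itself is unreachable (the various *_map_free() callbacks) pass false. A hypothetical userspace sequence, with an illustrative fd name:

/* Illustrative userspace view; outer_map_fd is hypothetical. */
__u32 key = 0;

/* Outer map still live, programs may hold pointers into it:
 * fd_array_map_delete_elem() -> __fd_array_map_delete_elem(..., true).
 */
bpf_map_delete_elem(outer_map_fd, &key);

/* Last reference dropped, no program can reach the map anymore:
 * array_of_map_free() -> bpf_fd_array_map_clear(map, false).
 */
close(outer_map_fd);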

kernel/bpf/core.c

Lines changed: 4 additions & 0 deletions

@@ -2664,12 +2664,16 @@ void __bpf_free_used_maps(struct bpf_prog_aux *aux,
 			  struct bpf_map **used_maps, u32 len)
 {
 	struct bpf_map *map;
+	bool sleepable;
 	u32 i;

+	sleepable = aux->sleepable;
 	for (i = 0; i < len; i++) {
 		map = used_maps[i];
 		if (map->ops->map_poke_untrack)
 			map->ops->map_poke_untrack(map, aux);
+		if (sleepable)
+			atomic64_dec(&map->sleepable_refcnt);
 		bpf_map_put(map);
 	}
 }

kernel/bpf/hashtab.c

Lines changed: 3 additions & 3 deletions

@@ -897,7 +897,7 @@ static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)

 	if (map->ops->map_fd_put_ptr) {
 		ptr = fd_htab_map_get_ptr(map, l);
-		map->ops->map_fd_put_ptr(ptr);
+		map->ops->map_fd_put_ptr(map, ptr, true);
 	}
 }

@@ -2484,7 +2484,7 @@ static void fd_htab_map_free(struct bpf_map *map)
 		hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
 			void *ptr = fd_htab_map_get_ptr(map, l);

-			map->ops->map_fd_put_ptr(ptr);
+			map->ops->map_fd_put_ptr(map, ptr, false);
 		}
 	}

@@ -2525,7 +2525,7 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,

 	ret = htab_map_update_elem(map, key, &ptr, map_flags);
 	if (ret)
-		map->ops->map_fd_put_ptr(ptr);
+		map->ops->map_fd_put_ptr(map, ptr, false);

 	return ret;
 }

kernel/bpf/helpers.c

Lines changed: 8 additions & 5 deletions

@@ -32,12 +32,13 @@
  *
  * Different map implementations will rely on rcu in map methods
  * lookup/update/delete, therefore eBPF programs must run under rcu lock
- * if program is allowed to access maps, so check rcu_read_lock_held in
- * all three functions.
+ * if program is allowed to access maps, so check rcu_read_lock_held() or
+ * rcu_read_lock_trace_held() in all three functions.
  */
 BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
 {
-	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
+		     !rcu_read_lock_bh_held());
 	return (unsigned long) map->ops->map_lookup_elem(map, key);
 }

@@ -53,7 +54,8 @@ const struct bpf_func_proto bpf_map_lookup_elem_proto = {
 BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
 	   void *, value, u64, flags)
 {
-	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
+		     !rcu_read_lock_bh_held());
 	return map->ops->map_update_elem(map, key, value, flags);
 }

@@ -70,7 +72,8 @@ const struct bpf_func_proto bpf_map_update_elem_proto = {

 BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
 {
-	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
+		     !rcu_read_lock_bh_held());
 	return map->ops->map_delete_elem(map, key);
 }
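For context on why rcu_read_lock_trace_held() had to be added (patch #1): a sleepable program runs under the tasks trace RCU read lock rather than rcu_read_lock(), so the old check warned spuriously in interpreter mode. A simplified sketch of that calling environment, with an illustrative function name:

/* Simplified, illustrative run path of a sleepable BPF program. */
static u32 run_sleepable_prog(const struct bpf_prog *prog, const void *ctx)
{
	u32 ret;

	rcu_read_lock_trace();		/* not rcu_read_lock() */
	ret = bpf_prog_run(prog, ctx);	/* may call bpf_map_lookup_elem() */
	rcu_read_unlock_trace();
	return ret;
}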

kernel/bpf/map_in_map.c

Lines changed: 13 additions & 4 deletions

@@ -127,12 +127,21 @@ void *bpf_map_fd_get_ptr(struct bpf_map *map,
 	return inner_map;
 }

-void bpf_map_fd_put_ptr(void *ptr)
+void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool need_defer)
 {
-	/* ptr->ops->map_free() has to go through one
-	 * rcu grace period by itself.
+	struct bpf_map *inner_map = ptr;
+
+	/* Defer the freeing of inner map according to the sleepable attribute
+	 * of bpf program which owns the outer map, so unnecessary waiting for
+	 * RCU tasks trace grace period can be avoided.
 	 */
-	bpf_map_put(ptr);
+	if (need_defer) {
+		if (atomic64_read(&map->sleepable_refcnt))
+			WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
+		else
+			WRITE_ONCE(inner_map->free_after_rcu_gp, true);
+	}
+	bpf_map_put(inner_map);
 }

 u32 bpf_map_fd_sys_lookup_elem(void *ptr)
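A hypothetical userspace sequence that exercises this path: replacing a slot of an outer map makes the kernel put the old inner map with need_defer == true, and the outer map's sleepable_refcnt decides which flag is set on the inner map. Fd names are illustrative:

/* Illustrative: replace slot 0 of an outer array-of-maps. */
__u32 key = 0;
int new_inner_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "inner",
				  sizeof(__u32), sizeof(__u32), 1, NULL);

/* The displaced inner map gets free_after_mult_rcu_gp if any sleepable
 * program owns the outer map, else free_after_rcu_gp, and is then put.
 */
bpf_map_update_elem(outer_map_fd, &key, &new_inner_fd, BPF_ANY);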

kernel/bpf/map_in_map.h

Lines changed: 1 addition & 1 deletion

@@ -13,7 +13,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd);
 void bpf_map_meta_free(struct bpf_map *map_meta);
 void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file,
 			 int ufd);
-void bpf_map_fd_put_ptr(void *ptr);
+void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool need_defer);
 u32 bpf_map_fd_sys_lookup_elem(void *ptr);

 #endif

kernel/bpf/syscall.c

Lines changed: 35 additions & 5 deletions

@@ -719,6 +719,28 @@ static void bpf_map_put_uref(struct bpf_map *map)
 	}
 }

+static void bpf_map_free_in_work(struct bpf_map *map)
+{
+	INIT_WORK(&map->work, bpf_map_free_deferred);
+	/* Avoid spawning kworkers, since they all might contend
+	 * for the same mutex like slab_mutex.
+	 */
+	queue_work(system_unbound_wq, &map->work);
+}
+
+static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
+{
+	bpf_map_free_in_work(container_of(rcu, struct bpf_map, rcu));
+}
+
+static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
+{
+	if (rcu_trace_implies_rcu_gp())
+		bpf_map_free_rcu_gp(rcu);
+	else
+		call_rcu(rcu, bpf_map_free_rcu_gp);
+}
+
 /* decrement map refcnt and schedule it for freeing via workqueue
  * (underlying map implementation ops->map_free() might sleep)
  */
@@ -728,11 +750,14 @@ void bpf_map_put(struct bpf_map *map)
 		/* bpf_map_free_id() must be called first */
 		bpf_map_free_id(map);
 		btf_put(map->btf);
-		INIT_WORK(&map->work, bpf_map_free_deferred);
-		/* Avoid spawning kworkers, since they all might contend
-		 * for the same mutex like slab_mutex.
-		 */
-		queue_work(system_unbound_wq, &map->work);
+
+		WARN_ON_ONCE(atomic64_read(&map->sleepable_refcnt));
+		if (READ_ONCE(map->free_after_mult_rcu_gp))
+			call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp);
+		else if (READ_ONCE(map->free_after_rcu_gp))
+			call_rcu(&map->rcu, bpf_map_free_rcu_gp);
+		else
+			bpf_map_free_in_work(map);
 	}
 }
 EXPORT_SYMBOL_GPL(bpf_map_put);
@@ -5323,6 +5348,11 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
 		goto out_unlock;
 	}

+	/* The bpf program will not access the bpf map, but for the sake of
+	 * simplicity, increase sleepable_refcnt for sleepable program as well.
+	 */
+	if (prog->aux->sleepable)
+		atomic64_inc(&map->sleepable_refcnt);
 	memcpy(used_maps_new, used_maps_old,
 	       sizeof(used_maps_old[0]) * prog->aux->used_map_cnt);
 	used_maps_new[prog->aux->used_map_cnt] = map;
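For intuition, the callback chain above behaves like the following blocking sketch (illustrative only; the real code uses call_rcu()/call_rcu_tasks_trace() because bpf_map_put() must not sleep):

/* Blocking-equivalent sketch of the deferred free; illustrative name. */
static void bpf_map_free_blocking_sketch(struct bpf_map *map)
{
	if (READ_ONCE(map->free_after_mult_rcu_gp)) {
		synchronize_rcu_tasks_trace();	/* sleepable progs are done */
		/* skip the extra GP when a tasks trace GP implies an RCU GP */
		if (!rcu_trace_implies_rcu_gp())
			synchronize_rcu();	/* non-sleepable progs are done */
	} else if (READ_ONCE(map->free_after_rcu_gp)) {
		synchronize_rcu();
	}
	bpf_map_free_in_work(map);	/* ops->map_free() may sleep */
}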

kernel/bpf/verifier.c

Lines changed: 3 additions & 1 deletion

@@ -17889,10 +17889,12 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
 				return -E2BIG;
 			}

+			if (env->prog->aux->sleepable)
+				atomic64_inc(&map->sleepable_refcnt);
 			/* hold the map. If the program is rejected by verifier,
 			 * the map will be released by release_maps() or it
 			 * will be used by the valid program until it's unloaded
-			 * and all maps are released in free_used_maps()
+			 * and all maps are released in bpf_free_used_maps()
 			 */
 			bpf_map_inc(map);
