
Bpf/optimized usdt ci #8814


Draft
wants to merge 225 commits into base: bpf-next_base from bpf/optimized_usdt_ci

Conversation

olsajiri
Contributor

test

Peter Zijlstra and others added 30 commits April 8, 2025 20:55
avoid merge conflicts with urgent

Signed-off-by: Peter Zijlstra <[email protected]>
Ravi reported that the bpf_perf_link_attach() usage of
perf_event_set_bpf_prog() is not serialized by ctx->mutex, unlike the
PERF_EVENT_IOC_SET_BPF case.

Reported-by: Ravi Bangoria <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Currently perf_event_release_kernel() will iterate the child events and attempt
tear-down. However, it removes them from the child_list using list_move(),
notably skipping the state management done by perf_child_detach().

Crucially, it fails to clear PERF_ATTACH_CHILD, which opens the door for a
concurrent perf_remove_from_context() to race.

This way child_list management stays fully serialized using child_mutex.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Simplify the code by moving the duplicated wakeup condition into
put_ctx().

Notably, wait_var_event() is in perf_event_free_task() and will have
set ctx->task = TASK_TOMBSTONE.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
There is no good reason to have the free list anymore. It is possible
to call free_event() after the locks have been dropped in the main
loop.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Both perf_event_free_task() and perf_event_exit_task_context() are
very similar, except perf_event_exit_task_context() is a little more
generic / makes fewer assumptions.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
The task passed to perf_event_exit_task() is not a child, it is
current. Fix this confusing naming, since much of the rest of the code
also relies on it being current.

Specifically, both exec() and exit() callers use it with current as
the argument.

Notably, task_ctx_sched_out() doesn't make much sense outside of
current.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Previously it was only safe to call perf_pmu_unregister() if there
were no active events of that pmu around -- which was impossible to
guarantee since it races in all sorts of ways against perf_init_event().

Rework the whole thing by:

 - keeping track of all events for a given pmu

 - 'hiding' the pmu from perf_init_event()

 - waiting for the appropriate (s)rcu grace periods such that all
   prior references to the PMU will be completed

 - detaching all still existing events of that pmu (see first point)
   and moving them to a new REVOKED state.

 - actually freeing the pmu data.

Where notably the new REVOKED state must inhibit all event actions
from reaching code that wants to use event->pmu.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ravi Bangoria <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
More and more features require a dynamic event constraint, e.g., branch
counter logging, auto counter reload, Arch PEBS, etc.

Add a generic flag, PMU_FL_DYN_CONSTRAINT, to indicate the case. It
avoids having to keep adding individual flags in intel_cpuc_prepare().

Add a variable dyn_constraint in the struct hw_perf_event to track the
dynamic constraint of the event. Apply it if it's updated.

Apply the generic dynamic constraint for branch counter logging.
Many features on and after V6 require dynamic constraint. So
unconditionally set the flag for V6+.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Thomas Falcon <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
When a machine supports PEBS v6, perf unconditionally searches the
cpuc->event_list[] for every event and checks if the late setup is
required, which is unnecessary.

The late setup is only required for special events, e.g., events that
support the counters snapshotting feature. Add n_late_setup to track the
number of events that need the late setup.

Other features, e.g., the auto counter reload feature, require the late
setup as well. Add a wrapper, intel_pmu_pebs_late_setup, for the events
that support the counters snapshotting feature.
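
A minimal sketch of the early-return idea (illustrative only; the struct
layout and function signatures here are assumptions, not the actual patch):

  /* Skip the cpuc->event_list[] walk entirely when no event on this CPU
   * needs the late setup. */
  static void late_setup_sketch(struct cpu_hw_events *cpuc)
  {
          if (!cpuc->n_late_setup)
                  return;

          intel_pmu_pebs_late_setup(cpuc);
  }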

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Thomas Falcon <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
The auto counter reload feature requires an event flag to indicate an
auto counter reload group, which can only be scheduled on specific
counters that are enumerated in CPUID. However, hw_perf_event.flags has
run out of bits on X86.

Two solutions were considered to address the issue.
- Currently, 20 bits are reserved for the architecture-specific flags.
  Only the bit 31 is used for the generic flag. There is still plenty
  of space left. Reserve 8 more bits for the arch-specific flags.
- Add a new X86 specific hw_perf_event.flags1 to support more flags.

The former is implemented. Enough room is still left in the global
generic flag.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Thomas Falcon <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
The counters that support the auto counter reload feature can be
enumerated in the CPUID Leaf 0x23 sub-leaf 0x2.

Add acr_cntr_mask to store the mask of counters which are reloadable.
Add acr_cause_mask to store the mask of counters which can cause reload.
Since the e-core and p-core may have different numbers of counters,
track the masks in the struct x86_hybrid_pmu as well.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Thomas Falcon <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
The relative rates among two or more events are useful for performance
analysis, e.g., a high branch miss rate may indicate a performance
issue. Usually, the samples with a relative rate that exceeds some
threshold are more useful. However, traditional sampling takes
samples of events separately. To get the relative rates among two or
more events, a high sample rate is required, which can bring high
overhead. Many samples taken in non-hotspot areas are also dropped
as useless in post-processing.

The auto counter reload (ACR) feature takes samples when the relative
rate of two or more events exceeds some threshold, which provides the
fine-grained information at a low cost.
To support the feature, two sets of MSRs are introduced. For a given
counter IA32_PMC_GPn_CTR/IA32_PMC_FXm_CTR, bit fields in the
IA32_PMC_GPn_CFG_B/IA32_PMC_FXm_CFG_B MSR indicate which counter(s)
can cause a reload of that counter. The reload value is stored in the
IA32_PMC_GPn_CFG_C/IA32_PMC_FXm_CFG_C.
The details can be found at Intel SDM (085), Volume 3, 21.9.11 Auto
Counter Reload.

In the hw_config(), an ACR event is specially configured, because the
cause/reloadable counter mask has to be applied to the dyn_constraint.
Besides the HW limits (e.g., no support for perf metrics, PDist, etc.), a
SW limit is applied as well: ACR events in a group must be contiguous.
It facilitates the later conversion from the event idx to the counter
idx. Otherwise, intel_pmu_acr_late_setup() has to traverse the whole
event list again to find the "cause" event.
Also, add a new flag PERF_X86_EVENT_ACR to indicate an ACR group, which
is set on the group leader.

The late setup() is also required for an ACR group. It converts the
event idx to the counter idx and saves it in hw.config1.

The ACR configuration MSRs are only updated in the enable_event().
The disable_event() doesn't clear the ACR CFG register.
Add acr_cfg_b/acr_cfg_c in the struct cpu_hw_events to cache the MSR
values. This avoids an MSR write if the value is not changed.

Expose an acr_mask to the sysfs. The perf tool can utilize the new
format to configure the relation of events in the group. The bit
sequence of the acr_mask follows the enabled order of the events in
the group.

Example:

Here is a snippet of mispredict.c. Since the array contains random
numbers, the jumps are random and often mispredicted.
The misprediction rate depends on the compared value.

For Loop 1, ~11% of all branches are mispredicted.
For Loop 2, ~21% of all branches are mispredicted.

main()
{
...
        for (i = 0; i < N; i++)
                data[i] = rand() % 256;
...
        /* Loop 1 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 64)
                                sum += data[i];
...

...
        /* Loop 2 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 128)
                                sum += data[i];
...
}

Usually, code with a high branch miss rate performs badly.
To understand the branch miss rate of the code, the traditional method
samples both the branches and branch-misses events. E.g.,
perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
               -c 1000000 -- ./mispredict

[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
The 5106 samples are from both events and spread across both loops.
In the post-processing stage, a user can see that Loop 2 has a 21%
branch miss rate. Then they can focus on the samples of the branch-misses
event for Loop 2.

With this patch, the user can generate the samples only when the branch
miss rate > 20%. For example,
perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
                 cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
                -- ./mispredict

(Two different periods are applied to branch-misses and
branch-instructions. The ratio is set to 20%.
If branch-instructions overflows first, the branch miss rate is < 20%.
No samples should be generated. All counters should be
automatically reloaded.
If branch-misses overflows first, the branch miss rate is > 20%.
A sample triggered by the branch-misses event should be
generated. Just the counter of branch-instructions should be
automatically reloaded.

The branch-misses event should only be automatically reloaded when
branch-instructions overflows. So the "cause" event is the
branch-instructions event. The acr_mask is set to 0x2, since the
event index of branch-instructions in the group is 1.

The branch-instructions event is automatically reloaded no matter which
event overflows. So the "cause" events are the branch-misses
and the branch-instructions events. The acr_mask should be set to 0x3.)

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]

 $ perf report

Percent       │154:   movl    $0x0,-0x14(%rbp)
              │     ↓ jmp     1af
              │     for (i = j; i < N; i++)
              │15d:   mov     -0x10(%rbp),%eax
              │       mov     %eax,-0x18(%rbp)
              │     ↓ jmp     1a2
              │     if (data[i] >= 128)
              │165:   mov     -0x18(%rbp),%eax
              │       cltq
              │       lea     0x0(,%rax,4),%rdx
              │       mov     -0x8(%rbp),%rax
              │       add     %rdx,%rax
              │       mov     (%rax),%eax
              │    ┌──cmp     $0x7f,%eax
100.00   0.00 │    ├──jle     19e
              │    │sum += data[i];

The 2498 samples are all from the branch-misses event for Loop 2.

The number of samples and overhead is significantly reduced without
losing any information.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Thomas Falcon <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Rename struct bts_buffer objects from 'buf' to 'bb' to improve the
readability when accessing the structure's 'buf' member. For example,
'buf->buf[]' becomes 'bb->buf[]'.

Indent line 327 using tabs to silence a checkpatch warning.

No functional changes intended.

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Thorsten Blum <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
…owerPC platforms

This change alters the PowerPC and x86 driver implementations to record
the last sample period before the event is updated for the next period.

A common pattern in PMU driver implementations is to have a
"*_event_set_period" function which takes care of updating the various
period-related fields in a perf_event structure. In most cases, the
drivers choose to call this function after initializing a sample data
structure with perf_sample_data_init. The x86 and PowerPC drivers
deviate from this, choosing to update the period before initializing the
sample data. When using an event with an alternate sample period, this
causes an incorrect period to be written to the sample data that gets
reported to userspace.
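
A hedged sketch of the corrected ordering in a driver's overflow handler
(illustrative only; x86_perf_event_set_period() stands in for the driver's
"*_event_set_period" helper and the real patch may handle this differently):

  u64 last_period;
  struct perf_sample_data data;

  /* Capture the period the just-overflowed counter was running with,
   * before the set_period helper programs the next one. */
  last_period = event->hw.last_period;
  x86_perf_event_set_period(event);

  /* Report the captured period, not the newly programmed one. */
  perf_sample_data_init(&data, 0, last_period);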

Signed-off-by: Mark Barnett <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Use struct_size() to calculate the number of bytes to allocate for a new
bts_buffer. Compared to offsetof(), struct_size() provides additional
compile-time checks (e.g., __must_be_array()).
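
A hedged sketch of the pattern (the struct below is a stand-in for
illustration, not the real struct bts_buffer):

  #include <linux/overflow.h>
  #include <linux/slab.h>

  struct example_buffer {
          int             nr_bufs;
          struct example_desc {
                  void            *base;
                  unsigned long   size;
          } buf[];                        /* flexible array member */
  };

  static struct example_buffer *example_alloc(int nr_bufs)
  {
          struct example_buffer *bb;

          /* Before: kzalloc(offsetof(struct example_buffer, buf[nr_bufs]), ...)
           * After:  struct_size() additionally checks that 'buf' really is an
           *         array and guards the size calculation against overflow. */
          bb = kzalloc(struct_size(bb, buf, nr_bufs), GFP_KERNEL);
          if (bb)
                  bb->nr_bufs = nr_bufs;
          return bb;
  }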

Signed-off-by: Thorsten Blum <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
…AX_ORDER

When calling alloc_contig_range() with __GFP_COMP and the order of the
requested pfn range is pageblock_order (less than MAX_ORDER), I triggered
the following WARNING:

 PFN range: requested [2150105088, 2150105600), allocated [2150105088, 2150106112)
 WARNING: CPU: 3 PID: 580 at mm/page_alloc.c:6877 alloc_contig_range+0x280/0x340

alloc_contig_range() marks the pageblocks of the requested pfn range as
isolated, migrates these pages if they are in use, and frees them to the
MIGRATE_ISOLATE freelist.

Suppose two alloc_contig_range() calls happen at the same time and the
requested pfn ranges are [0x80280000, 0x80280200) and [0x80280200, 0x80280400)
respectively.  Suppose the two memory ranges are in use; then
alloc_contig_range() will migrate and free these pages to the MIGRATE_ISOLATE
freelist.  __free_one_page() will merge the MIGRATE_ISOLATE buddies into a
larger buddy, resulting in a MAX_ORDER buddy.  Finally, find_large_buddy() in
alloc_contig_range() returns a MAX_ORDER buddy and triggers the WARNING.

To fix it, call free_contig_range() to free the excess pfn range.
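
A minimal sketch of the idea (variable names are illustrative, not the
actual diff):

  /* The buddy handed back a larger block than requested; return the
   * excess tail pages instead of tripping the WARNING. */
  if (alloced_end > requested_end)
          free_contig_range(requested_end, alloced_end - requested_end);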

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e98337d ("mm/contig_alloc: support __GFP_COMP")
Signed-off-by: Jinjiang Tu <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
When the last page in the zone is accepted, __accept_page() calls
static_branch_dec().  This function takes cpu_hotplug_lock, which can lead
to a deadlock if the allocation occurs during the CPU bringup path as
_cpu_up() also takes the lock.

To prevent this deadlock, defer static_branch_dec() to a workqueue.

Call static_branch_dec() directly only when the workqueue is not yet
initialized.  Workqueues are initialized before CPU bringup, so this will
not conflict with the first scenario.
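
A sketch of the deferral described above (names and placement are
assumptions for illustration, not the actual patch):

  #include <linux/jump_label.h>
  #include <linux/workqueue.h>

  static DEFINE_STATIC_KEY_FALSE(zones_with_unaccepted_pages);

  static void unaccepted_dec_work_fn(struct work_struct *work)
  {
          static_branch_dec(&zones_with_unaccepted_pages);
  }
  static DECLARE_WORK(unaccepted_dec_work, unaccepted_dec_work_fn);

  static void last_page_accepted(void)
  {
          /* Workqueues come up before CPU bringup, so a direct call at
           * that early point cannot deadlock with _cpu_up(). */
          if (system_wq)
                  schedule_work(&unaccepted_dec_work);
          else
                  static_branch_dec(&zones_with_unaccepted_pages);
  }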

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 55ad43e ("mm: add a helper to accept page")
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Srikanth Aithal <[email protected]>
Tested-by: Srikanth Aithal <[email protected]>
Cc: Ashish Kalra <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Edgecombe, Rick P" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Mike Rapoport (IBM)" <[email protected]>
Cc: Thomas Lendacky <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
…pages

Alison reports an issue with fsdax when large extents end up using large
ZONE_DEVICE folios:

[  417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
[  417.796982] #PF: supervisor read access in kernel mode
[  417.797540] #PF: error_code(0x0000) - not-present page
[  417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
[  417.798690] Oops: Oops: 0000 [kernel-patches#1] SMP NOPTI
[  417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ...
[  417.800150] Tainted: [O]=OOT_MODULE
[  417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
[  417.801948] Code: ...
[  417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
[  417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[  417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
[  417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
[  417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
[  417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
[  417.807801] FS:  00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
[  417.808570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
[  417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  417.811353] Call Trace:
[  417.811709]  <TASK>
[  417.812038]  folio_add_file_rmap_ptes+0x143/0x230
[  417.812566]  insert_page_into_pte_locked+0x1ee/0x3c0
[  417.813132]  insert_page+0x78/0xf0
[  417.813558]  vmf_insert_page_mkwrite+0x55/0xa0
[  417.814088]  dax_fault_iter+0x484/0x7b0
[  417.814542]  dax_iomap_pte_fault+0x1ca/0x620
[  417.815055]  dax_iomap_fault+0x39/0x40
[  417.815499]  __xfs_write_fault+0x139/0x380
[  417.815995]  ? __handle_mm_fault+0x5e5/0x1a60
[  417.816483]  xfs_write_fault+0x41/0x50
[  417.816966]  xfs_filemap_fault+0x3b/0xe0
[  417.817424]  __do_fault+0x31/0x180
[  417.817859]  __handle_mm_fault+0xee1/0x1a60
[  417.818325]  ? debug_smp_processor_id+0x17/0x20
[  417.818844]  handle_mm_fault+0xe1/0x2b0
[...]

The issue is that when we split a large ZONE_DEVICE folio to order-0 ones,
we don't reset the order/_nr_pages.  As folio->_nr_pages overlays
page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it
has folio->memcg_data set.  And we never manually initialize
folio->memcg_data in fsdax code, because we never expect it to be set at
all.

When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries
to use folio->memcg_data (because it's non-NULL) but it does not actually
point at a memcg, resulting in the problem.

Alison also observed that these folios sometimes have "locked" set, which
is rather concerning (folios locked from the beginning ...).  The reason
is that the order for large folios is stored in page[1]->flags, which
become the folio->flags of a new small folio.

Let's fix it by adding a folio helper to clear order/_nr_pages for
splitting purposes.

Maybe we should reinitialize other large folio flags / folio members as
well when splitting, because they might similarly cause harm once page[1]
becomes a folio?  At least other flags in PAGE_FLAGS_SECOND should not be
set for fsdax, so at least page[1]->flags might be as expected with this
fix.

From a quick glance, initializing ->mapping, ->pgmap and ->share should
re-initialize most things from a previous page[1] used by large folios
that fsdax cares about.  For example folio->private might not get
reinitialized, but maybe that's not relevant -- there are no traces of its
use in fsdax code.  Needs a closer look.

Another thing that should be considered in the future is performing
similar checks as we perform in free_tail_page_prepare()
-- checking pincount etc.
-- when freeing a large fsdax folio.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4996fc5 ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
Fixes: 38607c6 ("fs/dax: properly refcount fs dax pages")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: Alison Schofield <[email protected]>
Closes: https://lkml.kernel.org/r/[email protected]
Tested-by: Alison Schofield <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
With permission, reduce the number of maintainers.  Create a CREDITS entry
for Joonsoo (Pekka already has one).  Thanks for all the work!

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Harry Yoo <[email protected]>
Acked-by: Christoph Lameter (Ampere) <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Add a subsection for the page allocator, including compaction as it's
crucial for high-order allocations and works together with the
anti-fragmentation features.  Add reviewers (including myself) who
volunteered.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Brendan Jackman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Harry Yoo <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
commit 73f839b addressed an issue regarding the swap counter leak
that occurred from an offline cgroup.  However, commit 89ce924
modified the parameter from @swap_memcg to @memcg (presumably this
alteration was introduced while resolving conflicts).  Fix this problem by
reverting this minor change.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 89ce924 ("mm: memcontrol: move memsw charge callbacks to v1")
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
MEMORY MAPPING does not list the mmap.h trace point file, but does list
the mmap.c file.  Couple the trace points with the users and authors of
the trace points for notifications of updates.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Jann Horn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The madvise code straddles both VMA and page table manipulation.  As a
result, separate it out into its own section and add maintainers/reviewers
as appropriate.

We additionally include the mman-common.h file as this contains the shared
madvise flags and it is important we maintain this alongside madvise.c.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: Liam R. Howlett <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Jann Horn <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
…ble()

Unlike fault_in_readable() or fault_in_writeable(), in
fault_in_safe_writeable() the local variable 'start' is increased page by
page to loop until the whole address range is handled.  However, it
mistakenly calculates the size of the handled range as 'uaddr - start'.

Fix it here.

Andreas said:

: In gfs2, fault_in_iov_iter_writeable() is used in
: gfs2_file_direct_read() and gfs2_file_read_iter(), so this potentially
: affects buffered as well as direct reads.  This bug could cause those
: gfs2 functions to spin in a loop.
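
The shape of the fix, sketched (illustrative only; the function returns the
number of bytes that were not faulted in):

  /* 'start' has been advanced page by page from 'uaddr'; the amount
   * actually handled is therefore start - uaddr, not uaddr - start,
   * which wraps around once start has advanced past uaddr. */
  if (size > start - (unsigned long)uaddr)
          return size - (start - (unsigned long)uaddr);
  return 0;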

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Fixes: fe673d3 ("mm: gup: make fault_in_safe_writeable() use fixup_user_fault()")
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
Cc: Yanjun.Zhu <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
During our testing with the hugetlb subpool enabled, we observed that
hstate->resv_huge_pages may underflow into negative values.  Root cause
analysis reveals a race condition in the subpool reservation fallback
handling as follows:

hugetlb_reserve_pages()
    /* Attempt subpool reservation */
    gbl_reserve = hugepage_subpool_get_pages(spool, chg);

    /* Global reservation may fail after subpool allocation */
    if (hugetlb_acct_memory(h, gbl_reserve) < 0)
        goto out_put_pages;

out_put_pages:
    /* This incorrectly restores reservation to subpool */
    hugepage_subpool_put_pages(spool, chg);

When hugetlb_acct_memory() fails after subpool allocation, the current
implementation over-commits subpool reservations by returning the full
'chg' value instead of the actual allocated 'gbl_reserve' amount.  This
discrepancy propagates to global reservations during subsequent releases,
eventually causing resv_huge_pages underflow.

This problem can be triggered easily with the following steps:
1. reserve hugepages for hugetlb allocation
2. mount hugetlbfs with min_size to enable the hugetlb subpool
3. allocate hugepages with two tasks (make sure the second fails due to
   an insufficient amount of hugepages)
4. wait for a few seconds and repeat step 3, which will make
   hstate->resv_huge_pages go below zero.

To fix this problem, return the correct amount of pages to the subpool
during the fallback after hugepage_subpool_get_pages() is called.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1c5ecae ("hugetlbfs: add minimum size accounting to subpools")
Signed-off-by: Wupeng Ma <[email protected]>
Tested-by: Joshua Hahn <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ma Wupeng <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The Microsoft email address is bouncing:

    550 5.4.1 Recipient address rejected: Access denied.

So let's replace it with Matteo's current mail address.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ahmad Fatoum <[email protected]>
Acked-by: Matteo Croce <[email protected]>
Link: https://lore.kernel.org/all/BYAPR15MB2504E4B02DFFB1E55871955DA1062@BYAPR15MB2504.namprd15.prod.outlook.com/
Cc: Daniel Lezcano <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Matteo Croce <[email protected]>
Cc: Sascha Hauer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
inode_to_wb() is used also for filesystems that don't support cgroup
writeback.  For these filesystems inode->i_wb is stable during the
lifetime of the inode (it points to bdi->wb) and there's no need to hold
locks protecting the inode->i_wb dereference.  Improve the warning in
inode_to_wb() to not trigger for these filesystems.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: aaa2cac ("writeback: add lockdep annotation to inode_to_wb()")
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Andreas Gruenbacher <[email protected]>
Reviewed-by: Andreas Gruenbacher <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
commit 4eeec8c ("mm: move hugetlb specific things in folio to
page[3]") shifted hugetlb specific stuff, and now mapping overlaps
_hugetlb_cgroup field.

Upon restoring the vmemmap for HVO, only the first two tail pages are
reset, and this causes the check in free_tail_page_prepare() to fail as it
finds an unexpected mapping value in some tails.

Increment the number of pages to be reset to 4 (head + 3 tail pages)

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4eeec8c ("mm: move hugetlb specific things in folio to page[3]")
Signed-off-by: Oscar Salvador <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
…ount stabilization

In __folio_remove_rmap() for RMAP_LEVEL_PMD/RMAP_LEVEL_PUD and with
CONFIG_PAGE_MAPCOUNT we first decrement the folio mapcount (and recompute
mapped shared vs.  mapped exclusively) to then adjust the entire mapcount.

This means that another process might stumble in do_wp_page() over a
PTE-mapped PMD folio that is indicated as "exclusively mapped", but still
has an entire mapcount (PMD mapping), because it is racing with the
process that is unmapping the folio (PMD mapping).  Note that do_wp_page()
will back off once it detects the remaining folio reference from the
process that is in the process of unmapping the folio.

This will trigger the early VM_WARN_ON_ONCE(folio_entire_mapcount(folio))
check in do_wp_page(), that can easily be reproduced by looping a couple
of times over allocating a PMD THP, forking a child where we immediately
unmap it again, and writing in the parent concurrently to the THP.

[  252.738129][T16470] ------------[ cut here ]------------
[  252.739267][T16470] WARNING: CPU: 3 PID: 16470 at mm/memory.c:3738 do_wp_page+0x2a75/0x2c00
[  252.740968][T16470] Modules linked in:
[  252.741958][T16470] CPU: 3 UID: 0 PID: 16470 Comm: ...
...
[  252.765841][T16470]  <TASK>
[  252.766419][T16470]  ? srso_alias_return_thunk+0x5/0xfbef5
[  252.767558][T16470]  ? rcu_is_watching+0x12/0x60
[  252.768525][T16470]  ? srso_alias_return_thunk+0x5/0xfbef5
[  252.769645][T16470]  ? srso_alias_return_thunk+0x5/0xfbef5
[  252.770778][T16470]  ? lock_acquire+0x33/0x80
[  252.771697][T16470]  ? __handle_mm_fault+0x5e8/0x3e40
[  252.772735][T16470]  ? __handle_mm_fault+0x5e8/0x3e40
[  252.773781][T16470]  __handle_mm_fault+0x1869/0x3e40
[  252.774839][T16470]  handle_mm_fault+0x22a/0x640
[  252.775808][T16470]  do_user_addr_fault+0x618/0x1000
[  252.776847][T16470]  exc_page_fault+0x68/0xd0
[  252.777775][T16470]  asm_exc_page_fault+0x26/0x30

While we could adjust the sequence in __folio_remove_rmap(), let's rather
move the mapcount sanity checks after the mapcount vs.  refcount
stabilization phase.  With this fix, a simple reproducer is happy.

While at it, convert the two VM_WARN_ON_ONCE() we are moving to
VM_WARN_ON_ONCE_FOLIO().

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1da190f ("mm: Copy-on-Write (COW) reuse support for PTE-mapped THP")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: [email protected]
Closes: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
olsajiri added 19 commits April 18, 2025 17:33
The uprobe_write_opcode function currently also updates the refctr offset
if there's one defined for the uprobe.

This is not handy for the following changes, which need to make several
updates (writes) to install or remove an uprobe, but update the refctr
offset just once.

Adding set_swbp_refctr/set_orig_refctr, which make sure the refctr offset
is updated.

Signed-off-by: Jiri Olsa <[email protected]>
Adding a uprobe_write function that does what uprobe_write_opcode did
so far, but allows passing a verify callback function that checks the
memory location before writing the opcode.

It will be used in the following changes to implement specific checking
logic for the instruction update.

uprobe_write_opcode now calls uprobe_write with verify_opcode as
the verify callback.

Signed-off-by: Jiri Olsa <[email protected]>
Adding an nbytes argument to uprobe_write and related functions as
preparation for writing whole instructions in the following changes.

Also renaming the opcode arguments to insn, which seems to fit better.

Signed-off-by: Jiri Olsa <[email protected]>
…code

uprobe_write has a special path to restore the original page when we
write the original instruction back. This happens when uprobe_write detects
that we want to write anything other than the breakpoint instruction.

Moving the detection out and passing it to uprobe_write as an argument,
so it's possible to write different instructions (other than just the
breakpoint and the restore).

Signed-off-by: Jiri Olsa <[email protected]>
Currently unapply_uprobe takes mmap_read_lock, but it might call
remove_breakpoint, which eventually changes user pages.

The current code writes either the breakpoint or the original instruction,
so it can probably get away with that, but with the upcoming change that
writes multiple instructions at the probed address we need to ensure
that any update to the mm's pages is exclusive.

Signed-off-by: Jiri Olsa <[email protected]>
Adding support to add a special mapping for the user space trampoline
with the following functions:

  uprobe_trampoline_get - find or add a uprobe_trampoline
  uprobe_trampoline_put - remove or destroy a uprobe_trampoline

The user space trampoline is exported as an arch specific user space special
mapping through tramp_mapping, which is initialized in the following changes
with the new uprobe syscall.

The uprobe trampoline needs to be callable/reachable from the probed address,
so while searching for an available address we use the is_reachable_by_call
function to decide if the uprobe trampoline is callable from the probed address.

All uprobe_trampoline objects are stored in the uprobes_state object and are
cleaned up when the process mm_struct goes down. Adding new arch hooks
for that, because this change is x86_64 specific.

Locking is provided by callers in the following changes.
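
For illustration, a user-space sketch of what "reachable by call" means on
x86_64 (a 5-byte CALL uses a signed 32-bit displacement relative to the end
of the instruction); this is a sketch of the idea, not the kernel's
is_reachable_by_call() implementation:

  #include <stdbool.h>
  #include <stdint.h>

  #define CALL_INSN_SIZE 5

  static bool reachable_by_call(uint64_t from, uint64_t to)
  {
          /* Displacement is relative to the address after the call insn. */
          int64_t delta = (int64_t)(to - (from + CALL_INSN_SIZE));

          return delta >= INT32_MIN && delta <= INT32_MAX;
  }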

Signed-off-by: Jiri Olsa <[email protected]>
Adding a new uprobe syscall that calls uprobe handlers for a given
'breakpoint' address.

The idea is that the 'breakpoint' address calls the user space
trampoline, which executes the uprobe syscall.

The syscall handler reads the return address of the initial call
to retrieve the original 'breakpoint' address. With this address
we find the related uprobe object and call its consumers.

Adding the arch_uprobe_trampoline_mapping function that provides the
uprobe trampoline mapping. This mapping is backed with one global
page initialized at __init time and shared by all the mapping
instances.

We do not allow executing the uprobe syscall if the caller is not
from the uprobe trampoline mapping.

The uprobe syscall ensures the consumer (bpf program) sees register
values in the state they were in before the trampoline was called.
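
A hedged sketch of the return-address trick described above (the helper name
is made up and the real trampoline layout may place the return address
differently; this only illustrates how the probed address can be recovered):

  /* Hypothetical helper: recover the probed ('breakpoint') address from
   * the user return address pushed by the 5-byte CALL at the probed site. */
  static int recover_bp_vaddr(struct pt_regs *regs, unsigned long *bp_vaddr)
  {
          unsigned long ret_addr;

          if (copy_from_user(&ret_addr, (void __user *)regs->sp, sizeof(ret_addr)))
                  return -EFAULT;

          *bp_vaddr = ret_addr - 5;       /* size of the call instruction */
          return 0;
  }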

Signed-off-by: Jiri Olsa <[email protected]>
Putting together all the previously added pieces to support optimized
uprobes on top of the 5-byte nop instruction.

The current uprobe execution goes through the following:

  - installs breakpoint instruction over original instruction
  - exception handler hit and calls related uprobe consumers
  - and either simulates original instruction or does out of line single step
    execution of it
  - returns to user space

The optimized uprobe path does the following:

  - checks the original instruction is a 5-byte nop (plus other checks)
  - adds (or uses an existing) user space trampoline with the uprobe syscall
  - overwrites the original instruction (5-byte nop) with a call to the user
    space trampoline
  - the user space trampoline executes the uprobe syscall that calls related
    uprobe consumers
  - the trampoline returns back to the next instruction

This approach won't speed up all uprobes as it's limited to using nop5 as
the original instruction, but we plan to use nop5 as the USDT probe
instruction (which currently uses a single byte nop) and speed up the USDT
probes.

The arch_uprobe_optimize triggers the uprobe optimization and is called after
the first uprobe hit. I originally had it called on uprobe installation, but
then it clashed with the elf loader, because the user space trampoline was
added in a place where the loader might need to put elf segments, so I decided
to do it after the first uprobe hit when loading is done.

The uprobe is un-optimized in the arch specific set_orig_insn call.

The instruction overwrite is x86 arch specific and needs to go through
3 updates (on top of the nop5 instruction):

  - write int3 into 1st byte
  - write last 4 bytes of the call instruction
  - update the call instruction opcode

And cleanup goes through similar reverse stages (see the encoding sketch
after this list):

  - overwrite call opcode with breakpoint (int3)
  - write last 4 bytes of the nop5 instruction
  - write the nop5 first instruction byte
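
For reference, a user-space sketch of the byte layout being written (the
staged int3-first sequence above is what the kernel patch relies on for
safety; this only shows how the final 5-byte call is encoded):

  #include <stdint.h>
  #include <string.h>

  #define CALL_INSN_SIZE   5
  #define CALL_INSN_OPCODE 0xe8          /* call rel32 */

  /* Encode the call that replaces the nop5 at 'vaddr', targeting the
   * uprobe trampoline at 'tramp'. */
  static void encode_call(uint8_t insn[CALL_INSN_SIZE], uint64_t vaddr, uint64_t tramp)
  {
          int32_t rel = (int32_t)(tramp - (vaddr + CALL_INSN_SIZE));

          insn[0] = CALL_INSN_OPCODE;
          memcpy(&insn[1], &rel, sizeof(rel));    /* little-endian rel32 */
  }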

We do not unmap and release the uprobe trampoline when it's no longer needed,
because there's no easy way to make sure none of the threads is still
inside the trampoline. But we do not waste memory, because there's just a
single page for all the uprobe trampoline mappings.

We do waste a page table frame for the trampoline mapping in every 4GB
region by keeping the uprobe trampoline page mapped, but that seems ok.

We take the benefit from the fact that set_swbp and set_orig_insn are
called under mmap_write_lock(mm), so we can use the current instruction
as the state the uprobe is in - nop5/breakpoint/call trampoline -
and decide the needed action (optimize/un-optimize) based on that.

Attaching the speedup from the benchs/run_bench_uprobes.sh script:

current:
        usermode-count :  152.604 ± 0.044M/s
        syscall-count  :   13.359 ± 0.042M/s
-->     uprobe-nop     :    3.229 ± 0.002M/s
        uprobe-push    :    3.086 ± 0.004M/s
        uprobe-ret     :    1.114 ± 0.004M/s
        uprobe-nop5    :    1.121 ± 0.005M/s
        uretprobe-nop  :    2.145 ± 0.002M/s
        uretprobe-push :    2.070 ± 0.001M/s
        uretprobe-ret  :    0.931 ± 0.001M/s
        uretprobe-nop5 :    0.957 ± 0.001M/s

after the change:
        usermode-count :  152.448 ± 0.244M/s
        syscall-count  :   14.321 ± 0.059M/s
        uprobe-nop     :    3.148 ± 0.007M/s
        uprobe-push    :    2.976 ± 0.004M/s
        uprobe-ret     :    1.068 ± 0.003M/s
-->     uprobe-nop5    :    7.038 ± 0.007M/s
        uretprobe-nop  :    2.109 ± 0.004M/s
        uretprobe-push :    2.035 ± 0.001M/s
        uretprobe-ret  :    0.908 ± 0.001M/s
        uretprobe-nop5 :    3.377 ± 0.009M/s

I see a bit more speedup on Intel (above) compared to AMD. The big nop5
speedup is partly due to emulating nop5 and partly due to the optimization.

The key speedup we do this for is the USDT switch from nop to nop5:
        uprobe-nop     :    3.148 ± 0.007M/s
        uprobe-nop5    :    7.038 ± 0.007M/s

Signed-off-by: Jiri Olsa <[email protected]>
Using a 5-byte nop for x86 usdt probes so we can switch them to
optimized uprobes.
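
An illustrative stand-in (not the actual usdt.h change) showing the 5-byte
nop this refers to, using the recommended x86_64 encoding
0f 1f 44 00 00 (nopl 0x0(%rax,%rax,1)):

  /* Emits the same 5-byte nop a probe site would contain, so the kernel
   * can later rewrite it into a 5-byte call to the uprobe trampoline. */
  static inline void usdt_nop5_example(void)
  {
          __asm__ volatile (".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");
  }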

Signed-off-by: Jiri Olsa <[email protected]>
Adding __test_uprobe_syscall with a non-x86_64 stub to execute all the tests,
so we don't need to keep adding non-x86_64 stub functions for new tests.

Signed-off-by: Jiri Olsa <[email protected]>
…multi

Renaming the uprobe_syscall_executed prog to test_uretprobe_multi
to fit properly with the following changes that add more programs.

Signed-off-by: Jiri Olsa <[email protected]>
Adding tests for optimized uprobe/usdt probes.

Checking that we get the expected trampoline and that the attached bpf
programs get executed properly.

Signed-off-by: Jiri Olsa <[email protected]>
Adding a test that makes sure parallel execution of the uprobe and
attach/detach of the optimized uprobe on it works properly.

Signed-off-by: Jiri Olsa <[email protected]>
Make sure that calling the uprobe syscall from outside the uprobe
trampoline results in a SIGILL signal.

Signed-off-by: Jiri Olsa <[email protected]>
Adding an optimized usdt variant of the basic usdt test to check that
usdt arguments are properly passed in the optimized code path.

Signed-off-by: Jiri Olsa <[email protected]>
Changing uretprobe_regs_trigger to allow testing both
uprobe and uretprobe and renaming it to uprobe_regs_equal.

We check that both uprobe and uretprobe probes (bpf programs)
see the expected registers, with a few exceptions.

Signed-off-by: Jiri Olsa <[email protected]>
…robe

Changing the test_uretprobe_regs_change test to test both uprobe
and uretprobe by adding an entry consumer handler to the testmod
and making it change one of the registers.

Making sure that the values changed by both uprobe and uretprobe handlers
propagate to user space.

Signed-off-by: Jiri Olsa <[email protected]>
Adding uprobe as another exception to the seccomp filter alongside
the uretprobe syscall.

Same as uretprobe, the uprobe syscall is installed by the kernel as a
replacement for the breakpoint exception; it is limited to the x86_64
arch and isn't expected to ever be supported on i386.

Cc: Kees Cook <[email protected]>
Cc: Eyal Birger <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Adding uprobe checks into the current uretprobe tests.

All the related tests are now executed with an attached uprobe
or uretprobe, or without any probe.

Renaming the test fixture to uprobe, because it seems better.

Cc: Kees Cook <[email protected]>
Cc: Eyal Birger <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
@olsajiri olsajiri force-pushed the bpf/optimized_usdt_ci branch from 77b2923 to 412abc6 on April 20, 2025 19:11
@kernel-patches-daemon-bpf kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 10 times, most recently from c7849f3 to a041a61 on April 25, 2025 16:28