author	Alexei Starovoitov <ast@kernel.org>	2025-12-23 22:30:00 +0300
committer	Alexei Starovoitov <ast@kernel.org>	2025-12-23 22:30:00 +0300
commit	f14cdb1367b947d373215e36cfe9c69768dbafc9 (patch)
tree	8cd00595092219b41495ed300923593d4a651cff /include/linux
parent	ac1c5bc7c4c7e20e2070e6eaa673fc3e11619dbb (diff)
parent	efecc9e825f4aa3fe616236152604a066a3e776d (diff)
Merge branch 'remove-kf_sleepable-from-arena-kfuncs'
Puranjay Mohan says:

====================
Remove KF_SLEEPABLE from arena kfuncs

V7: https://lore.kernel.org/all/20251222190815.4112944-1-puranjay@kernel.org/
Changes in v7->v8:
- Use clear_lo32(arena->user_vm_start) in place of user_vm_start in patch 3

V6: https://lore.kernel.org/all/20251217184438.3557859-1-puranjay@kernel.org/
Changes in v6->v7:
- Fix a deadlock in patch 1 that was previously being fixed in patch 2; move the fix to patch 1.
- Call flush_cache_vmap() after setting up the mappings, as some architectures require it.

V5: https://lore.kernel.org/all/20251212044516.37513-1-puranjay@kernel.org/
Changes in v5->v6:
Patch 1:
- Add a missing ';' so that this patch builds on its own. (AI)

V4: https://lore.kernel.org/all/20251212004350.6520-1-puranjay@kernel.org/
Changes in v4->v5:
Patch 1:
- Fix a memory leak in arena_alloc_pages(); it was being fixed in patch 3, but every patch should be complete in itself. (AI)
Patch 3:
- Drop a useless addition in arena_alloc_pages(). (Alexei)
- Add a comment about kmalloc_nolock() failure and expectations.

v3: https://lore.kernel.org/all/20251117160150.62183-1-puranjay@kernel.org/
Changes in v3->v4:
- Coding-style changes related to comments in patches 2/3. (Alexei)

v2: https://lore.kernel.org/all/20251114111700.43292-1-puranjay@kernel.org/
Changes in v2->v3:
Patch 1:
- Call range_tree_destroy() in the error path of populate_pgtable_except_pte() in arena_map_alloc(). (AI)
Patch 2:
- Fix a double mutex_unlock() in the error path of arena_alloc_pages(). (AI)
- Fix coding-style issues. (Alexei)
Patch 3:
- Unlock the spinlock before returning from arena_vm_fault() when BPF_F_SEGV_ON_FAULT is set by the user. (AI)
- Use __llist_del_all() in place of llist_del_all() for the on-stack llist (free_pages). (Alexei)
- Fix build issues on 32-bit systems where arena.c is not compiled. (kernel test robot)
- Make bpf_arena_alloc_pages() polymorphic so it knows whether it has been called in a sleepable or non-sleepable context.
  This information is passed to arena_free_pages() in the error path.
Patch 4:
- Add a better comment for the big_alloc3() test, which triggers kmalloc_nolock()'s limit and checks that bpf_arena_alloc_pages() works correctly above that limit.

v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/
Changes in v1->v2:
Patch 1:
- Include tlbflush.h to fix a build issue on loongarch. (kernel test robot)
- Fix an unused-variable error in apply_range_clear_cb(). (kernel test robot)
- Call bpf_map_area_free() in the error path of populate_pgtable_except_pte(). (AI)
- Use PAGE_SIZE in apply_to_existing_page_range(). (AI)
Patch 2:
- Cap the allocation made by kmalloc_nolock() for the pages array to KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop to overcome this limit. (AI)
Patch 3:
- Do page_ref_add(page, 1) under the spinlock to mitigate a race. (AI)
Patch 4:
- Add a new test case, big_alloc3(), to verifier_arena_large.c that tries to allocate a large number of pages at once; this triggers the kmalloc_nolock() limit from patch 2 and checks that the loop logic works correctly.

This set allows arena kfuncs to be called from non-sleepable contexts. It is achieved by the following changes:

The range_tree is now protected with a rqspinlock instead of a mutex; this change alone is enough to make bpf_arena_reserve_pages() safe in any context.

bpf_arena_alloc_pages() had four points where it could sleep:
1. The mutex protecting the range_tree: now replaced with a rqspinlock.
2. kvcalloc() for allocations: now replaced with kmalloc_nolock().
3. Allocating pages with bpf_map_alloc_pages(): this already calls alloc_pages_nolock() in non-sleepable contexts and is therefore safe.
4. Setting up kernel page tables with vm_area_map_pages(): vm_area_map_pages() may allocate memory while inserting pages into the bpf arena's vm_area.
Now, at arena creation time, all page table levels except the last are populated; when new pages need to be inserted, apply_to_page_range() is called again, which only does set_pte_at() for those pages and does not allocate memory.

The above four changes make bpf_arena_alloc_pages() safe in any context.

bpf_arena_free_pages() has to do the following steps:
1. Update the range_tree.
2. vm_area_unmap_pages(): unmap the pages from the kernel vm_area.
3. Flush the TLB: already done in step 2.
4. zap_pages(): unmap the pages from user page tables.
5. Free the pages.

The third patch in this set makes bpf_arena_free_pages() polymorphic using the specialize_kfunc() mechanism. When called from a sleepable context, arena_free_pages() remains mostly unchanged, except:
1. The rqspinlock is now taken instead of the mutex for the range tree.
2. Instead of vm_area_unmap_pages(), which can free intermediate page table levels, apply_to_existing_page_range() with a callback is used; it only does pte_clear() on the last level and leaves the intermediate page table levels intact. This is needed so that bpf_arena_alloc_pages() can safely do set_pte_at() without allocating intermediate page tables.

When arena_free_pages() is called from a non-sleepable context, or when it fails to acquire the rqspinlock in the sleepable case, a lock-less list of struct arena_free_span is used to queue the uaddr and page cnt. kmalloc_nolock() is used to allocate the arena_free_span; this can fail, but that trade-off is necessary for frees done from non-sleepable contexts. arena_free_pages() then raises an irq_work whose handler in turn schedules work that iterates this list and clears PTEs, flushes TLBs, zaps pages, and frees the pages for the queued uaddr and page cnts. apply_range_clear_cb() with apply_to_existing_page_range() is used to clear the PTEs and collect the pages to be freed; struct llist_node pcp_llist in struct page is used for this.
====================

Link: https://patch.msgid.link/20251222195022.431211-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Diffstat (limited to 'include/linux')
-rw-r--r--	include/linux/bpf.h | 16
1 file changed, 16 insertions, 0 deletions
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index da6a00dd313f..4e7d72dfbcd4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -673,6 +673,22 @@ void bpf_map_free_internal_structs(struct bpf_map *map, void *obj);
int bpf_dynptr_from_file_sleepable(struct file *file, u32 flags,
struct bpf_dynptr *ptr__uninit);
+#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt, int node_id,
+ u64 flags);
+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt);
+#else
+static inline void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt,
+ int node_id, u64 flags)
+{
+ return NULL;
+}
+
+static inline void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt)
+{
+}
+#endif
+
extern const struct bpf_map_ops bpf_map_offload_ops;
/* bpf_type_flag contains a set of flags that are applicable to the values of