From d85aecf2844ff02a0e5f077252b2461d4f10c9f0 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 24 Mar 2021 21:37:17 -0700 Subject: hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings The current implementation of hugetlb_cgroup for shared mappings could have different behavior. Consider the following two scenarios: 1.Assume initial css reference count of hugetlb_cgroup is 1: 1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference count is 2 associated with 1 file_region. 1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference count is 3 associated with 2 file_region. 1.3 coalesce_file_region will coalesce these two file_regions into one. So css reference count is 3 associated with 1 file_region now. 2.Assume initial css reference count of hugetlb_cgroup is 1 again: 2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference count is 2 associated with 1 file_region. Therefore, we might have one file_region while holding one or more css reference counts. This inconsistency could lead to imbalanced css_get() and css_put() pair. If we do css_put one by one (i.g. hole punch case), scenario 2 would put one more css reference. If we do css_put all together (i.g. truncate case), scenario 1 will leak one css reference. The imbalanced css_get() and css_put() pair would result in a non-zero reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup directory is removed __but__ associated resource is not freed. This might result in OOM or can not create a new hugetlb cgroup in a busy workload ultimately. In order to fix this, we have to make sure that one file_region must hold exactly one css reference. So in coalesce_file_region case, we should release one css reference before coalescence. Also only put css reference when the entire file_region is removed. The last thing to note is that the caller of region_add() will only hold one reference to h_cg->css for the whole contiguous reservation region. But this area might be scattered when there are already some file_regions reside in it. As a result, many file_regions may share only one h_cg->css reference. In order to ensure that one file_region must hold exactly one css reference, we should do css_get() for each file_region and release the reference held by caller when they are done. [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings] Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings") Reported-by: kernel test robot (auto build test ERROR) Signed-off-by: Miaohe Lin Reviewed-by: Mike Kravetz Cc: Aneesh Kumar K.V Cc: Wanpeng Li Cc: Mina Almasry Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/hugetlb_cgroup.h | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 2ad6e92f124a..0bff345c4bc6 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -113,6 +113,11 @@ static inline bool hugetlb_cgroup_disabled(void) return !cgroup_subsys_enabled(hugetlb_cgrp_subsys); } +static inline void hugetlb_cgroup_put_rsvd_cgroup(struct hugetlb_cgroup *h_cg) +{ + css_put(&h_cg->css); +} + extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, struct hugetlb_cgroup **ptr); extern int hugetlb_cgroup_charge_cgroup_rsvd(int idx, unsigned long nr_pages, @@ -138,7 +143,8 @@ extern void hugetlb_cgroup_uncharge_counter(struct resv_map *resv, extern void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv, struct file_region *rg, - unsigned long nr_pages); + unsigned long nr_pages, + bool region_del); extern void hugetlb_cgroup_file_init(void) __init; extern void hugetlb_cgroup_migrate(struct page *oldhpage, @@ -147,7 +153,8 @@ extern void hugetlb_cgroup_migrate(struct page *oldhpage, #else static inline void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv, struct file_region *rg, - unsigned long nr_pages) + unsigned long nr_pages, + bool region_del) { } @@ -185,6 +192,10 @@ static inline bool hugetlb_cgroup_disabled(void) return true; } +static inline void hugetlb_cgroup_put_rsvd_cgroup(struct hugetlb_cgroup *h_cg) +{ +} + static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, struct hugetlb_cgroup **ptr) { -- cgit v1.2.3 From cf10bd4c4aff8dd64d1aa7f2a529d0c672bc16af Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Wed, 24 Mar 2021 21:37:20 -0700 Subject: kasan: fix per-page tags for non-page_alloc pages To allow performing tag checks on page_alloc addresses obtained via page_address(), tag-based KASAN modes store tags for page_alloc allocations in page->flags. Currently, the default tag value stored in page->flags is 0x00. Therefore, page_address() returns a 0x00ffff... address for pages that were not allocated via page_alloc. This might cause problems. A particular case we encountered is a conflict with KFENCE. If a KFENCE-allocated slab object is being freed via kfree(page_address(page) + offset), the address passed to kfree() will get tagged with 0x00 (as slab pages keep the default per-page tags). This leads to is_kfence_address() check failing, and a KFENCE object ending up in normal slab freelist, which causes memory corruptions. This patch changes the way KASAN stores tag in page-flags: they are now stored xor'ed with 0xff. This way, KASAN doesn't need to initialize per-page flags for every created page, which might be slow. With this change, page_address() returns natively-tagged (with 0xff) pointers for pages that didn't have tags set explicitly. This patch fixes the encountered conflict with KFENCE and prevents more similar issues that can occur in the future. Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc") Signed-off-by: Andrey Konovalov Reviewed-by: Marco Elver Cc: Catalin Marinas Cc: Will Deacon Cc: Vincenzo Frascino Cc: Dmitry Vyukov Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Peter Collingbourne Cc: Evgenii Stepanov Cc: Branislav Rankov Cc: Kevin Brodsky Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 64a71bf20536..8ba434287387 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1461,16 +1461,28 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid) #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS) +/* + * KASAN per-page tags are stored xor'ed with 0xff. This allows to avoid + * setting tags for all pages to native kernel tag value 0xff, as the default + * value 0x00 maps to 0xff. + */ + static inline u8 page_kasan_tag(const struct page *page) { - if (kasan_enabled()) - return (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK; - return 0xff; + u8 tag = 0xff; + + if (kasan_enabled()) { + tag = (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK; + tag ^= 0xff; + } + + return tag; } static inline void page_kasan_tag_set(struct page *page, u8 tag) { if (kasan_enabled()) { + tag ^= 0xff; page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT); page->flags |= (tag & KASAN_TAG_MASK) << KASAN_TAG_PGSHIFT; } -- cgit v1.2.3 From c2655835fd8cabdfe7dab737253de3ffb88da126 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 24 Mar 2021 21:37:23 -0700 Subject: mm/mmu_notifiers: ensure range_end() is paired with range_start() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit If one or more notifiers fails .invalidate_range_start(), invoke .invalidate_range_end() for "all" notifiers. If there are multiple notifiers, those that did not fail are expecting _start() and _end() to be paired, e.g. KVM's mmu_notifier_count would become imbalanced. Disallow notifiers that can fail _start() from implementing _end() so that it's unnecessary to either track which notifiers rejected _start(), or had already succeeded prior to a failed _start(). Note, the existing behavior of calling _start() on all notifiers even after a previous notifier failed _start() was an unintented "feature". Make it canon now that the behavior is depended on for correctness. As of today, the bug is likely benign: 1. The only caller of the non-blocking notifier is OOM kill. 2. The only notifiers that can fail _start() are the i915 and Nouveau drivers. 3. The only notifiers that utilize _end() are the SGI UV GRU driver and KVM. 4. The GRU driver will never coincide with the i195/Nouveau drivers. 5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the _guest_, and the guest is already doomed due to being an OOM victim. Fix the bug now to play nice with future usage, e.g. KVM has a potential use case for blocking memslot updates in KVM while an invalidation is in-progress, and failure to unblock would result in said updates being blocked indefinitely and hanging. Found by inspection. Verified by adding a second notifier in KVM that periodically returns -EAGAIN on non-blockable ranges, triggering OOM, and observing that KVM exits with an elevated notifier count. Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com Fixes: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") Signed-off-by: Sean Christopherson Suggested-by: Jason Gunthorpe Reviewed-by: Jason Gunthorpe Cc: David Rientjes Cc: Ben Gardon Cc: Michal Hocko Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Dimitri Sivanich Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmu_notifier.h | 10 +++++----- mm/mmu_notifier.c | 23 +++++++++++++++++++++++ 2 files changed, 28 insertions(+), 5 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index b8200782dede..1a6a9eb6d3fa 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -169,11 +169,11 @@ struct mmu_notifier_ops { * the last refcount is dropped. * * If blockable argument is set to false then the callback cannot - * sleep and has to return with -EAGAIN. 0 should be returned - * otherwise. Please note that if invalidate_range_start approves - * a non-blocking behavior then the same applies to - * invalidate_range_end. - * + * sleep and has to return with -EAGAIN if sleeping would be required. + * 0 should be returned otherwise. Please note that notifiers that can + * fail invalidate_range_start are not allowed to implement + * invalidate_range_end, as there is no mechanism for informing the + * notifier that its start failed. */ int (*invalidate_range_start)(struct mmu_notifier *subscription, const struct mmu_notifier_range *range); diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 61ee40ed804e..459d195d2ff6 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -501,10 +501,33 @@ static int mn_hlist_invalidate_range_start( ""); WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN); + /* + * We call all the notifiers on any EAGAIN, + * there is no way for a notifier to know if + * its start method failed, thus a start that + * does EAGAIN can't also do end. + */ + WARN_ON(ops->invalidate_range_end); ret = _ret; } } } + + if (ret) { + /* + * Must be non-blocking to get here. If there are multiple + * notifiers and one or more failed start, any that succeeded + * start are expecting their end to be called. Do so now. + */ + hlist_for_each_entry_rcu(subscription, &subscriptions->list, + hlist, srcu_read_lock_held(&srcu)) { + if (!subscription->ops->invalidate_range_end) + continue; + + subscription->ops->invalidate_range_end(subscription, + range); + } + } srcu_read_unlock(&srcu, id); return ret; -- cgit v1.2.3 From a024b7c2850dddd01e65b8270f0971deaf272f27 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 24 Mar 2021 21:37:50 -0700 Subject: mm: memblock: fix section mismatch warning again Commit 34dc2efb39a2 ("memblock: fix section mismatch warning") marked memblock_bottom_up() and memblock_set_bottom_up() as __init, but they could be referenced from non-init functions like memblock_find_in_range_node() on architectures that enable CONFIG_ARCH_KEEP_MEMBLOCK. For such builds kernel test robot reports: WARNING: modpost: vmlinux.o(.text+0x74fea4): Section mismatch in reference from the function memblock_find_in_range_node() to the function .init.text:memblock_bottom_up() The function memblock_find_in_range_node() references the function __init memblock_bottom_up(). This is often because memblock_find_in_range_node lacks a __init annotation or the annotation of memblock_bottom_up is wrong. Replace __init annotations with __init_memblock annotations so that the appropriate section will be selected depending on CONFIG_ARCH_KEEP_MEMBLOCK. Link: https://lore.kernel.org/lkml/202103160133.UzhgY0wt-lkp@intel.com Link: https://lkml.kernel.org/r/20210316171347.14084-1-rppt@kernel.org Fixes: 34dc2efb39a2 ("memblock: fix section mismatch warning") Signed-off-by: Mike Rapoport Reviewed-by: Arnd Bergmann Reported-by: kernel test robot Reviewed-by: David Hildenbrand Acked-by: Nick Desaulniers Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memblock.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/memblock.h b/include/linux/memblock.h index d13e3cd938b4..5984fff3f175 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -460,7 +460,7 @@ static inline void memblock_free_late(phys_addr_t base, phys_addr_t size) /* * Set the allocation direction to bottom-up or top-down. */ -static inline __init void memblock_set_bottom_up(bool enable) +static inline __init_memblock void memblock_set_bottom_up(bool enable) { memblock.bottom_up = enable; } @@ -470,7 +470,7 @@ static inline __init void memblock_set_bottom_up(bool enable) * if this is true, that said, memblock will allocate memory * in bottom-up direction. */ -static inline __init bool memblock_bottom_up(void) +static inline __init_memblock bool memblock_bottom_up(void) { return memblock.bottom_up; } -- cgit v1.2.3