summaryrefslogtreecommitdiff
path: root/mm/memcontrol.c
AgeCommit message (Collapse)AuthorFilesLines
2025-11-24memcg: fix data-race KCSAN bug in rstatsBreno Leitao1-5/+7
commit 78ec6f9df6642418411c534683da6133e0962ec7 upstream. A data-race issue in memcg rstat occurs when two distinct code paths access the same 4-byte region concurrently. KCSAN detection triggers the following BUG as a result. BUG: KCSAN: data-race in __count_memcg_events / mem_cgroup_css_rstat_flush write to 0xffffe8ffff98e300 of 4 bytes by task 5274 on cpu 17: mem_cgroup_css_rstat_flush (mm/memcontrol.c:5850) cgroup_rstat_flush_locked (kernel/cgroup/rstat.c:243 (discriminator 7)) cgroup_rstat_flush (./include/linux/spinlock.h:401 kernel/cgroup/rstat.c:278) mem_cgroup_flush_stats.part.0 (mm/memcontrol.c:767) memory_numa_stat_show (mm/memcontrol.c:6911) <snip> read to 0xffffe8ffff98e300 of 4 bytes by task 410848 on cpu 27: __count_memcg_events (mm/memcontrol.c:725 mm/memcontrol.c:962) count_memcg_event_mm.part.0 (./include/linux/memcontrol.h:1097 ./include/linux/memcontrol.h:1120) handle_mm_fault (mm/memory.c:5483 mm/memory.c:5622) <snip> value changed: 0x00000029 -> 0x00000000 The race occurs because two code paths access the same "stats_updates" location. Although "stats_updates" is a per-CPU variable, it is remotely accessed by another CPU at cgroup_rstat_flush_locked()->mem_cgroup_css_rstat_flush(), leading to the data race mentioned. Considering that memcg_rstat_updated() is in the hot code path, adding a lock to protect it may not be desirable, especially since this variable pertains solely to statistics. Therefore, annotating accesses to stats_updates with READ/WRITE_ONCE() can prevent KCSAN splats and potential partial reads/writes. Link: https://lkml.kernel.org/r/20240424125940.2410718-1-leitao@debian.org Fixes: 9cee7e8ef3e3 ("mm: memcg: optimize parent iteration in memcg_rstat_updated()") Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm: memcg: optimize parent iteration in memcg_rstat_updated()Yosry Ahmed1-21/+35
commit 9cee7e8ef3e31ca25b40ca52b8585dc6935deff2 upstream. In memcg_rstat_updated(), we iterate the memcg being updated and its parents to update memcg->vmstats_percpu->stats_updates in the fast path (i.e. no atomic updates). According to my math, this is 3 memory loads (and potentially 3 cache misses) per memcg: - Load the address of memcg->vmstats_percpu. - Load vmstats_percpu->stats_updates (based on some percpu calculation). - Load the address of the parent memcg. Avoid most of the cache misses by caching a pointer from each struct memcg_vmstats_percpu to its parent on the corresponding CPU. In this case, for the first memcg we have 2 memory loads (same as above): - Load the address of memcg->vmstats_percpu. - Load vmstats_percpu->stats_updates (based on some percpu calculation). Then for each additional memcg, we need a single load to get the parent's stats_updates directly. This reduces the number of loads from O(3N) to O(2+N) -- where N is the number of memcgs we need to iterate. Additionally, stash a pointer to memcg->vmstats in each struct memcg_vmstats_percpu such that we can access the atomic counter that all CPUs fold into, memcg->vmstats->stats_updates. memcg_should_flush_stats() is changed to memcg_vmstats_needs_flush() to accept a struct memcg_vmstats pointer accordingly. In struct memcg_vmstats_percpu, make sure both pointers together with stats_updates live on the same cacheline. Finally, update mem_cgroup_alloc() to take in a parent pointer and initialize the new cache pointers on each CPU. The percpu loop in mem_cgroup_alloc() may look concerning, but there are multiple similar loops in the cgroup creation path (e.g. cgroup_rstat_init()), most of which are hidden within alloc_percpu(). According to Oliver's testing [1], this fixes multiple 30-38% regressions in vm-scalability, will-it-scale-tlb_flush2, and will-it-scale-fallocate1. This comes at a cost of 2 more pointers per CPU (<2KB on a machine with 128 CPUs). [1] https://lore.kernel.org/lkml/ZbDJsfsZt2ITyo61@xsang-OptiPlex-9020/ [yosryahmed@google.com: fix struct memcg_vmstats_percpu size and alignment] Link: https://lkml.kernel.org/r/20240203044612.1234216-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20240124100023.660032-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Fixes: 8d59d2214c23 ("mm: memcg: make stats flushing threshold per-memcg") Tested-by: kernel test robot <oliver.sang@intel.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202401221624.cb53a8ca-oliver.sang@intel.com Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm: memcg: restore subtree stats flushingYosry Ahmed1-29/+39
[ Upstream commit 7d7ef0a4686abe43cd76a141b340a348f45ecdf2 ] Stats flushing for memcg currently follows the following rules: - Always flush the entire memcg hierarchy (i.e. flush the root). - Only one flusher is allowed at a time. If someone else tries to flush concurrently, they skip and return immediately. - A periodic flusher flushes all the stats every 2 seconds. The reason this approach is followed is because all flushes are serialized by a global rstat spinlock. On the memcg side, flushing is invoked from userspace reads as well as in-kernel flushers (e.g. reclaim, refault, etc). This approach aims to avoid serializing all flushers on the global lock, which can cause a significant performance hit under high concurrency. This approach has the following problems: - Occasionally a userspace read of the stats of a non-root cgroup will be too expensive as it has to flush the entire hierarchy [1]. - Sometimes the stats accuracy are compromised if there is an ongoing flush, and we skip and return before the subtree of interest is actually flushed, yielding stale stats (by up to 2s due to periodic flushing). This is more visible when reading stats from userspace, but can also affect in-kernel flushers. The latter problem is particulary a concern when userspace reads stats after an event occurs, but gets stats from before the event. Examples: - When memory usage / pressure spikes, a userspace OOM handler may look at the stats of different memcgs to select a victim based on various heuristics (e.g. how much private memory will be freed by killing this). Reading stale stats from before the usage spike in this case may cause a wrongful OOM kill. - A proactive reclaimer may read the stats after writing to memory.reclaim to measure the success of the reclaim operation. Stale stats from before reclaim may give a false negative. - Reading the stats of a parent and a child memcg may be inconsistent (child larger than parent), if the flush doesn't happen when the parent is read, but happens when the child is read. As for in-kernel flushers, they will occasionally get stale stats. No regressions are currently known from this, but if there are regressions, they would be very difficult to debug and link to the source of the problem. This patch aims to fix these problems by restoring subtree flushing, and removing the unified/coalesced flushing logic that skips flushing if there is an ongoing flush. This change would introduce a significant regression with global stats flushing thresholds. With per-memcg stats flushing thresholds, this seems to perform really well. The thresholds protect the underlying lock from unnecessary contention. This patch was tested in two ways to ensure the latency of flushing is up to par, on a machine with 384 cpus: - A synthetic test with 5000 concurrent workers in 500 cgroups doing allocations and reclaim, as well as 1000 readers for memory.stat (variation of [2]). No regressions were noticed in the total runtime. Note that significant regressions in this test are observed with global stats thresholds, but not with per-memcg thresholds. - A synthetic stress test for concurrently reading memcg stats while memory allocation/freeing workers are running in the background, provided by Wei Xu [3]. With 250k threads reading the stats every 100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01% of reads take more than 1ms, and no reads take more than 100ms. [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/ [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/ [akpm@linux-foundation.org: fix mm/zswap.c] [yosryahmed@google.com: remove stats flushing mutex] Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm: memcg: make stats flushing threshold per-memcgYosry Ahmed1-16/+34
[ Upstream commit 8d59d2214c2362e7a9d185d80b613e632581af7b ] A global counter for the magnitude of memcg stats update is maintained on the memcg side to avoid invoking rstat flushes when the pending updates are not significant. This avoids unnecessary flushes, which are not very cheap even if there isn't a lot of stats to flush. It also avoids unnecessary lock contention on the underlying global rstat lock. Make this threshold per-memcg. The scheme is followed where percpu (now also per-memcg) counters are incremented in the update path, and only propagated to per-memcg atomics when they exceed a certain threshold. This provides two benefits: (a) On large machines with a lot of memcgs, the global threshold can be reached relatively fast, so guarding the underlying lock becomes less effective. Making the threshold per-memcg avoids this. (b) Having a global threshold makes it hard to do subtree flushes, as we cannot reset the global counter except for a full flush. Per-memcg counters removes this as a blocker from doing subtree flushes, which helps avoid unnecessary work when the stats of a small subtree are needed. Nothing is free, of course. This comes at a cost: (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 bytes. The extra memory usage is insigificant. (b) More work on the update side, although in the common case it will only be percpu counter updates. The amount of work scales with the number of ancestors (i.e. tree depth). This is not a new concept, adding a cgroup to the rstat tree involves a parent loop, so is charging. Testing results below show no significant regressions. (c) The error margin in the stats for the system as a whole increases from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * NR_MEMCGS. This is probably fine because we have a similar per-memcg error in charges coming from percpu stocks, and we have a periodic flusher that makes sure we always flush all the stats every 2s anyway. This patch was tested to make sure no significant regressions are introduced on the update path as follows. The following benchmarks were ran in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/): (1) Running 22 instances of netperf on a 44 cpu machine with hyperthreading disabled. All instances are run in a level 2 cgroup, as well as netserver: # netserver -6 # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Averaging 20 runs, the numbers are as follows: Base: 40198.0 mbps Patched: 38629.7 mbps (-3.9%) The regression is minimal, especially for 22 instances in the same cgroup sharing all ancestors (so updating the same atomics). (2) will-it-scale page_fault tests. These tests (specifically per_process_ops in page_fault3 test) detected a 25.9% regression before for a change in the stats update path [1]. These are the numbers from 10 runs (+ is good) on a machine with 256 cpus: LABEL | MEAN | MEDIAN | STDDEV | ------------------------------+-------------+-------------+------------- page_fault1_per_process_ops | | | | (A) base | 270249.164 | 265437.000 | 13451.836 | (B) patched | 261368.709 | 255725.000 | 13394.767 | | -3.29% | -3.66% | | page_fault1_per_thread_ops | | | | (A) base | 242111.345 | 239737.000 | 10026.031 | (B) patched | 237057.109 | 235305.000 | 9769.687 | | -2.09% | -1.85% | | page_fault1_scalability | | | (A) base | 0.034387 | 0.035168 | 0.0018283 | (B) patched | 0.033988 | 0.034573 | 0.0018056 | | -1.16% | -1.69% | | page_fault2_per_process_ops | | | (A) base | 203561.836 | 203301.000 | 2550.764 | (B) patched | 197195.945 | 197746.000 | 2264.263 | | -3.13% | -2.73% | | page_fault2_per_thread_ops | | | (A) base | 171046.473 | 170776.000 | 1509.679 | (B) patched | 166626.327 | 166406.000 | 768.753 | | -2.58% | -2.56% | | page_fault2_scalability | | | (A) base | 0.054026 | 0.053821 | 0.00062121 | (B) patched | 0.053329 | 0.05306 | 0.00048394 | | -1.29% | -1.41% | | page_fault3_per_process_ops | | | (A) base | 1295807.782 | 1297550.000 | 5907.585 | (B) patched | 1275579.873 | 1273359.000 | 8759.160 | | -1.56% | -1.86% | | page_fault3_per_thread_ops | | | (A) base | 391234.164 | 390860.000 | 1760.720 | (B) patched | 377231.273 | 376369.000 | 1874.971 | | -3.58% | -3.71% | | page_fault3_scalability | | | (A) base | 0.60369 | 0.60072 | 0.0083029 | (B) patched | 0.61733 | 0.61544 | 0.009855 | | +2.26% | +2.45% | | All regressions seem to be minimal, and within the normal variance for the benchmark. The fix for [1] assumes that 3% is noise -- and there were no further practical complaints), so hopefully this means that such variations in these microbenchmarks do not reflect on practical workloads. (3) I also ran stress-ng in a nested cgroup and did not observe any obvious regressions. [1]https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/ Link: https://lkml.kernel.org/r/20231129032154.3710765-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm: memcg: move vmstats structs definition above flushing codeYosry Ahmed1-74/+74
[ Upstream commit e0bf1dc859fdd08ef738824710770a30a8069433 ] The following patch will make use of those structs in the flushing code, so move their definitions (and a few other dependencies) a little bit up to reduce the diff noise in the following patch. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm: memcg: change flush_next_time to flush_last_timeYosry Ahmed1-3/+4
[ Upstream commit 508bed884767a8eb394640bae9edcdf082816c43 ] Patch series "mm: memcg: subtree stats flushing and thresholds", v4. This series attempts to address shortages in today's approach for memcg stats flushing, namely occasionally stale or expensive stat reads. The series does so by changing the threshold that we use to decide whether to trigger a flush to be per memcg instead of global (patch 3), and then changing flushing to be per memcg (i.e. subtree flushes) instead of global (patch 5). This patch (of 5): flush_next_time is an inaccurate name. It's not the next time that periodic flushing will happen, it's rather the next time that ratelimited flushing can happen if the periodic flusher is late. Simplify its semantics by just storing the timestamp of the last flush instead, flush_last_time. Move the 2*FLUSH_TIME addition to mem_cgroup_flush_stats_ratelimited(), and add a comment explaining it. This way, all the ratelimiting semantics live in one place. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20231129032154.3710765-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Chris Li <chrisl@kernel.org> (Google) Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm: memcg: add per-memcg zswap writeback statDomenico Cerasuolo1-0/+1
[ Upstream commit 7108cc3f765cafd48a6a35f8add140beaecfa75b ] Since zswap now writes back pages from memcg-specific LRUs, we now need a new stat to show writebacks count for each memcg. [nphamcs@gmail.com: rename ZSWP_WB to ZSWPWB] Link: https://lkml.kernel.org/r/20231205193307.2432803-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231130194023.4102148-5-nphamcs@gmail.com Suggested-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm: memcg: add THP swap out info for anonymous reclaimXin Hao1-0/+2
[ Upstream commit 811244a501b967b00fecb1ae906d5dc6329c91e0 ] At present, we support per-memcg reclaim strategy, however we do not know the number of transparent huge pages being reclaimed, as we know the transparent huge pages need to be splited before reclaim them, and they will bring some performance bottleneck effect. for example, when two memcg (A & B) are doing reclaim for anonymous pages at same time, and 'A' memcg is reclaiming a large number of transparent huge pages, we can better analyze that the performance bottleneck will be caused by 'A' memcg. therefore, in order to better analyze such problems, there add THP swap out info for per-memcg. [akpm@linux-foundation.orgL fix swap_writepage_fs(), per Johannes] Link: https://lkml.kernel.org/r/20230913213343.GB48476@cmpxchg.org Link: https://lkml.kernel.org/r/20230913164938.16918-1-vernhao@tencent.com Signed-off-by: Xin Hao <vernhao@tencent.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-04memcg: always call cond_resched() after fn()Breno Leitao1-4/+2
commit 06717a7b6c86514dbd6ab322e8083ffaa4db5712 upstream. I am seeing soft lockup on certain machine types when a cgroup OOMs. This is happening because killing the process in certain machine might be very slow, which causes the soft lockup and RCU stalls. This happens usually when the cgroup has MANY processes and memory.oom.group is set. Example I am seeing in real production: [462012.244552] Memory cgroup out of memory: Killed process 3370438 (crosvm) .... .... [462037.318059] Memory cgroup out of memory: Killed process 4171372 (adb) .... [462037.348314] watchdog: BUG: soft lockup - CPU#64 stuck for 26s! [stat_manager-ag:1618982] .... Quick look at why this is so slow, it seems to be related to serial flush for certain machine types. For all the crashes I saw, the target CPU was at console_flush_all(). In the case above, there are thousands of processes in the cgroup, and it is soft locking up before it reaches the 1024 limit in the code (which would call the cond_resched()). So, cond_resched() in 1024 blocks is not sufficient. Remove the counter-based conditional rescheduling logic and call cond_resched() unconditionally after each task iteration, after fn() is called. This avoids the lockup independently of how slow fn() is. Link: https://lkml.kernel.org/r/20250523-memcg_fix-v1-1-ad3eafb60477@debian.org Fixes: ade81479c7dd ("memcg: fix soft lockup in the OOM process") Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Rik van Riel <riel@surriel.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Michael van der Westhuizen <rmikey@meta.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-27memcg: fix soft lockup in the OOM processChen Ridong1-1/+6
[ Upstream commit ade81479c7dda1ce3eedb215c78bc615bbd04f06 ] A soft lockup issue was found in the product with about 56,000 tasks were in the OOM cgroup, it was traversing them when the soft lockup was triggered. watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [VM Thread:1503066] CPU: 2 PID: 1503066 Comm: VM Thread Kdump: loaded Tainted: G Hardware name: Huawei Cloud OpenStack Nova, BIOS RIP: 0010:console_unlock+0x343/0x540 RSP: 0000:ffffb751447db9a0 EFLAGS: 00000247 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000ffffffff RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000247 RBP: ffffffffafc71f90 R08: 0000000000000000 R09: 0000000000000040 R10: 0000000000000080 R11: 0000000000000000 R12: ffffffffafc74bd0 R13: ffffffffaf60a220 R14: 0000000000000247 R15: 0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2fe6ad91f0 CR3: 00000004b2076003 CR4: 0000000000360ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: vprintk_emit+0x193/0x280 printk+0x52/0x6e dump_task+0x114/0x130 mem_cgroup_scan_tasks+0x76/0x100 dump_header+0x1fe/0x210 oom_kill_process+0xd1/0x100 out_of_memory+0x125/0x570 mem_cgroup_out_of_memory+0xb5/0xd0 try_charge+0x720/0x770 mem_cgroup_try_charge+0x86/0x180 mem_cgroup_try_charge_delay+0x1c/0x40 do_anonymous_page+0xb5/0x390 handle_mm_fault+0xc4/0x1f0 This is because thousands of processes are in the OOM cgroup, it takes a long time to traverse all of them. As a result, this lead to soft lockup in the OOM process. To fix this issue, call 'cond_resched' in the 'mem_cgroup_scan_tasks' function per 1000 iterations. For global OOM, call 'touch_softlockup_watchdog' per 1000 iterations to avoid this issue. Link: https://lkml.kernel.org/r/20241224025238.3768787-1-chenridong@huaweicloud.com Fixes: 9cbb78bb3143 ("mm, memcg: introduce own oom handler to iterate only over its own threads") Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-17mm/thp: fix deferred split unqueue naming and lockingHugh Dickins1-3/+29
commit f8f931bba0f92052cf842b7e30917b1afcc77d5a upstream. Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()). Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately). Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data. That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace. Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment. Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case. The 87eaceb3faa5 commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. As above, false in rare cases, and now commonly false. Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases. Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Upstream commit itself does not apply cleanly, because there are fewer calls to folio_undo_large_rmappable() in this tree (in particular, folio migration does not migrate memcg charge), and mm/memcontrol-v1.c has not been split out of mm/memcontrol.c. ] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-17mm: always initialise folio->_deferred_listMatthew Wilcox (Oracle)1-0/+3
commit b7b098cf00a2b65d5654a86dc8edf82f125289c1 upstream. Patch series "Various significant MM patches". These patches all interact in annoying ways which make it tricky to send them out in any way other than a big batch, even though there's not really an overarching theme to connect them. The big effects of this patch series are: - folio_test_hugetlb() becomes reliable, even when called without a page reference - We free up PG_slab, and we could always use more page flags - We no longer need to check PageSlab before calling page_mapcount() This patch (of 9): For compound pages which are at least order-2 (and hence have a deferred_list), initialise it and then we can check at free that the page is not part of a deferred list. We recently found this useful to rule out a source of corruption. [peterx@redhat.com: always initialise folio->_deferred_list] Link: https://lkml.kernel.org/r/20240417211836.2742593-2-peterx@redhat.com Link: https://lkml.kernel.org/r/20240321142448.1645400-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240321142448.1645400-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Include three small changes from the upstream commit, for backport safety: replace list_del() by list_del_init() in split_huge_page_to_list(), like c010d47f107f ("mm: thp: split huge page to any lower order pages"); replace list_del() by list_del_init() in folio_undo_large_rmappable(), like 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration"); keep __free_pages() instead of folio_put() in __update_and_free_hugetlb_folio(). ] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29memcg_write_event_control(): fix a user-triggerable oopsAl Viro1-2/+5
commit 046667c4d3196938e992fba0dfcde570aa85cd0e upstream. we are *not* guaranteed that anything past the terminating NUL is mapped (let alone initialized with anything sane). Fixes: 0dea116876ee ("cgroup: implement eventfd-based generic API for notifications") Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14memcg: protect concurrent access to mem_cgroup_idrShakeel Butt1-2/+20
commit 9972605a238339b85bd16b084eed5f18414d22db upstream. Commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") decoupled the memcg IDs from the CSS ID space to fix the cgroup creation failures. It introduced IDR to maintain the memcg ID space. The IDR depends on external synchronization mechanisms for modifications. For the mem_cgroup_idr, the idr_alloc() and idr_replace() happen within css callback and thus are protected through cgroup_mutex from concurrent modifications. However idr_remove() for mem_cgroup_idr was not protected against concurrency and can be run concurrently for different memcgs when they hit their refcnt to zero. Fix that. We have been seeing list_lru based kernel crashes at a low frequency in our fleet for a long time. These crashes were in different part of list_lru code including list_lru_add(), list_lru_del() and reparenting code. Upon further inspection, it looked like for a given object (dentry and inode), the super_block's list_lru didn't have list_lru_one for the memcg of that object. The initial suspicions were either the object is not allocated through kmem_cache_alloc_lru() or somehow memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but returned success. No evidence were found for these cases. Looking more deeply, we started seeing situations where valid memcg's id is not present in mem_cgroup_idr and in some cases multiple valid memcgs have same id and mem_cgroup_idr is pointing to one of them. So, the most reasonable explanation is that these situations can happen due to race between multiple idr_remove() calls or race between idr_alloc()/idr_replace() and idr_remove(). These races are causing multiple memcgs to acquire the same ID and then offlining of one of them would cleanup list_lrus on the system for all of them. Later access from other memcgs to the list_lru cause crashes due to missing list_lru_one. Link: https://lkml.kernel.org/r/20240802235822.1830976-1-shakeel.butt@linux.dev Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-01mm: memcontrol: clarify swapaccount=0 deprecation warningJohannes Weiner1-3/+7
commit 118642d7f606fc9b9c92ee611275420320290ffb upstream. The swapaccount deprecation warning is throwing false positives. Since we deprecated the knob and defaulted to enabling, the only reports we've been getting are from folks that set swapaccount=1. While this is a nice affirmation that always-enabling was the right choice, we certainly don't want to warn when users request the supported mode. Only warn when disabling is requested, and clarify the warning. [colin.i.king@gmail.com: spelling: "commdandline" -> "commandline"] Link: https://lkml.kernel.org/r/20240215090544.1649201-1-colin.i.king@gmail.com Link: https://lkml.kernel.org/r/20240213081634.3652326-1-hannes@cmpxchg.org Fixes: b25806dcd3d5 ("mm: memcontrol: deprecate swapaccounting=0 mode") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reported-by: "Jonas Schäfer" <jonas@wielicki.name> Reported-by: Narcis Garcia <debianlists@actiu.net> Suggested-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28mm: kmem: drop __GFP_NOFAIL when allocating objcg vectorsRoman Gushchin1-1/+2
commit 24948e3b7b12e0031a6edb4f49bbb9fb2ad1e4e9 upstream. Objcg vectors attached to slab pages to store slab object ownership information are allocated using gfp flags for the original slab allocation. Depending on slab page order and the size of slab objects, objcg vector can take several pages. If the original allocation was done with the __GFP_NOFAIL flag, it triggered a warning in the page allocation code. Indeed, order > 1 pages should not been allocated with the __GFP_NOFAIL flag. Fix this by simply dropping the __GFP_NOFAIL flag when allocating the objcg vector. It effectively allows to skip the accounting of a single slab object under a heavy memory pressure. An alternative would be to implement the mechanism to fallback to order-0 allocations for accounting metadata, which is also not perfect because it will increase performance penalty and memory footprint of the kernel memory accounting under memory pressure. Link: https://lkml.kernel.org/r/ZUp8ZFGxwmCx4ZFr@P9FQF9L96D.corp.robot.car Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reported-by: Christoph Lameter <cl@linux.com> Closes: https://lkml.kernel.org/r/6b42243e-f197-600a-5d22-56bd728a5ad8@gentwo.org Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-30mm, memcg: reconsider kmem.limit_in_bytes deprecationMichal Hocko1-0/+13
This reverts commits 86327e8eb94c ("memcg: drop kmem.limit_in_bytes") and partially reverts 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes") which have incrementally removed support for the kernel memory accounting hard limit. Unfortunately it has turned out that there is still userspace depending on the existence of memory.kmem.limit_in_bytes [1]. The underlying functionality is not really required but the non-existent file just confuses the userspace which fails in the result. The patch to fix this on the userspace side has been submitted but it is hard to predict how it will propagate through the maze of 3rd party consumers of the software. Now, reverting alone 86327e8eb94c is not an option because there is another set of userspace which cannot cope with ENOTSUPP returned when writing to the file. Therefore we have to go and revisit 58056f77502f as well. There are two ways to go ahead. Either we give up on the deprecation and fully revert 58056f77502f as well or we can keep kmem.limit_in_bytes but make the write a noop and warn about the fact. This should work for both known breaking workloads which depend on the existence but do not depend on the hard limit enforcement. Note to backporters to stable trees. a8c49af3be5f ("memcg: add per-memcg total kernel memory stat") introduced in 4.18 has added memcg_account_kmem so the accounting is not done by obj_cgroup_charge_pages directly for v1 anymore. Prior kernels need to add it explicitly (thanks to Johannes for pointing this out). [akpm@linux-foundation.org: fix build - remove unused local] Link: http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net [1] Link: https://lkml.kernel.org/r/ZRE5VJozPZt9bRPy@dhcp22.suse.cz Fixes: 86327e8eb94c ("memcg: drop kmem.limit_in_bytes") Fixes: 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-19mm: memcontrol: fix GFP_NOFS recursion in memory.high enforcementJohannes Weiner1-3/+3
Breno and Josef report a deadlock scenario from cgroup reclaim re-entering the filesystem: [ 361.546690] ====================================================== [ 361.559210] WARNING: possible circular locking dependency detected [ 361.571703] 6.5.0-0_fbk700_debug_rc0_kbuilder_13159_gbf787a128001 #1 Tainted: G S E [ 361.589704] ------------------------------------------------------ [ 361.602277] find/9315 is trying to acquire lock: [ 361.611625] ffff88837ba140c0 (&delayed_node->mutex){+.+.}-{4:4}, at: __btrfs_release_delayed_node+0x68/0x4f0 [ 361.631437] [ 361.631437] but task is already holding lock: [ 361.643243] ffff8881765b8678 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock+0x1e/0x40 [ 362.904457] mutex_lock_nested+0x1c/0x30 [ 362.912414] __btrfs_release_delayed_node+0x68/0x4f0 [ 362.922460] btrfs_evict_inode+0x301/0x770 [ 362.982726] evict+0x17c/0x380 [ 362.988944] prune_icache_sb+0x100/0x1d0 [ 363.005559] super_cache_scan+0x1f8/0x260 [ 363.013695] do_shrink_slab+0x2a2/0x540 [ 363.021489] shrink_slab_memcg+0x237/0x3d0 [ 363.050606] shrink_slab+0xa7/0x240 [ 363.083382] shrink_node_memcgs+0x262/0x3b0 [ 363.091870] shrink_node+0x1a4/0x720 [ 363.099150] shrink_zones+0x1f6/0x5d0 [ 363.148798] do_try_to_free_pages+0x19b/0x5e0 [ 363.157633] try_to_free_mem_cgroup_pages+0x266/0x370 [ 363.190575] reclaim_high+0x16f/0x1f0 [ 363.208409] mem_cgroup_handle_over_high+0x10b/0x270 [ 363.246678] try_charge_memcg+0xaf2/0xc70 [ 363.304151] charge_memcg+0xf0/0x350 [ 363.320070] __mem_cgroup_charge+0x28/0x40 [ 363.328371] __filemap_add_folio+0x870/0xd50 [ 363.371303] filemap_add_folio+0xdd/0x310 [ 363.399696] __filemap_get_folio+0x2fc/0x7d0 [ 363.419086] pagecache_get_page+0xe/0x30 [ 363.427048] alloc_extent_buffer+0x1cd/0x6a0 [ 363.435704] read_tree_block+0x43/0xc0 [ 363.443316] read_block_for_search+0x361/0x510 [ 363.466690] btrfs_search_slot+0xc8c/0x1520 This is caused by the mem_cgroup_handle_over_high() not respecting the gfp_mask of the allocation context. We used to only call this function on resume to userspace, where no locks were held. But c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges") added a call from the allocation context without considering the gfp. Link: https://lkml.kernel.org/r/20230914152139.100822-1-hannes@cmpxchg.org Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Breno Leitao <leitao@debian.org> Reported-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-05memcontrol: ensure memcg acquired by id is properly set upJohannes Weiner1-5/+17
In the eviction recency check, we attempt to retrieve the memcg to which the folio belonged when it was evicted, by the memcg id stored in the shadow entry. However, there is a chance that the retrieved memcg is not the original memcg that has been killed, but a new one which happens to have the same id. This is a somewhat unfortunate, but acceptable and rare inaccuracy in the heuristics. However, if we retrieve this new memcg between its allocation and when it is properly attached to the memcg hierarchy, we could run into the following NULL pointer exception during the memcg hierarchy traversal done in mem_cgroup_get_nr_swap_pages(): [ 155757.793456] BUG: kernel NULL pointer dereference, address: 00000000000000c0 [ 155757.807568] #PF: supervisor read access in kernel mode [ 155757.818024] #PF: error_code(0x0000) - not-present page [ 155757.828482] PGD 401f77067 P4D 401f77067 PUD 401f76067 PMD 0 [ 155757.839985] Oops: 0000 [#1] SMP [ 155757.887870] RIP: 0010:mem_cgroup_get_nr_swap_pages+0x3d/0xb0 [ 155757.899377] Code: 29 19 4a 02 48 39 f9 74 63 48 8b 97 c0 00 00 00 48 8b b7 58 02 00 00 48 2b b7 c0 01 00 00 48 39 f0 48 0f 4d c6 48 39 d1 74 42 <48> 8b b2 c0 00 00 00 48 8b ba 58 02 00 00 48 2b ba c0 01 00 00 48 [ 155757.937125] RSP: 0018:ffffc9002ecdfbc8 EFLAGS: 00010286 [ 155757.947755] RAX: 00000000003a3b1c RBX: 000007ffffffffff RCX: ffff888280183000 [ 155757.962202] RDX: 0000000000000000 RSI: 0007ffffffffffff RDI: ffff888bbc2d1000 [ 155757.976648] RBP: 0000000000000001 R08: 000000000000000b R09: ffff888ad9cedba0 [ 155757.991094] R10: ffffea0039c07900 R11: 0000000000000010 R12: ffff888b23a7b000 [ 155758.005540] R13: 0000000000000000 R14: ffff888bbc2d1000 R15: 000007ffffc71354 [ 155758.019991] FS: 00007f6234c68640(0000) GS:ffff88903f9c0000(0000) knlGS:0000000000000000 [ 155758.036356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 155758.048023] CR2: 00000000000000c0 CR3: 0000000a83eb8004 CR4: 00000000007706e0 [ 155758.062473] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 155758.076924] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 155758.091376] PKRU: 55555554 [ 155758.096957] Call Trace: [ 155758.102016] <TASK> [ 155758.106502] ? __die+0x78/0xc0 [ 155758.112793] ? page_fault_oops+0x286/0x380 [ 155758.121175] ? exc_page_fault+0x5d/0x110 [ 155758.129209] ? asm_exc_page_fault+0x22/0x30 [ 155758.137763] ? mem_cgroup_get_nr_swap_pages+0x3d/0xb0 [ 155758.148060] workingset_test_recent+0xda/0x1b0 [ 155758.157133] workingset_refault+0xca/0x1e0 [ 155758.165508] filemap_add_folio+0x4d/0x70 [ 155758.173538] page_cache_ra_unbounded+0xed/0x190 [ 155758.182919] page_cache_sync_ra+0xd6/0x1e0 [ 155758.191738] filemap_read+0x68d/0xdf0 [ 155758.199495] ? mlx5e_napi_poll+0x123/0x940 [ 155758.207981] ? __napi_schedule+0x55/0x90 [ 155758.216095] __x64_sys_pread64+0x1d6/0x2c0 [ 155758.224601] do_syscall_64+0x3d/0x80 [ 155758.232058] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 155758.242473] RIP: 0033:0x7f62c29153b5 [ 155758.249938] Code: e8 48 89 75 f0 89 7d f8 48 89 4d e0 e8 b4 e6 f7 ff 41 89 c0 4c 8b 55 e0 48 8b 55 e8 48 8b 75 f0 8b 7d f8 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 45 f8 e8 e7 e6 f7 ff 48 8b [ 155758.288005] RSP: 002b:00007f6234c5ffd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011 [ 155758.303474] RAX: ffffffffffffffda RBX: 00007f628c4e70c0 RCX: 00007f62c29153b5 [ 155758.318075] RDX: 000000000003c041 RSI: 00007f61d2986000 RDI: 0000000000000076 [ 155758.332678] RBP: 00007f6234c5fff0 R08: 0000000000000000 R09: 0000000064d5230c [ 155758.347452] R10: 000000000027d450 R11: 0000000000000293 R12: 000000000003c041 [ 155758.362044] R13: 00007f61d2986000 R14: 00007f629e11b060 R15: 000000000027d450 [ 155758.376661] </TASK> This patch fixes the issue by moving the memcg's id publication from the alloc stage to online stage, ensuring that any memcg acquired via id must be connected to the memcg tree. Link: https://lkml.kernel.org/r/20230823225430.166925-1-nphamcs@gmail.com Fixes: f78dfc7b77d5 ("workingset: fix confusion around eviction vs refault container") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Co-developed-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-25mm: fix get_mctgt_type() kernel-docMatthew Wilcox (Oracle)1-18/+13
Convert the return values to an ReST list and tidy up the wording while I'm touching it. [akpm@linux-foundation.org: changes suggested by Randy] [willy@infradead.org: another change suggested by Randy] Link: https://lkml.kernel.org/r/ZOUZtZizeQG7PcsM@casper.infradead.org Link: https://lkml.kernel.org/r/20230818200630.2719595-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-25memcg: remove duplication detection for mem_cgroup_uncharge_swapLu Jialin1-3/+0
__mem_cgroup_uncharge_swap is only called in mem_cgroup_uncharge_swap, if mem cgroup is disabled, __mem_cgroup_uncharge_swap cannot be called. Therefore, there is no need to judge whether mem_cgroup is disabled or not. Link: https://lkml.kernel.org/r/20230819081302.1217098-1-lujialin4@huawei.com Signed-off-by: Lu Jialin <lujialin4@huawei.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-25mm: memcg: use rstat for non-hierarchical statsYosry Ahmed1-29/+38
Currently, memcg uses rstat to maintain aggregated hierarchical stats. Counters are maintained for hierarchical stats at each memcg. Rstat tracks which cgroups have updates on which cpus to keep those counters fresh on the read-side. Non-hierarchical stats are currently not covered by rstat. Their per-cpu counters are summed up on every read, which is expensive. The original implementation did the same. At some point before rstat, non-hierarchical aggregated counters were introduced by commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting"). However, those counters were updated on the performance critical write-side, which caused regressions, so they were later removed by commit 815744d75152 ("mm: memcontrol: don't batch updates of local VM stats and events"). See [1] for more detailed history. Kernel versions in between a983b5ebee57 & 815744d75152 (a year and a half) enjoyed cheap reads of non-hierarchical stats, specifically on cgroup v1. When moving to more recent kernels, a performance regression for reading non-hierarchical stats is observed. Now that we have rstat, we know exactly which percpu counters have updates for each stat. We can maintain non-hierarchical counters again, making reads much more efficient, without affecting the performance critical write-side. Hence, add non-hierarchical (i.e local) counters for the stats, and extend rstat flushing to keep those up-to-date. A caveat is that we now need a stats flush before reading local/non-hierarchical stats through {memcg/lruvec}_page_state_local() or memcg_events_local(), where we previously only needed a flush to read hierarchical stats. Most contexts reading non-hierarchical stats are already doing a flush, add a flush to the only missing context in count_shadow_nodes(). With this patch, reading memory.stat from 1000 memcgs is 3x faster on a machine with 256 cpus on cgroup v1: # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done # time cat /sys/fs/cgroup/memory/cg*/memory.stat > /dev/null real 0m0.125s user 0m0.005s sys 0m0.120s After: real 0m0.032s user 0m0.005s sys 0m0.027s To make sure there are no regressions on cgroup v2, I ran an artificial reclaim/refault stress test [2] that creates (NR_CPUS * 2) cgroups, assigns them limits, runs a worker process in each cgroup that allocates tmpfs memory equal to quadruple the limit (to invoke reclaim continuously), and then reads back the entire file (to invoke refaults). All workers are run in parallel, and zram is used as a swapping backend. Both reclaim and refault have conditional stats flushing. I ran this on a machine with 112 cpus, once on mm-unstable, and once on mm-unstable with this patch reverted. (1) A few runs without this patch: # time ./stress_reclaim_refault.sh real 0m9.949s user 0m0.496s sys 14m44.974s # time ./stress_reclaim_refault.sh real 0m10.049s user 0m0.486s sys 14m55.791s # time ./stress_reclaim_refault.sh real 0m9.984s user 0m0.481s sys 14m53.841s (2) A few runs with this patch: # time ./stress_reclaim_refault.sh real 0m9.885s user 0m0.486s sys 14m48.753s # time ./stress_reclaim_refault.sh real 0m9.903s user 0m0.495s sys 14m48.339s # time ./stress_reclaim_refault.sh real 0m9.861s user 0m0.507s sys 14m49.317s No regressions are observed with this patch. There is actually a very slight improvement. If I have to guess, maybe it's because we avoid the percpu loop in count_shadow_nodes() when calling lruvec_page_state_local(), but I could not prove this using perf, it's probably in the noise. [1] https://lore.kernel.org/lkml/20230725201811.GA1231514@cmpxchg.org/ [2] https://lore.kernel.org/lkml/CAJD7tkb17x=qwoO37uxyYXLEUVp15BQKR+Xfh7Sg9Hx-wTQ_=w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20230803185046.1385770-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20230726153223.821757-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-22mm: remove folio_test_transhuge()Matthew Wilcox (Oracle)1-1/+1
This function is misleading; people think it means "Is this a THP", when all it actually does is check whether this is a large folio. Remove it; the one remaining user should have been checking to see whether the folio is PMD sized or not. Link: https://lkml.kernel.org/r/20230816151201.3655946-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-22merge mm-hotfixes-stable into mm-stable to pick up depended-upon changesAndrew Morton1-0/+2
2023-08-21mm: remove redundant K() macro definitionZhangPeng1-1/+0
Patch series "cleanup with helper macro K()". Use helper macro K() to improve code readability. No functional modification involved. Remove redundant K() macro definition. This patch (of 7): Since commit eb8589b4f8c1 ("mm: move mem_init_print_info() to mm_init.c"), the K() macro definition has been moved to mm/internal.h. Therefore, the definitions in mm/memcontrol.c, mm/backing-dev.c and mm/oom_kill.c are redundant. Drop redundant definitions. [akpm@linux-foundation.org: oom_kill.c: remove "#undef K", per Kefeng] Link: https://lkml.kernel.org/r/20230804012559.2617515-1-zhangpeng362@huawei.com Link: https://lkml.kernel.org/r/20230804012559.2617515-2-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/memcg: fix wrong function name above obj_cgroup_charge_zswap()Miaohe Lin1-1/+1
The correct function name is obj_cgroup_may_zswap(). Correct the comment. Link: https://lkml.kernel.org/r/20230803120021.762279-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm:vmscan: fix inaccurate reclaim during proactive reclaimEfly Young1-2/+2
Before commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim and swap floods"), proactive reclaim will extreme overreclaim sometimes. But proactive reclaim still inaccurate and some extent overreclaim. Problematic case is easy to construct. Allocate lots of anonymous memory (e.g., 20G) in a memcg, then swapping by writing memory.recalim and there is a certain probability of overreclaim. For example, request 1G by writing memory.reclaim will eventually reclaim 1.7G or other values more than 1G. The reason is that reclaimer may have already reclaimed part of requested memory in one loop, but before adjust sc->nr_to_reclaim in outer loop, call shrink_lruvec() again will still follow the current sc->nr_to_reclaim to work. It will eventually lead to overreclaim. In theory, the amount of reclaimed would be in [request, 2 * request). Reclaimer usually tends to reclaim more than request. But either direct or kswapd reclaim have much smaller nr_to_reclaim targets, so it is less noticeable and not have much impact. Proactive reclaim can usually come in with a larger value, so the error is difficult to ignore. Considering proactive reclaim is usually low frequency, handle the batching into smaller chunks is a better approach. Link: https://lkml.kernel.org/r/20230721014116.3388-1-yangyifei03@kuaishou.com Signed-off-by: Efly Young <yangyifei03@kuaishou.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21memcg: convert get_obj_cgroup_from_page to get_obj_cgroup_from_folioMatthew Wilcox (Oracle)1-4/+4
As the one caller now has a folio, pass it in and use it. Removes three calls to compound_head(). Link: https://lkml.kernel.org/r/20230715042343.434588-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: enable page walking API to lock vmas during the walkSuren Baghdasaryan1-0/+2
walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock, operate correctly after introduction of per-vma locks. With per-vma locks page faults can be handled under vma lock without taking mmap_lock at all, so write locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change a concurrent page can be faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Suggested-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memcg: minor cleanup for mc_handle_present_pte()Miaohe Lin1-3/+2
When pagetable lock is held, the page will always be page_mapped(). So remove unneeded page_mapped() check. Also the page can't be freed from under us in this case. So use get_page() to get extra page reference to simplify the code. No functional change intended. Link: https://lkml.kernel.org/r/20230717113644.3026478-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memcg: fix obsolete comment above MEM_CGROUP_MAX_RECLAIM_LOOPSMiaohe Lin1-1/+1
Since commit 5660048ccac8 ("mm: move memcg hierarchy reclaim to generic reclaim code"), mem_cgroup_hierarchical_reclaim() is already renamed to mem_cgroup_soft_reclaim(). Update the corresponding comment. Link: https://lkml.kernel.org/r/20230713121432.273381-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memcg: minor cleanup for MEM_CGROUP_ID_MAXMiaohe Lin1-0/+1
MEM_CGROUP_ID_MAX is only used when CONFIG_MEMCG is configured. So remove unneeded !CONFIG_MEMCG variant. Also it's only used in mem_cgroup_alloc(), so move it from memcontrol.h to memcontrol.c. And further define it as: #define MEM_CGROUP_ID_MAX ((1UL << MEM_CGROUP_ID_SHIFT) - 1) so if someone changes MEM_CGROUP_ID_SHIFT in the future, then MEM_CGROUP_ID_MAX will be updated accordingly, as suggested by Muchun. Link: https://lkml.kernel.org/r/20230708023304.1184111-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18memcg: drop kmem.limit_in_bytesMichal Hocko1-10/+0
kmem.limit_in_bytes (v1 way to limit kernel memory usage) has been deprecated since 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes") merged in 5.16. We haven't heard about any serious users since then but it seems that the mere presence of the file is causing more harm thatn good. We (SUSE) have had several bug reports from customers where Docker based containers started to fail because a write to kmem.limit_in_bytes has failed. This was unexpected because runc code only expects ENOENT (kmem disabled) or EBUSY (tasks already running within cgroup). So a new error code was unexpected and the whole container startup failed. This has been later addressed by https://github.com/opencontainers/runc/commit/52390d68040637dfc77f9fda6bbe70952423d380 so current Docker runtimes do not suffer from the problem anymore. There are still older version of Docker in use and likely hard to get rid of completely. Address this by wiping out the file completely and effectively get back to pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration. I would recommend backporting to stable trees which have picked up 58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes"). [mhocko@suse.com: restore _KMEM switch case] Link: https://lkml.kernel.org/r/ZKe5wxdbvPi5Cwd7@dhcp22.suse.cz Link: https://lkml.kernel.org/r/20230704115240.14672-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-24mm/memcontrol: do not tweak node in mem_cgroup_init()Haifeng Xu1-2/+1
mem_cgroup_init() request for allocations from each possible node, and it's used to be a problem because NODE_DATA is not allocated for offline node. Things have already changed since commit 09f49dca570a9 ("mm: handle uninitialized numa nodes gracefully"), so it's unnecessary to check for !node_online nodes here. How to test? qemu-system-x86_64 \ -kernel vmlinux \ -initrd full.rootfs.cpio.gz \ -append "console=ttyS0,115200 root=/dev/ram0 nokaslr earlyprintk=serial oops=panic panic_on_warn" \ -drive format=qcow2,file=vm_disk.qcow2,media=disk,if=ide \ -enable-kvm \ -cpu host \ -m 8G,slots=2,maxmem=16G \ -smp cores=4,threads=1,sockets=2 \ -object memory-backend-ram,id=mem0,size=4G \ -object memory-backend-ram,id=mem1,size=4G \ -numa node,memdev=mem0,cpus=0-3,nodeid=0 \ -numa node,memdev=mem1,cpus=4-7,nodeid=1 \ -numa node,nodeid=2 \ -net nic,model=virtio,macaddr=52:54:00:12:34:58 \ -net user \ -nographic \ -rtc base=localtime \ -gdb tcp::6000 Guest state when booting: [ 0.048881] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff] [ 0.050489] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0x00000000-0x13fffffff] [ 0.052173] NODE_DATA(0) allocated [mem 0x13fffc000-0x13fffffff] [ 0.053164] NODE_DATA(1) allocated [mem 0x23fffa000-0x23fffdfff] [ 0.054187] Zone ranges: [ 0.054587] DMA [mem 0x0000000000001000-0x0000000000ffffff] [ 0.055551] DMA32 [mem 0x0000000001000000-0x00000000ffffffff] [ 0.056515] Normal [mem 0x0000000100000000-0x000000023fffffff] [ 0.057484] Movable zone start for each node [ 0.058149] Early memory node ranges [ 0.058705] node 0: [mem 0x0000000000001000-0x000000000009efff] [ 0.059679] node 0: [mem 0x0000000000100000-0x00000000bffdffff] [ 0.060659] node 0: [mem 0x0000000100000000-0x000000013fffffff] [ 0.061649] node 1: [mem 0x0000000140000000-0x000000023fffffff] [ 0.062638] Initmem setup node 0 [mem 0x0000000000001000-0x000000013fffffff] [ 0.063745] Initmem setup node 1 [mem 0x0000000140000000-0x000000023fffffff] [ 0.064855] DMA zone: 158 reserved pages exceeds freesize 0 [ 0.065746] Initializing node 2 as memoryless [ 0.066437] Initmem setup node 2 as memoryless [ 0.067132] DMA zone: 158 reserved pages exceeds freesize 0 [ 0.068037] On node 0, zone DMA: 1 pages in unavailable ranges [ 0.068265] On node 0, zone DMA: 97 pages in unavailable ranges [ 0.124755] On node 0, zone Normal: 32 pages in unavailable ranges cat /sys/devices/system/node/online 0-1 cat /sys/devices/system/node/possible 0-2 Link: https://lkml.kernel.org/r/20230619130442.2487-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-20mm/memcg: remove return value of mem_cgroup_scan_tasks()ZhangPeng1-5/+4
No user checks the return value of mem_cgroup_scan_tasks(). Make the return value void. Link: https://lkml.kernel.org/r/20230616063030.977586-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-20mm: kill lock|unlock_page_memcg()Kefeng Wang1-14/+4
Since commit c7c3dec1c9db ("mm: rmap: remove lock_page_memcg()"), no more user, kill lock_page_memcg() and unlock_page_memcg(). Link: https://lkml.kernel.org/r/20230614143612.62575-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-20mm: ptep_get() conversionRyan Roberts1-2/+2
Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-20mm/various: give up if pte_offset_map[_lock]() failsHugh Dickins1-4/+4
Following the examples of nearby code, various functions can just give up if pte_offset_map() or pte_offset_map_lock() fails. And there's no need for a preliminary pmd_trans_unstable() or other such check, since such cases are now safely handled inside. Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: use helper macro FLUSH_TIMEMiaohe Lin1-1/+1
Use helper macro FLUSH_TIME to indicate the flush time to improve the readability a bit. No functional change intended. Link: https://lkml.kernel.org/r/20230603072116.1101690-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm/memcontrol: export memcg.swap watermark via sysfs for v2 memcgLars R. Damerow1-0/+13
This patch is similar to commit 8e20d4b33266 ("mm/memcontrol: export memcg->watermark via sysfs for v2 memcg"), but exports the swap counter's watermark. We allocate jobs to our compute farm using heuristics determined by memory and swap usage from previous jobs. Tracking the peak swap usage for new jobs is important for determining when jobs are exceeding their expected bounds, or when our baseline metrics are getting outdated. Our toolset was written to use the "memory.memsw.max_usage_in_bytes" file in cgroups v1, and altering it to poll cgroups v2's "memory.swap.current" would give less accurate results as well as add complication to the code. Having this watermark exposed in sysfs is much preferred. Link: https://lkml.kernel.org/r/20230524181734.125696-1-lars@pixar.com Signed-off-by: Lars R. Damerow <lars@pixar.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: multi-gen LRU: cleanup lru_gen_soft_reclaim()T.J. Alumbaugh1-1/+1
lru_gen_soft_reclaim() gets the lruvec from the memcg and node ID to keep a cleaner interface on the caller side. Link: https://lkml.kernel.org/r/20230522112058.2965866-2-talumbau@google.com Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Reviewed-by: Yuanchu Xie <yuanchu@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm/memcontrol: fix typo in commentHaifeng Xu1-1/+1
Replace 'then' with 'than'. Link: https://lkml.kernel.org/r/20230522095233.4246-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg, oom: remove explicit wakeup in mem_cgroup_oom_synchronize()Haifeng Xu1-8/+1
Before commit 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path"), all memcg oom killers were delayed to page fault path. And the explicit wakeup is used in this case: thread A: ... if (locked) { // complete oom-kill, hold the lock mem_cgroup_oom_unlock(memcg); ... } ... thread B: ... if (locked && !memcg->oom_kill_disable) { ... } else { schedule(); // can't acquire the lock ... } ... The reason is that thread A kicks off the OOM-killer, which leads to wakeups from the uncharges of the exiting task. But thread B is not guaranteed to see them if it enters the OOM path after the OOM kills but before thread A releases the lock. Now only oom_kill_disable case is handled from the #PF path. In that case it is userspace to trigger the wake up not the #PF path itself. All potential paths to free some charges are responsible to call memcg_oom_recover() , so the explicit wakeup is not needed in the mem_cgroup_oom_synchronize() path which doesn't release any memory itself. Link: https://lkml.kernel.org/r/20230419030739.115845-2-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg, oom: remove unnecessary check in mem_cgroup_oom_synchronize()Haifeng Xu1-10/+3
mem_cgroup_oom_synchronize() is only used when the memcg oom handling is handed over to the edge of the #PF path. Since commit 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path") this is the case only when the kernel memcg oom killer is disabled (current->memcg_in_oom is only set if memcg->oom_kill_disable). Therefore a check for oom_kill_disable in mem_cgroup_oom_synchronize() is not required. Link: https://lkml.kernel.org/r/20230419030739.115845-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: remove mem_cgroup_flush_stats_atomic()Yosry Ahmed1-19/+5
Previous patches removed all callers of mem_cgroup_flush_stats_atomic(). Remove the function and simplify the code. Link: https://lkml.kernel.org/r/20230421174020.2994750-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: calculate root usage from global stateYosry Ahmed1-19/+5
Currently, we approximate the root usage by adding the memcg stats for anon, file, and conditionally swap (for memsw). To read the memcg stats we need to invoke an rstat flush. rstat flushes can be expensive, they scale with the number of cpus and cgroups on the system. mem_cgroup_usage() is called by memcg_events()->mem_cgroup_threshold() with irqs disabled, so such an expensive operation with irqs disabled can cause problems. Instead, approximate the root usage from global state. This is not 100% accurate, but the root usage has always been ill-defined anyway. Link: https://lkml.kernel.org/r/20230421174020.2994750-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: flush stats non-atomically in mem_cgroup_wb_stats()Yosry Ahmed1-5/+1
The previous patch moved the wb_over_bg_thresh()->mem_cgroup_wb_stats() code path in wb_writeback() outside the lock section. We no longer need to flush the stats atomically. Flush the stats non-atomically. Link: https://lkml.kernel.org/r/20230421174020.2994750-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: dump memory.stat during cgroup OOM for v1Yosry Ahmed1-25/+35
Patch series "memcg: OOM log improvements", v2. This short patch series brings back some cgroup v1 stats in OOM logs that were unnecessarily changed before. It also makes memcg OOM logs less reliant on printk() internals. This patch (of 2): Commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup OOM") made sure we dump all the stats in memory.stat during a cgroup OOM, but it also introduced a slight behavioral change. The code used to print the non-hierarchical v1 cgroup stats for the entire cgroup subtree, now it only prints the v2 cgroup stats for the cgroup under OOM. For cgroup v1 users, this introduces a few problems: (a) The non-hierarchical stats of the memcg under OOM are no longer shown. (b) A couple of v1-only stats (e.g. pgpgin, pgpgout) are no longer shown. (c) We show the list of cgroup v2 stats, even in cgroup v1. This list of stats is not tracked with v1 in mind. While most of the stats seem to be working on v1, there may be some stats that are not fully or correctly tracked. Although OOM log is not set in stone, we should not change it for no reason. When upgrading the kernel version to a version including commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup OOM"), these behavioral changes are noticed in cgroup v1. The fix is simple. Commit c8713d0b2312 ("mm: memcontrol: dump memory.stat during cgroup OOM") separated stats formatting from stats display for v2, to reuse the stats formatting in the OOM logs. Do the same for v1. Move the v2 specific formatting from memory_stat_format() to memcg_stat_format(), add memcg1_stat_format() for v1, and make memory_stat_format() select between them based on cgroup version. Since memory_stat_show() now works for both v1 & v2, drop memcg_stat_show(). Link: https://lkml.kernel.org/r/20230428132406.2540811-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20230428132406.2540811-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Petr Mladek <pmladek@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo()Yosry Ahmed1-13/+14
Currently, we format all the memcg stats into a buffer in mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs. However, this buffer is large in size. Although it is currently working as intended, ther is a dependency between the memcg stats buffer and the printk record size limit. If we add more stats in the future and the buffer becomes larger than the printk record size limit, or if the prink record size limit is reduced, the logs may be truncated. It is safer to use seq_buf_do_printk(), which will automatically break up the buffer at line breaks and issue small printk() calls. Refactor the code to move the seq_buf from memory_stat_format() to its callers, and use seq_buf_do_printk() to print the seq_buf in mem_cgroup_print_oom_meminfo(). Link: https://lkml.kernel.org/r/20230428132406.2540811-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->cachedRoman Gushchin1-6/+6
A memcg pointer in the percpu stock can be accessed by drain_all_stock() from another cpu in a lockless way. In theory it might lead to an issue, similar to the one which has been discovered with stock->cached_objcg, where the pointer was zeroed between the check for being NULL and dereferencing. In this case the issue is unlikely a real problem, but to make it bulletproof and similar to stock->cached_objcg, let's annotate all accesses to stock->cached with READ_ONCE()/WTRITE_ONCE(). Link: https://lkml.kernel.org/r/20230502160839.361544-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>