<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/page_counter.h, branch v6.12.80</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2024-09-02T03:25:53+00:00</updated>
<entry>
<title>mm, memcg: cg2 memory{.swap,}.peak write handlers</title>
<updated>2024-09-02T03:25:53+00:00</updated>
<author>
<name>David Finkel</name>
<email>davidf@vimeo.com</email>
</author>
<published>2024-07-29T14:37:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c6f53ed8f213a66ae8bc40aa9112c32412c35a21'/>
<id>urn:sha1:c6f53ed8f213a66ae8bc40aa9112c32412c35a21</id>
<content type='text'>
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.


This patch (of 2):

Other mechanisms for querying the peak memory usage of either a process or
v1 memory cgroup allow for resetting the high watermark.  Restore parity
with those mechanisms, but with a less racy API.

For example:
 - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
   the high watermark.
 - writing "5" to the clear_refs pseudo-file in a processes's proc
   directory resets the peak RSS.

This change is an evolution of a previous patch, which mostly copied the
cgroup v1 behavior, however, there were concerns about races/ownership
issues with a global reset, so instead this change makes the reset
filedescriptor-local.

Writing any non-empty string to the memory.peak and memory.swap.peak
pseudo-files reset the high watermark to the current usage for subsequent
reads through that same FD.

Notably, following Johannes's suggestion, this implementation moves the
O(FDs that have written) behavior onto the FD write(2) path.  Instead, on
the page-allocation path, we simply add one additional watermark to
conditionally bump per-hierarchy level in the page-counter.

Additionally, this takes Longman's suggestion of nesting the
page-charging-path checks for the two watermarks to reduce the number of
common-case comparisons.

This behavior is particularly useful for work scheduling systems that need
to track memory usage of worker processes/cgroups per-work-item.  Since
memory can't be squeezed like CPU can (the OOM-killer has opinions), these
systems need to track the peak memory usage to compute system/container
fullness when binpacking workitems.

Most notably, Vimeo's use-case involves a system that's doing global
binpacking across many Kubernetes pods/containers, and while we can use
PSI for some local decisions about overload, we strive to avoid packing
workloads too tightly in the first place.  To facilitate this, we track
the peak memory usage.  However, since we run with long-lived workers (to
amortize startup costs) we need a way to track the high watermark while a
work-item is executing.  Polling runs the risk of missing short spikes
that last for timescales below the polling interval, and peak memory
tracking at the cgroup level is otherwise perfect for this use-case.

As this data is used to ensure that binpacked work ends up with sufficient
headroom, this use-case mostly avoids the inaccuracies surrounding
reclaimable memory.

Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com
Signed-off-by: David Finkel &lt;davidf@vimeo.com&gt;
Suggested-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Suggested-by: Waiman Long &lt;longman@redhat.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Muchun Song &lt;muchun.song@linux.dev&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Shuah Khan &lt;shuah@kernel.org&gt;
Cc: Zefan Li &lt;lizefan.x@bytedance.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: page_counters: initialize usage using ATOMIC_LONG_INIT() macro</title>
<updated>2024-09-02T03:25:51+00:00</updated>
<author>
<name>Roman Gushchin</name>
<email>roman.gushchin@linux.dev</email>
</author>
<published>2024-07-26T20:31:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=57979fabff554bf4d1edeeee69251b22ca9bf55e'/>
<id>urn:sha1:57979fabff554bf4d1edeeee69251b22ca9bf55e</id>
<content type='text'>
When a page_counter structure is initialized, there is no need to use an
atomic set operation to initialize the usage counter because at this point
the structure is not visible to anybody else.  ATOMIC_LONG_INIT() is what
should be used in such cases.

Link: https://lkml.kernel.org/r/20240726203110.1577216-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Acked-by: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: page_counters: put page_counter_calculate_protection() under CONFIG_MEMCG</title>
<updated>2024-09-02T03:25:51+00:00</updated>
<author>
<name>Roman Gushchin</name>
<email>roman.gushchin@linux.dev</email>
</author>
<published>2024-07-26T20:31:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=941ce635234162c420c9185816c59f7d5dd1ad07'/>
<id>urn:sha1:941ce635234162c420c9185816c59f7d5dd1ad07</id>
<content type='text'>
Put page_counter_calculate_protection() under CONFIG_MEMCG.

The protection functionality (min/low limits) is not supported by any
other cgroup subsystem, so page_counter_calculate_protection() and related
static effective_protection() can be compiled out if CONFIG_MEMCG is not
enabled.

Link: https://lkml.kernel.org/r/20240726203110.1577216-3-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Acked-by: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: memcg: don't call propagate_protected_usage() needlessly</title>
<updated>2024-09-02T03:25:50+00:00</updated>
<author>
<name>Roman Gushchin</name>
<email>roman.gushchin@linux.dev</email>
</author>
<published>2024-07-26T20:31:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f77bd4b14ccfd38dfcfe67eecad517b8ec1b7f37'/>
<id>urn:sha1:f77bd4b14ccfd38dfcfe67eecad517b8ec1b7f37</id>
<content type='text'>
Patch series "mm: memcg: page counters optimizations", v3.

This patchset contains 3 independent small optimizations of page counters.


This patch (of 3):

Memory protection (min/low) requires a constant tracking of protected
memory usage.  propagate_protected_usage() is called on each page counters
update and does a number of operations even in cases when the actual
memory protection functionality is not supported (e.g.  hugetlb cgroups or
memcg swap counters).

It's obviously inefficient and leads to a waste of CPU cycles.  It can be
addressed by calling propagate_protected_usage() only for the counters
which do support memory guarantees.  As of now it's only memcg-&gt;memory -
the unified memory memcg counter.

Link: https://lkml.kernel.org/r/20240726203110.1577216-2-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Acked-by: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/page_counter: move calculating protection values to page_counter</title>
<updated>2024-07-12T22:52:20+00:00</updated>
<author>
<name>Maarten Lankhorst</name>
<email>maarten.lankhorst@linux.intel.com</email>
</author>
<published>2024-07-03T11:25:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a8585ac68621983587f1701b7567978fcbcd9573'/>
<id>urn:sha1:a8585ac68621983587f1701b7567978fcbcd9573</id>
<content type='text'>
It's a lot of math, and there is nothing memcontrol specific about it. 
This makes it easier to use inside of the drm cgroup controller.

[akpm@linux-foundation.org: fix kerneldoc, per Jeff Johnson]
Link: https://lkml.kernel.org/r/20240703112510.36424-1-maarten.lankhorst@linux.intel.com
Signed-off-by: Maarten Lankhorst &lt;maarten.lankhorst@linux.intel.com&gt;
Acked-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Acked-by: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Jeff Johnson &lt;quic_jjohnson@quicinc.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: reduce dependencies on &lt;linux/kernel.h&gt;</title>
<updated>2024-02-22T18:24:52+00:00</updated>
<author>
<name>Christophe JAILLET</name>
<email>christophe.jaillet@wanadoo.fr</email>
</author>
<published>2024-02-02T21:23:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=09dacb7875395b8761cde921fff767a7cd3ab862'/>
<id>urn:sha1:09dacb7875395b8761cde921fff767a7cd3ab862</id>
<content type='text'>
"page_counter.h" does not need &lt;linux/kernel.h&gt;. &lt;linux/limits.h&gt; is enough
to get LONG_MAX.

Files that include page_counter.h are limited. They have been compile
tested or checked.

$ git grep page_counter\.h
include/linux/hugetlb_cgroup.h: struct page_counter hugepage[HUGE_MAX_HSTATE];
	--&gt; all files that include it have been compile tested

include/linux/memcontrol.h:#include &lt;linux/page_counter.h&gt;
	--&gt; &lt;linux/kernel.h&gt; has been added, to be safe

include/net/sock.h:#include &lt;linux/page_counter.h&gt;
	--&gt; already include &lt;linux/kernel.h&gt;

mm/hugetlb_cgroup.c:#include &lt;linux/page_counter.h&gt;
mm/memcontrol.c:#include &lt;linux/page_counter.h&gt;
mm/page_counter.c:#include &lt;linux/page_counter.h&gt;
	--&gt; compile tested

Link: https://lkml.kernel.org/r/adfdbe21c4d06400d7bd802868762deb85cae8b6.1706908921.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET &lt;christophe.jaillet@wanadoo.fr&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: deduplicate cacheline padding code</title>
<updated>2022-09-27T02:46:29+00:00</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeelb@google.com</email>
</author>
<published>2022-08-26T23:06:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e6ad640bc404eb298dd1880113131768ddf5c6a8'/>
<id>urn:sha1:e6ad640bc404eb298dd1880113131768ddf5c6a8</id>
<content type='text'>
There are three users (mmzone.h, memcontrol.h, page_counter.h) using
similar code for forcing cacheline padding between fields of different
structures.  Dedup that code.

Link: https://lkml.kernel.org/r/20220826230642.566725-1-shakeelb@google.com
Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Suggested-by: Feng Tang &lt;feng.tang@intel.com&gt;
Reviewed-by: Feng Tang &lt;feng.tang@intel.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: page_counter: rearrange struct page_counter fields</title>
<updated>2022-09-12T03:26:01+00:00</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeelb@google.com</email>
</author>
<published>2022-08-25T00:05:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=408587baee39753304dd572dacf374f75545d25a'/>
<id>urn:sha1:408587baee39753304dd572dacf374f75545d25a</id>
<content type='text'>
With memcg v2 enabled, memcg-&gt;memory.usage is a very hot member for the
workloads doing memcg charging on multiple CPUs concurrently. 
Particularly the network intensive workloads.  In addition, there is a
false cache sharing between memory.usage and memory.high on the charge
path.  This patch moves the usage into a separate cacheline and move all
the read most fields into separate cacheline.

To evaluate the impact of this optimization, on a 72 CPUs machine, we ran
the following workload in a three level of cgroup hierarchy.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)	10482.7 Mbps
With patch		12413.7 Mbps (18.4% improvement)

With the patch, the throughput improved by 18.4%.

One side-effect of this patch is the increase in the size of struct
mem_cgroup.  For example with this patch on 64 bit build, the size of
struct mem_cgroup increased from 4032 bytes to 4416 bytes.  However for
the performance improvement, this additional size is worth it.  In
addition there are opportunities to reduce the size of struct mem_cgroup
like deprecation of kmem and tcpmem page counters and better packing.

Link: https://lkml.kernel.org/r/20220825000506.239406-3-shakeelb@google.com
Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Reviewed-by: Feng Tang &lt;feng.tang@intel.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: "Michal Koutný" &lt;mkoutny@suse.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: page_counter: re-layout structure to reduce false sharing</title>
<updated>2021-02-24T21:38:29+00:00</updated>
<author>
<name>Feng Tang</name>
<email>feng.tang@intel.com</email>
</author>
<published>2021-02-24T20:04:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=802f1d522d5fdaefc2b935141bc8fe03d43a99ab'/>
<id>urn:sha1:802f1d522d5fdaefc2b935141bc8fe03d43a99ab</id>
<content type='text'>
When checking a memory cgroup related performance regression [1], from the
perf c2c profiling data, we found high false sharing for accessing 'usage'
and 'parent'.

On 64 bit system, the 'usage' and 'parent' are close to each other, and
easy to be in one cacheline (for cacheline size == 64+ B).  'usage' is
usally written, while 'parent' is usually read as the cgroup's
hierarchical counting nature.

So move the 'parent' to the end of the structure to make sure they
are in different cache lines.

Following are some performance data with the patch, against v5.11-rc1.  [
In the data, A means a platform with 2 sockets 48C/96T, B is a platform of
4 sockests 72C/144T, and if a %stddev will be shown bigger than 2%,
P100/P50 means number of test tasks equals to 100%/50% of nr_cpu]

will-it-scale/malloc1
---------------------
	   v5.11-rc1			v5.11-rc1+patch

A-P100	     15782 ±  2%      -0.1%      15765 ±  3%  will-it-scale.per_process_ops
A-P50	     21511            +8.9%      23432        will-it-scale.per_process_ops
B-P100	      9155            +2.2%       9357        will-it-scale.per_process_ops
B-P50	     10967            +7.1%      11751 ±  2%  will-it-scale.per_process_ops

will-it-scale/pagefault2
------------------------
	   v5.11-rc1			v5.11-rc1+patch

A-P100	     79028            +3.0%      81411        will-it-scale.per_process_ops
A-P50	    183960 ±  2%      +4.4%     192078 ±  2%  will-it-scale.per_process_ops
B-P100	     85966            +9.9%      94467 ±  3%  will-it-scale.per_process_ops
B-P50	    198195            +9.8%     217526        will-it-scale.per_process_ops

fio (4k/1M is block size)
-------------------------
	   v5.11-rc1			v5.11-rc1+patch

A-P50-r-4k     16881 ±  2%    +1.2%      17081 ±  2%  fio.read_bw_MBps
A-P50-w-4k      3931          +4.5%       4111 ±  2%  fio.write_bw_MBps
A-P50-r-1M     15178          -0.2%      15154        fio.read_bw_MBps
A-P50-w-1M      3924          +0.1%       3929        fio.write_bw_MBps

[1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/

Link: https://lkml.kernel.org/r/1611040814-33449-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang &lt;feng.tang@intel.com&gt;
Reviewed-by: Roman Gushchin &lt;guro@fb.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memcg: move cgroup high memory limit setting into struct page_counter</title>
<updated>2020-06-02T17:59:09+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2020-06-02T04:49:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d1663a907bd348f912b7f7088e83ca1b6fd3309f'/>
<id>urn:sha1:d1663a907bd348f912b7f7088e83ca1b6fd3309f</id>
<content type='text'>
High memory limit is currently recorded directly in struct mem_cgroup.
We are about to add a high limit for swap, move the field to struct
page_counter and add some helpers.

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Chris Down &lt;chris@chrisdown.name&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Link: http://lkml.kernel.org/r/20200527195846.102707-4-kuba@kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
