<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/percpu_counter.h, branch v6.19.11</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.19.11</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.19.11'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2023-10-18T21:34:14+00:00</updated>
<entry>
<title>percpu_counter: extend _limited_add() to negative amounts</title>
<updated>2023-10-18T21:34:14+00:00</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2023-10-12T04:40:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1431996bf9088ee59f8017637ab9a7f89909ae63'/>
<id>urn:sha1:1431996bf9088ee59f8017637ab9a7f89909ae63</id>
<content type='text'>
Though tmpfs does not need it, percpu_counter_limited_add() can be twice
as useful if it works sensibly with negative amounts (subs) - typically
decrements towards a limit of 0 or nearby: as suggested by Dave Chinner.

And in the course of that reworking, skip the percpu counter sum if it is
already obvious that the limit would be passed: as suggested by Tim Chen.

Extend the comment above __percpu_counter_limited_add(), defining the
behaviour with positive and negative amounts, allowing negative limits,
but not bothering about overflow beyond S64_MAX.
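
For illustration, a usage sketch of the extended semantics (caller and
field names are hypothetical): a positive amount must not take the count
above the limit; a negative amount must not take it below.

    if (!percpu_counter_limited_add(&amp;sbinfo-&gt;used_blocks, maxblocks, 1))
        return -ENOSPC;  /* adding 1 would take the count past maxblocks */

    /* later, on release: subtract 1, but never below the limit of 0 */
    percpu_counter_limited_add(&amp;sbinfo-&gt;used_blocks, 0, -1);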

Link: https://lkml.kernel.org/r/8f86083b-c452-95d4-365b-f16a2e4ebcd4@google.com
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: Carlos Maiolino &lt;cem@kernel.org&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Cc: Darrick J. Wong &lt;djwong@kernel.org&gt;
Cc: Dave Chinner &lt;dchinner@redhat.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Tim Chen &lt;tim.c.chen@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>shmem,percpu_counter: add _limited_add(fbc, limit, amount)</title>
<updated>2023-10-18T21:34:14+00:00</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2023-09-30T03:42:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=beb9868628445306958fd7b2da1cd369a4a381cc'/>
<id>urn:sha1:beb9868628445306958fd7b2da1cd369a4a381cc</id>
<content type='text'>
Percpu counter's compare and add are separate functions: without locking
around them (which would defeat their purpose), it has been possible to
overflow the intended limit.  Imagine all the other CPUs fallocating tmpfs
huge pages to the limit, in between this CPU's compare and its add.

I have not seen reports of that happening; but tmpfs's recent addition of
dquot_alloc_block_nodirty() in between the compare and the add makes it
even more likely, and I'd be uncomfortable to leave it unfixed.

Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.

I believe this implementation is correct, and slightly more efficient than
the combination of compare and add (taking the lock once rather than twice
when nearing full - the last 128MiB of a tmpfs volume on a machine with
128 CPUs and 4KiB pages); but it does beg for a better design - when
nearing full, there is no new batching, but the costly percpu counter sum
across CPUs still has to be done, while locked.

Follow __percpu_counter_sum()'s example, including cpu_dying_mask as well
as cpu_online_mask: but shouldn't __percpu_counter_compare() and
__percpu_counter_limited_add() then be adding a num_dying_cpus() to
num_online_cpus(), when they calculate the maximum which could be held
across CPUs?  But the times when it matters would be vanishingly rare.
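
To illustrate, a sketch of the racy pattern and its replacement, with
tmpfs-style names (hypothetical, not the actual shmem call site):

    /* old: another CPU can add between the compare and the add */
    if (percpu_counter_compare(&amp;sbinfo-&gt;used_blocks,
                               maxblocks - pages) &gt; 0)
        return -ENOSPC;
    percpu_counter_add(&amp;sbinfo-&gt;used_blocks, pages);  /* may overshoot */

    /* new: the check and the add take the counter lock once, together */
    if (!percpu_counter_limited_add(&amp;sbinfo-&gt;used_blocks,
                                    maxblocks, pages))
        return -ENOSPC;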

Link: https://lkml.kernel.org/r/bb817848-2d19-bcc8-39ca-ea179af0f0b4@google.com
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Tim Chen &lt;tim.c.chen@intel.com&gt;
Cc: Dave Chinner &lt;dchinner@redhat.com&gt;
Cc: Darrick J. Wong &lt;djwong@kernel.org&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: Carlos Maiolino &lt;cem@kernel.org&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pcpcntr: add group allocation/free</title>
<updated>2023-08-25T15:06:53+00:00</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2023-08-23T05:06:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c439d5e8a0deb7310b5bb4e5f2fe47c40ff5297f'/>
<id>urn:sha1:c439d5e8a0deb7310b5bb4e5f2fe47c40ff5297f</id>
<content type='text'>
Allocations and frees are globally serialized on the pcpu lock (and the
CPU hotplug lock if enabled, which is the case on Debian).

At least one frequent consumer allocates 4 back-to-back counters (and
frees them in the same manner), exacerbating the problem.

While this does not fully remedy scalability issues, it is a step
towards that goal and provides immediate relief.
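
A sketch of the grouped interface (the array and error handling are
illustrative):

    struct percpu_counter counters[4];

    /* one percpu allocation (and one hotplug-lock round trip) for all four */
    if (percpu_counter_init_many(counters, 0, GFP_KERNEL, 4))
        return -ENOMEM;

    /* and symmetrically on teardown */
    percpu_counter_destroy_many(counters, 4);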

Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Reviewed-by: Dennis Zhou &lt;dennis@kernel.org&gt;
Reviewed-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Link: https://lore.kernel.org/r/20230823050609.2228718-2-mjguzik@gmail.com
[Dennis: reflowed a few lines]
Signed-off-by: Dennis Zhou &lt;dennis@kernel.org&gt;
</content>
</entry>
<entry>
<title>pcpcntr: remove percpu_counter_sum_all()</title>
<updated>2023-03-19T17:02:04+00:00</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2023-03-16T00:31:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e9b60c7f97130795c7aa81a649ae4b93a172a277'/>
<id>urn:sha1:e9b60c7f97130795c7aa81a649ae4b93a172a277</id>
<content type='text'>
percpu_counter_sum_all() is now redundant as the race condition it
was invented to handle is now dealt with by percpu_counter_sum()
directly and all users of percpu_counter_sum_all() have been
removed.

Remove it.

This effectively reverts f689054aace2 ("percpu_counter: add
percpu_counter_sum_all interface"), except for the cpumask-iteration
fix to percpu_counter_sum() made earlier in this series.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
</content>
</entry>
<entry>
<title>include/linux/percpu_counter.h: race in uniprocessor percpu_counter_add()</title>
<updated>2023-02-03T06:50:01+00:00</updated>
<author>
<name>Manfred Spraul</name>
<email>manfred@colorfullife.com</email>
</author>
<published>2022-12-16T15:04:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=88ad32a799ddc92eafd2ae204cb43f04ac20a05c'/>
<id>urn:sha1:88ad32a799ddc92eafd2ae204cb43f04ac20a05c</id>
<content type='text'>
The percpu interface is supposed to be preempt- and irq-safe.

But the uniprocessor implementation of percpu_counter_add() is not
irq-safe: if an interrupt happens during the +=, the result is
undefined.

Therefore, switch from preempt_disable() to local_irq_save().  This
prevents interrupts from interrupting the +=, and as a side effect
prevents preemption.
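
The resulting !CONFIG_SMP stub, roughly:

    static inline void
    percpu_counter_add(struct percpu_counter *fbc, s64 amount)
    {
        unsigned long flags;

        local_irq_save(flags);  /* excludes irqs; preemption as a side effect */
        fbc-&gt;count += amount;
        local_irq_restore(flags);
    }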

Link: https://lkml.kernel.org/r/20221216150441.200533-2-manfred@colorfullife.com
Signed-off-by: Manfred Spraul &lt;manfred@colorfullife.com&gt;
Cc: "Sun, Jiebin" &lt;jiebin.sun@intel.com&gt;
Cc: &lt;1vier1@web.de&gt;
Cc: Alexander Sverdlin &lt;alexander.sverdlin@siemens.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>percpu_counter: add percpu_counter_sum_all interface</title>
<updated>2022-11-30T23:58:40+00:00</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeelb@google.com</email>
</author>
<published>2022-11-09T01:20:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f689054aace2ff13af2e9a44a74fbba650ca31ba'/>
<id>urn:sha1:f689054aace2ff13af2e9a44a74fbba650ca31ba</id>
<content type='text'>
The percpu_counter is used in scenarios where performance matters more
than accuracy.  For percpu_counter users who want more accurate
information on their slowpath, percpu_counter_sum() is provided, which
traverses all the online CPUs to accumulate the data.  Traversing only
the online CPUs suffices because percpu_counter implements a CPU
offline callback which syncs the local data of the offlined CPU.

However there is a small race window between percpu_counter_sum's
traversal of the online CPUs and the CPU offline callback.  The offline
callback has to traverse all the percpu_counters on the system to flush
the CPU-local data, and there can be a lot of them.  During that time,
the CPU which is going offline has already been published as offline to
all readers.  So, while the offline callback is running,
percpu_counter_sum can be called for a counter which still has state on
the CPU going offline.  Since percpu_counter_sum only traverses online
CPUs, it will skip that specific CPU, and the offline callback might
not yet have flushed the state for that specific percpu_counter on that
offlined CPU.

Normally this is not an issue because percpu_counter users can tolerate
some inaccuracy for a small time window.  However, a new user, namely
mm_struct on the cleanup path, wants to check the exact state of the
percpu_counter through check_mm().  For such users, this patch
introduces percpu_counter_sum_all(), which traverses all possible CPUs;
it is used in fork.c:check_mm() to avoid the potential race.

This issue is exposed by the later patch "mm: convert mm's rss stats into
percpu_counter".

Link: https://lkml.kernel.org/r/20221109012011.881058-1-shakeelb@google.com
Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Reported-by: Marek Szyprowski &lt;m.szyprowski@samsung.com&gt;
Tested-by: Marek Szyprowski &lt;m.szyprowski@samsung.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: convert mm's rss stats into percpu_counter</title>
<updated>2022-11-30T23:58:40+00:00</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeelb@google.com</email>
</author>
<published>2022-10-24T05:28:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f1a7941243c102a44e8847e3b94ff4ff3ec56f25'/>
<id>urn:sha1:f1a7941243c102a44e8847e3b94ff4ff3ec56f25</id>
<content type='text'>
Currently mm_struct maintains rss_stats which are updated on the page
fault and unmapping codepaths.  On the page fault codepath the updates
are cached per thread, with a batch of TASK_RSS_EVENTS_THRESH, which is
64.  The caching exists for the performance of multithreaded
applications, where the rss_stats updates would otherwise become a
hotspot.

However this optimization comes at the cost of an error margin in the
rss stats.  The rss_stats for applications with a large number of
threads can be very skewed.  At worst the error margin is
(nr_threads * 64), and we have a lot of applications with 100s of
threads, so the error margin can be very high.  Internally we had to
reduce TASK_RSS_EVENTS_THRESH to 32.

Recently we started seeing unbounded errors in rss_stats for specific
applications which use TCP receive zerocopy (rx0cp).  It seems the
vm_insert_pages() codepath does not sync rss_stats at all.

This patch converts rss_stats into percpu_counters, which reduces the
error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
The conversion also enables us to get accurate stats in situations
where accuracy is more important than the CPU cost.

This patch does not make such tradeoffs - it just uses
percpu_counter_add_local() for the updates and percpu_counter_sum()
(or percpu_counter_sync() + percpu_counter_read()) for the readers.  At
the moment the readers are the procfs interface, the oom killer and
memory reclaim, which I think are not performance critical and should
be fine with a slow read.  However, I think we can make that change in
a separate patch.
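
The resulting pattern, roughly (call sites hypothetical):

    /* hot path: stays CPU-local until the local delta overflows the batch */
    percpu_counter_add_local(&amp;mm-&gt;rss_stat[MM_ANONPAGES], 1);

    /* slow but accurate read for procfs, the oom killer and reclaim */
    s64 anon = percpu_counter_sum(&amp;mm-&gt;rss_stat[MM_ANONPAGES]);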

Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com
Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Marek Szyprowski &lt;m.szyprowski@samsung.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>percpu: add percpu_counter_add_local and percpu_counter_sub_local</title>
<updated>2022-10-03T21:21:43+00:00</updated>
<author>
<name>Jiebin Sun</name>
<email>jiebin.sun@intel.com</email>
</author>
<published>2022-09-13T19:25:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5d0ce3595ab75330a15cec914096efbbb8b41e4a'/>
<id>urn:sha1:5d0ce3595ab75330a15cec914096efbbb8b41e4a</id>
<content type='text'>
Patch series "/msg: mitigate the lock contention in ipc/msg", v6.

Here are two patches to mitigate the lock contention in ipc/msg.

The 1st patch adds the new interfaces percpu_counter_add_local and
percpu_counter_sub_local.  The batch size in percpu_counter_add_batch
should be very large in the heavy-write, rare-read case.  Add the
"_local" versions, which mostly do local adds, reducing global updates
and mitigating lock contention on the write side.

The 2nd patch uses percpu_counter instead of atomic updates in ipc/msg.
The msg_bytes and msg_hdrs atomic counters are frequently updated when
an IPC msg queue is in heavy use, causing heavy cache bouncing and
overhead.  Changing them to percpu_counters greatly improves
performance.  Since there is one percpu struct per namespace, the
additional memory cost is minimal.  Reading of the count is done in the
msgctl call, which is infrequent, so the need to sum up the counts
across CPUs is also infrequent.


This patch (of 2):

The batch size in percpu_counter_add_batch should be very large in the
heavy-write, rare-read case.  Add the "_local" version, which mostly
does local adds, reducing global updates and mitigating lock contention
on the write side.
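
A sketch of the additions: the _local variants reduce to the batch
interface with the largest possible batch.

    /* With batch = INT_MAX, updates stay in the local counter; the
       global count (and its lock) is touched only on s32 overflow. */
    static inline void
    percpu_counter_add_local(struct percpu_counter *fbc, s64 amount)
    {
        percpu_counter_add_batch(fbc, amount, INT_MAX);
    }

    static inline void
    percpu_counter_sub_local(struct percpu_counter *fbc, s64 amount)
    {
        percpu_counter_add_local(fbc, -amount);
    }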

Link: https://lkml.kernel.org/r/20220913192538.3023708-1-jiebin.sun@intel.com
Link: https://lkml.kernel.org/r/20220913192538.3023708-2-jiebin.sun@intel.com
Signed-off-by: Jiebin Sun &lt;jiebin.sun@intel.com&gt;
Reviewed-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Cc: Alexander Mikhalitsyn &lt;alexander.mikhalitsyn@virtuozzo.com&gt;
Cc: Alexey Gladkov &lt;legion@kernel.org&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Dennis Zhou &lt;dennis@kernel.org&gt;
Cc: "Eric W . Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Manfred Spraul &lt;manfred@colorfullife.com&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vasily Averin &lt;vasily.averin@linux.dev&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>percpu_counter: add percpu_counter_sync()</title>
<updated>2020-08-07T18:33:26+00:00</updated>
<author>
<name>Feng Tang</name>
<email>feng.tang@intel.com</email>
</author>
<published>2020-08-07T06:23:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0a4954a850b0c4d0a5d18b1a55d6e5a653e362b5'/>
<id>urn:sha1:0a4954a850b0c4d0a5d18b1a55d6e5a653e362b5</id>
<content type='text'>
A percpu_counter's accuracy is related to its batch size.  For a
percpu_counter with a big batch, the deviation can be big, so when the
counter's batch is changed at runtime to a smaller value for better
accuracy, there can also be a requirement to reduce the accumulated
deviation.

So add a percpu-counter sync function to be run on each CPU.
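
A sketch of the helper: fold the calling CPU's local delta into the
global count under the lock (callers are expected to run it on each
CPU, e.g. via on_each_cpu()).

    void percpu_counter_sync(struct percpu_counter *fbc)
    {
        unsigned long flags;
        s64 count;

        raw_spin_lock_irqsave(&amp;fbc-&gt;lock, flags);
        count = __this_cpu_read(*fbc-&gt;counters);
        fbc-&gt;count += count;
        __this_cpu_sub(*fbc-&gt;counters, count);
        raw_spin_unlock_irqrestore(&amp;fbc-&gt;lock, flags);
    }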

Reported-by: kernel test robot &lt;rong.a.chen@intel.com&gt;
Signed-off-by: Feng Tang &lt;feng.tang@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Dennis Zhou &lt;dennis@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Qian Cai &lt;cai@lca.pw&gt;
Cc: Andi Kleen &lt;andi.kleen@intel.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: Haiyang Zhang &lt;haiyangz@microsoft.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: "K. Y. Srinivasan" &lt;kys@microsoft.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Tim Chen &lt;tim.c.chen@intel.com&gt;
Link: http://lkml.kernel.org/r/1594389708-60781-4-git-send-email-feng.tang@intel.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>percpu_counter: fix a data race at vm_committed_as</title>
<updated>2020-04-07T17:43:43+00:00</updated>
<author>
<name>Qian Cai</name>
<email>cai@lca.pw</email>
</author>
<published>2020-04-07T03:10:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7e2345200262e4a6056580f0231cccdaffc825f3'/>
<id>urn:sha1:7e2345200262e4a6056580f0231cccdaffc825f3</id>
<content type='text'>
"vm_committed_as.count" could be accessed concurrently as reported by
KCSAN,

 BUG: KCSAN: data-race in __vm_enough_memory / percpu_counter_add_batch

 write to 0xffffffff9451c538 of 8 bytes by task 65879 on cpu 35:
  percpu_counter_add_batch+0x83/0xd0
  percpu_counter_add_batch at lib/percpu_counter.c:91
  __vm_enough_memory+0xb9/0x260
  dup_mm+0x3a4/0x8f0
  copy_process+0x2458/0x3240
  _do_fork+0xaa/0x9f0
  __do_sys_clone+0x125/0x160
  __x64_sys_clone+0x70/0x90
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 read to 0xffffffff9451c538 of 8 bytes by task 66773 on cpu 19:
  __vm_enough_memory+0x199/0x260
  percpu_counter_read_positive at include/linux/percpu_counter.h:81
  (inlined by) __vm_enough_memory at mm/util.c:839
  mmap_region+0x1b2/0xa10
  do_mmap+0x45c/0x700
  vm_mmap_pgoff+0xc0/0x130
  ksys_mmap_pgoff+0x6e/0x300
  __x64_sys_mmap+0x33/0x40
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The read is outside the percpu_counter::lock critical section, which
results in a data race.  Fix it by adding a READ_ONCE() in
percpu_counter_read_positive(), which can also serve as the existing
compiler memory barrier.
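
The fix, in sketch form:

    static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
    {
        /* prevent reloads of fbc-&gt;count */
        s64 ret = READ_ONCE(fbc-&gt;count);

        if (ret &gt;= 0)
            return ret;
        return 0;
    }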

Signed-off-by: Qian Cai &lt;cai@lca.pw&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Marco Elver &lt;elver@google.com&gt;
Link: http://lkml.kernel.org/r/1582302724-2804-1-git-send-email-cai@lca.pw
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
