<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/sched, branch linux-6.5.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-6.5.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-6.5.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2023-09-23T09:14:17+00:00</updated>
<entry>
<title>kernel/fork: beware of __put_task_struct() calling context</title>
<updated>2023-09-23T09:14:17+00:00</updated>
<author>
<name>Wander Lairson Costa</name>
<email>wander@redhat.com</email>
</author>
<published>2023-06-14T12:23:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4cb7b22a2e1bec8c94573b00c2726842a1eda403'/>
<id>urn:sha1:4cb7b22a2e1bec8c94573b00c2726842a1eda403</id>
<content type='text'>
[ Upstream commit d243b34459cea30cfe5f3a9b2feb44e7daff9938 ]

Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
locks. Therefore, it can't be called from an non-preemptible context.

One practical example is splat inside inactive_task_timer(), which is
called in a interrupt context:

  CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
   Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
   Call Trace:
   dump_stack_lvl+0x57/0x7d
   mark_lock_irq.cold+0x33/0xba
   mark_lock+0x1e7/0x400
   mark_usage+0x11d/0x140
   __lock_acquire+0x30d/0x930
   lock_acquire.part.0+0x9c/0x210
   rt_spin_lock+0x27/0xe0
   refill_obj_stock+0x3d/0x3a0
   kmem_cache_free+0x357/0x560
   inactive_task_timer+0x1ad/0x340
   __run_hrtimer+0x8a/0x1a0
   __hrtimer_run_queues+0x91/0x130
   hrtimer_interrupt+0x10f/0x220
   __sysvec_apic_timer_interrupt+0x7b/0xd0
   sysvec_apic_timer_interrupt+0x4f/0xd0
   asm_sysvec_apic_timer_interrupt+0x12/0x20
   RIP: 0033:0x7fff196bf6f5

Instead of calling __put_task_struct() directly, we defer it using
call_rcu(). A more natural approach would use a workqueue, but since
in PREEMPT_RT, we can't allocate dynamic memory from atomic context,
the code would become more complex because we would need to put the
work_struct instance in the task_struct and initialize it when we
allocate a new task_struct.

The issue is reproducible with stress-ng:

  while true; do
      stress-ng --sched deadline --sched-period 1000000000 \
	      --sched-runtime 800000000 --sched-deadline \
	      1000000000 --mmapfork 23 -t 20
  done

Reported-by: Hu Chunyu &lt;chuhu@redhat.com&gt;
Suggested-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Suggested-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Suggested-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Wander Lairson Costa &lt;wander@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20230614122323.37957-2-wander@redhat.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'core_guards_for_6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue</title>
<updated>2023-07-04T20:50:38+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-07-04T20:50:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=04f2933d375e3f90d4435b7b518d3065afd1fa25'/>
<id>urn:sha1:04f2933d375e3f90d4435b7b518d3065afd1fa25</id>
<content type='text'>
Pull scope-based resource management infrastructure from Peter Zijlstra:
 "These are the first few patches in the Scope-based Resource Management
  series that introduce the infrastructure but not any conversions as of
  yet.

  Adding the infrastructure now allows multiple people to start using
  them.

  Of note is that Sparse will need some work since it doesn't yet
  understand this attribute and might have decl-after-stmt issues"

* tag 'core_guards_for_6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue:
  kbuild: Drop -Wdeclaration-after-statement
  locking: Introduce __cleanup() based infrastructure
  apparmor: Free up __cleanup() name
  dmaengine: ioat: Free up __cleanup() name
</content>
</entry>
<entry>
<title>Merge tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2023-06-27T21:03:21+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-06-27T21:03:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ed3b7923a816ded62dccef377c9ee346c7d3b1b4'/>
<id>urn:sha1:ed3b7923a816ded62dccef377c9ee346c7d3b1b4</id>
<content type='text'>
Pull scheduler updates from Ingo Molnar:
 "Scheduler SMP load-balancer improvements:

   - Avoid unnecessary migrations within SMT domains on hybrid systems.

     Problem:

        On hybrid CPU systems, (processors with a mixture of
        higher-frequency SMT cores and lower-frequency non-SMT cores),
        under the old code lower-priority CPUs pulled tasks from the
        higher-priority cores if more than one SMT sibling was busy -
        resulting in many unnecessary task migrations.

     Solution:

        The new code improves the load balancer to recognize SMT cores
        with more than one busy sibling and allows lower-priority CPUs
        to pull tasks, which avoids superfluous migrations and lets
        lower-priority cores inspect all SMT siblings for the busiest
        queue.

   - Implement the 'runnable boosting' feature in the EAS balancer:
     consider CPU contention in frequency, EAS max util &amp; load-balance
     busiest CPU selection.

     This improves CPU utilization for certain workloads, while leaves
     other key workloads unchanged.

  Scheduler infrastructure improvements:

   - Rewrite the scheduler topology setup code by consolidating it into
     the build_sched_topology() helper function and building it
     dynamically on the fly.

   - Resolve the local_clock() vs. noinstr complications by rewriting
     the code: provide separate sched_clock_noinstr() and
     local_clock_noinstr() functions to be used in instrumentation code,
     and make sure it is all instrumentation-safe.

  Fixes:

   - Fix a kthread_park() race with wait_woken()

   - Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
       - Fix UP PREEMPT bug by unifying the SMP and UP implementations
       - Fix task_struct::saved_state handling

   - Fix various rq clock update bugs, unearthed by turning on the rq
     clock debugging code.

   - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger
     by creating enough cgroups, by removing the warnign and restricting
     window size triggers to PSI file write-permission or
     CAP_SYS_RESOURCE.

   - Propagate SMT flags in the topology when removing degenerate domain

   - Fix grub_reclaim() calculation bug in the deadline scheduler code

   - Avoid resetting the min update period when it is unnecessary, in
     psi_trigger_destroy().

   - Don't balance a task to its current running CPU in load_balance(),
     which was possible on certain NUMA topologies with overlapping
     groups.

   - Fix the sched-debug printing of rq-&gt;nr_uninterruptible

  Cleanups:

   - Address various -Wmissing-prototype warnings, as a preparation to
     (maybe) enable this warning in the future.

   - Remove unused code

   - Mark more functions __init

   - Fix shadow-variable warnings"

* tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()
  sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
  sched/core: Fixed missing rq clock update before calling set_rq_offline()
  sched/deadline: Update GRUB description in the documentation
  sched/deadline: Fix bandwidth reclaim equation in GRUB
  sched/wait: Fix a kthread_park race with wait_woken()
  sched/topology: Mark set_sched_topology() __init
  sched/fair: Rename variable cpu_util eff_util
  arm64/arch_timer: Fix MMIO byteswap
  sched/fair, cpufreq: Introduce 'runnable boosting'
  sched/fair: Refactor CPU utilization functions
  cpuidle: Use local_clock_noinstr()
  sched/clock: Provide local_clock_noinstr()
  x86/tsc: Provide sched_clock_noinstr()
  clocksource: hyper-v: Provide noinstr sched_clock()
  clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
  x86/vdso: Fix gettimeofday masking
  math64: Always inline u128 version of mul_u64_u64_shr()
  s390/time: Provide sched_clock_noinstr()
  loongarch: Provide noinstr sched_clock_read()
  ...
</content>
</entry>
<entry>
<title>Merge tag 'timers-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2023-06-26T21:10:45+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-06-26T21:10:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cd336f6562d3d7646a9cf071b902db200a1dd77b'/>
<id>urn:sha1:cd336f6562d3d7646a9cf071b902db200a1dd77b</id>
<content type='text'>
Pull timer updates from Thomas Gleixner:
 "Time, timekeeping and related device driver updates:

  Core:

   - A set of fixes, cleanups and enhancements to the posix timer code:

       - Prevent another possible live lock scenario in the exit() path,
         which affects POSIX_CPU_TIMERS_TASK_WORK enabled architectures.

       - Fix a loop termination issue which was reported syzcaller/KSAN
         in the posix timer ID allocation code.

         That triggered a deeper look into the posix-timer code which
         unearthed more small issues.

       - Add missing READ/WRITE_ONCE() annotations

       - Fix or remove completely outdated comments

       - Document places which are subtle and completely undocumented.

   - Add missing hrtimer modes to the trace event decoder

   - Small cleanups and enhancements all over the place

  Drivers:

   - Rework the Hyper-V clocksource and sched clock setup code

   - Remove a deprecated clocksource driver

   - Small fixes and enhancements all over the place"

* tag 'timers-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  clocksource/drivers/cadence-ttc: Fix memory leak in ttc_timer_probe
  dt-bindings: timers: Add Ralink SoCs timer
  clocksource/drivers/hyper-v: Rework clocksource and sched clock setup
  dt-bindings: timer: brcm,kona-timer: convert to YAML
  clocksource/drivers/imx-gpt: Fold &lt;soc/imx/timer.h&gt; into its only user
  clk: imx: Drop inclusion of unused header &lt;soc/imx/timer.h&gt;
  hrtimer: Add missing sparse annotations to hrtimer locking
  clocksource/drivers/imx-gpt: Use only a single name for functions
  clocksource/drivers/loongson1: Move PWM timer to clocksource framework
  dt-bindings: timer: Add Loongson-1 clocksource
  MIPS: Loongson32: Remove deprecated PWM timer clocksource
  clocksource/drivers/ingenic-timer: Use pm_sleep_ptr() macro
  tracing/timer: Add missing hrtimer modes to decode_hrtimer_mode().
  posix-timers: Add sys_ni_posix_timers() prototype
  tick/rcu: Fix bogus ratelimit condition
  alarmtimer: Remove unnecessary (void *) cast
  alarmtimer: Remove unnecessary initialization of variable 'ret'
  posix-timers: Refer properly to CONFIG_HIGH_RES_TIMERS
  posix-timers: Polish coding style in a few places
  posix-timers: Remove pointless comments
  ...
</content>
</entry>
<entry>
<title>locking: Introduce __cleanup() based infrastructure</title>
<updated>2023-06-26T09:14:18+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-26T10:23:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=54da6a0924311c7cf5015533991e44fb8eb12773'/>
<id>urn:sha1:54da6a0924311c7cf5015533991e44fb8eb12773</id>
<content type='text'>
Use __attribute__((__cleanup__(func))) to build:

 - simple auto-release pointers using __free()

 - 'classes' with constructor and destructor semantics for
   scope-based resource management.

 - lock guards based on the above classes.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20230612093537.614161713%40infradead.org
</content>
</entry>
<entry>
<title>posix-timers: Ensure timer ID search-loop limit is valid</title>
<updated>2023-06-18T20:41:48+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2023-06-01T18:58:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8ce8849dd1e78dadcee0ec9acbd259d239b7069f'/>
<id>urn:sha1:8ce8849dd1e78dadcee0ec9acbd259d239b7069f</id>
<content type='text'>
posix_timer_add() tries to allocate a posix timer ID by starting from the
cached ID which was stored by the last successful allocation.

This is done in a loop searching the ID space for a free slot one by
one. The loop has to terminate when the search wrapped around to the
starting point.

But that's racy vs. establishing the starting point. That is read out
lockless, which leads to the following problem:

CPU0	  	      	     	   CPU1
posix_timer_add()
  start = sig-&gt;posix_timer_id;
  lock(hash_lock);
  ...				   posix_timer_add()
  if (++sig-&gt;posix_timer_id &lt; 0)
      			             start = sig-&gt;posix_timer_id;
     sig-&gt;posix_timer_id = 0;

So CPU1 can observe a negative start value, i.e. -1, and the loop break
never happens because the condition can never be true:

  if (sig-&gt;posix_timer_id == start)
     break;

While this is unlikely to ever turn into an endless loop as the ID space is
huge (INT_MAX), the racy read of the start value caught the attention of
KCSAN and Dmitry unearthed that incorrectness.

Rewrite it so that all id operations are under the hash lock.

Reported-by: syzbot+5c54bd3eb218bb595aa9@syzkaller.appspotmail.com
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Link: https://lore.kernel.org/r/87bkhzdn6g.ffs@tglx

</content>
</entry>
<entry>
<title>sched/topology: Mark set_sched_topology() __init</title>
<updated>2023-06-16T15:08:01+00:00</updated>
<author>
<name>Miaohe Lin</name>
<email>linmiaohe@huawei.com</email>
</author>
<published>2023-06-03T07:36:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0cce0fde499a92c726cd2e24f7763644f7c9f971'/>
<id>urn:sha1:0cce0fde499a92c726cd2e24f7763644f7c9f971</id>
<content type='text'>
All callers of set_sched_topology() are within __init section. Mark
it __init too.

Signed-off-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Link: https://lore.kernel.org/r/20230603073645.1173332-1-linmiaohe@huawei.com
</content>
</entry>
<entry>
<title>sched/clock: Provide local_clock_noinstr()</title>
<updated>2023-06-05T19:11:09+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-19T10:21:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fb7d4948c4da2dbd26da4b7ec76bbd2f19ff862a'/>
<id>urn:sha1:fb7d4948c4da2dbd26da4b7ec76bbd2f19ff862a</id>
<content type='text'>
Now that all ARCH_WANTS_NO_INSTR architectures (arm64, loongarch,
s390, x86) provide sched_clock_noinstr(), use this to provide
local_clock_noinstr().

This local_clock_noinstr() will be safe to use from noinstr code with
the assumption that any such noinstr code is non-preemptible (it had
better be, entry code will have IRQs disabled while __cpuidle must
have preemption disabled).

Specifically, preempt_enable_notrace(), a common part of many a
sched_clock() implementation calls out to schedule() -- even though,
per the above, it will never trigger -- which frustrates noinstr
validation.

  vmlinux.o: warning: objtool: local_clock+0xb5: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Michael Kelley &lt;mikelley@microsoft.com&gt;  # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.978624636@infradead.org
</content>
</entry>
<entry>
<title>fork, vhost: Use CLONE_THREAD to fix freezer/ps regression</title>
<updated>2023-06-01T21:15:33+00:00</updated>
<author>
<name>Mike Christie</name>
<email>michael.christie@oracle.com</email>
</author>
<published>2023-06-01T18:32:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f9010dbdce911ee1f1af1398a24b1f9f992e0080'/>
<id>urn:sha1:f9010dbdce911ee1f1af1398a24b1f9f992e0080</id>
<content type='text'>
When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks's now show up as processes so scripts doing
ps or ps a would not incorrectly detect the vhost task as another
process.  2. kthreads disabled freeze by setting PF_NOFREEZE, but
vhost tasks's didn't disable or add support for them.

To fix both bugs, this switches the vhost task to be thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be supported.

This is a modified version of the patch written by Mike Christie
&lt;michael.christie@oracle.com&gt; which was a modified version of patch
originally written by Linus.

Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
Including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.

Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c.  Making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn.  vhost_worker now returns true if work was done.

The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit.  This collection clears
__fatal_signal_pending.  This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.

For now the vhost thread continues to exist and run work until the
last file descriptor is closed and the release function is called as
part of freeing struct file.  To avoid hangs in the coredump
rendezvous and when killing threads in a multi-threaded exec.  The
coredump code and de_thread have been modified to ignore vhost threads.

Remvoing the special case for exec appears to require teaching
vhost_dev_flush how to directly complete transactions in case
the vhost thread is no longer running.

Removing the special case for coredump rendezvous requires either the
above fix needed for exec or moving the coredump rendezvous into
get_signal.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Co-developed-by: Mike Christie &lt;michael.christie@oracle.com&gt;
Signed-off-by: Mike Christie &lt;michael.christie@oracle.com&gt;
Acked-by: Michael S. Tsirkin &lt;mst@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched/topology: Remove SHARED_CHILD from ASYM_PACKING</title>
<updated>2023-05-08T08:58:37+00:00</updated>
<author>
<name>Ricardo Neri</name>
<email>ricardo.neri-calderon@linux.intel.com</email>
</author>
<published>2023-04-06T20:31:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ca528cc501896a808dc79c3c0544369d23b331c8'/>
<id>urn:sha1:ca528cc501896a808dc79c3c0544369d23b331c8</id>
<content type='text'>
Only x86 and Power7 use ASYM_PACKING. They use it differently.

Power7 has cores of equal priority, but the SMT siblings of a core have
different priorities. Parent scheduling domains do not need (nor have) the
ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
cause the topology debug code to complain.

X86 has cores of different priority, but all the SMT siblings of the core
have equal priority. It needs ASYM_PACKING at the MC level, but not at the
SMT level (it also needs it at upper levels if they have scheduling groups
of different priority). Removing ASYM_PACKING from the SMT domain causes
the topology debug code to complain.

Remove SHARED_CHILD for now. We still need a topology check that satisfies
both architectures.

Suggested-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Signed-off-by: Ricardo Neri &lt;ricardo.neri-calderon@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Zhang Rui &lt;rui.zhang@intel.com&gt;
Link: https://lore.kernel.org/r/20230406203148.19182-10-ricardo.neri-calderon@linux.intel.com
</content>
</entry>
</feed>
