<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/sched.h, branch v6.4-rc6</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.4-rc6</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.4-rc6'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2023-04-28T22:57:53+00:00</updated>
<entry>
<title>Merge tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace</title>
<updated>2023-04-28T22:57:53+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-04-28T22:57:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d579c468d7ad6e37f5b4290b0244a9a5a7d3c4bf'/>
<id>urn:sha1:d579c468d7ad6e37f5b4290b0244a9a5a7d3c4bf</id>
<content type='text'>
Pull tracing updates from Steven Rostedt:

 - User events are finally ready!

   After lots of collaboration between various parties, we finally
   locked down on a stable interface for user events that can also work
   with user space only tracing.

   This is implemented by telling the kernel (or user space library, but
   that part is user space only and not part of this patch set), where
   the variable is that the application uses to know if something is
   listening to the trace.

   There's also an interface to tell the kernel about these events,
   which will show up in the /sys/kernel/tracing/events/user_events/
   directory, where it can be enabled.

   When it's enabled, the kernel will update the variable, to tell the
   application to start writing to the kernel.

   See https://lwn.net/Articles/927595/

 - Cleaned up the direct trampolines code to simplify arm64 addition of
   direct trampolines.

   Direct trampolines use the ftrace interface but instead of jumping to
   the ftrace trampoline, applications (mostly BPF) can register their
   own trampoline for performance reasons.

 - Some updates to the fprobe infrastructure. fprobes are more efficient
   than kprobes, as it does not need to save all the registers that
   kprobes on ftrace do. More work needs to be done before the fprobes
   will be exposed as dynamic events.

 - More updates to references to the obsolete path of
   /sys/kernel/debug/tracing for the new /sys/kernel/tracing path.

 - Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer
   line by line instead of all at once.

   There are users in production kernels that have a large data dump
   that originally used printk() directly, but the data dump was larger
   than what printk() allowed as a single print.

   Using seq_buf() to do the printing fixes that.

 - Add /sys/kernel/tracing/touched_functions that shows all functions
   that was every traced by ftrace or a direct trampoline. This is used
   for debugging issues where a traced function could have caused a
   crash by a bpf program or live patching.

 - Add a "fields" option that is similar to "raw" but outputs the fields
   of the events. It's easier to read by humans.

 - Some minor fixes and clean ups.

* tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (41 commits)
  ring-buffer: Sync IRQ works before buffer destruction
  tracing: Add missing spaces in trace_print_hex_seq()
  ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus
  recordmcount: Fix memory leaks in the uwrite function
  tracing/user_events: Limit max fault-in attempts
  tracing/user_events: Prevent same address and bit per process
  tracing/user_events: Ensure bit is cleared on unregister
  tracing/user_events: Ensure write index cannot be negative
  seq_buf: Add seq_buf_do_printk() helper
  tracing: Fix print_fields() for __dyn_loc/__rel_loc
  tracing/user_events: Set event filter_type from type
  ring-buffer: Clearly check null ptr returned by rb_set_head_page()
  tracing: Unbreak user events
  tracing/user_events: Use print_format_fields() for trace output
  tracing/user_events: Align structs with tabs for readability
  tracing/user_events: Limit global user_event count
  tracing/user_events: Charge event allocs to cgroups
  tracing/user_events: Update documentation for ABI
  tracing/user_events: Use write ABI in example
  tracing/user_events: Add ABI self-test
  ...
</content>
</entry>
<entry>
<title>Merge tag 'sched-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2023-04-28T21:53:30+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-04-28T21:53:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=586b222d748e91c619d68e9239654ebc7fed9b0c'/>
<id>urn:sha1:586b222d748e91c619d68e9239654ebc7fed9b0c</id>
<content type='text'>
Pull scheduler updates from Ingo Molnar:

 - Allow unprivileged PSI poll()ing

 - Fix performance regression introduced by mm_cid

 - Improve livepatch stalls by adding livepatch task switching to
   cond_resched(). This resolves livepatching busy-loop stalls with
   certain CPU-bound kthreads

 - Improve sched_move_task() performance on autogroup configs

 - On core-scheduling CPUs, avoid selecting throttled tasks to run

 - Misc cleanups, fixes and improvements

* tag 'sched-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/clock: Fix local_clock() before sched_clock_init()
  sched/rt: Fix bad task migration for rt tasks
  sched: Fix performance regression introduced by mm_cid
  sched/core: Make sched_dynamic_mutex static
  sched/psi: Allow unprivileged polling of N*2s period
  sched/psi: Extract update_triggers side effect
  sched/psi: Rename existing poll members in preparation
  sched/psi: Rearrange polling code in preparation
  sched/fair: Fix inaccurate tally of ttwu_move_affine
  vhost: Fix livepatch timeouts in vhost_worker()
  livepatch,sched: Add livepatch task switching to cond_resched()
  livepatch: Skip task_call_func() for current task
  livepatch: Convert stack entries array to percpu
  sched: Interleave cfs bandwidth timers for improved single thread performance at low utilization
  sched/core: Reduce cost of sched_move_task when config autogroup
  sched/core: Avoid selecting the task that is throttled to run when core-sched enable
  sched/topology: Make sched_energy_mutex,update static
</content>
</entry>
<entry>
<title>Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next</title>
<updated>2023-04-26T23:07:23+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-04-26T23:07:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6e98b09da931a00bf4e0477d0fa52748bf28fcce'/>
<id>urn:sha1:6e98b09da931a00bf4e0477d0fa52748bf28fcce</id>
<content type='text'>
Pull networking updates from Paolo Abeni:
 "Core:

   - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
     default value allows for better BIG TCP performances

   - Reduce compound page head access for zero-copy data transfers

   - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
     possible

   - Threaded NAPI improvements, adding defer skb free support and
     unneeded softirq avoidance

   - Address dst_entry reference count scalability issues, via false
     sharing avoidance and optimize refcount tracking

   - Add lockless accesses annotation to sk_err[_soft]

   - Optimize again the skb struct layout

   - Extends the skb drop reasons to make it usable by multiple
     subsystems

   - Better const qualifier awareness for socket casts

  BPF:

   - Add skb and XDP typed dynptrs which allow BPF programs for more
     ergonomic and less brittle iteration through data and
     variable-sized accesses

   - Add a new BPF netfilter program type and minimal support to hook
     BPF programs to netfilter hooks such as prerouting or forward

   - Add more precise memory usage reporting for all BPF map types

   - Adds support for using {FOU,GUE} encap with an ipip device
     operating in collect_md mode and add a set of BPF kfuncs for
     controlling encap params

   - Allow BPF programs to detect at load time whether a particular
     kfunc exists or not, and also add support for this in light
     skeleton

   - Bigger batch of BPF verifier improvements to prepare for upcoming
     BPF open-coded iterators allowing for less restrictive looping
     capabilities

   - Rework RCU enforcement in the verifier, add kptr_rcu and enforce
     BPF programs to NULL-check before passing such pointers into kfunc

   - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
     in local storage maps

   - Enable RCU semantics for task BPF kptrs and allow referenced kptr
     tasks to be stored in BPF maps

   - Add support for refcounted local kptrs to the verifier for allowing
     shared ownership, useful for adding a node to both the BPF list and
     rbtree

   - Add BPF verifier support for ST instructions in
     convert_ctx_access() which will help new -mcpu=v4 clang flag to
     start emitting them

   - Add ARM32 USDT support to libbpf

   - Improve bpftool's visual program dump which produces the control
     flow graph in a DOT format by adding C source inline annotations

  Protocols:

   - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
     indicates the provenance of the IP address

   - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition

   - Add the handshake upcall mechanism, allowing the user-space to
     implement generic TLS handshake on kernel's behalf

   - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
     resilience to nodes failures

   - SCTP: add support for Fair Capacity and Weighted Fair Queueing
     schedulers

   - MPTCP: delay first subflow allocation up to its first usage. This
     will allow for later better LSM interaction

   - xfrm: Remove inner/outer modes from input/output path. These are
     not needed anymore

   - WiFi:
      - reduced neighbor report (RNR) handling for AP mode
      - HW timestamping support
      - support for randomized auth/deauth TA for PASN privacy
      - per-link debugfs for multi-link
      - TC offload support for mac80211 drivers
      - mac80211 mesh fast-xmit and fast-rx support
      - enable Wi-Fi 7 (EHT) mesh support

  Netfilter:

   - Add nf_tables 'brouting' support, to force a packet to be routed
     instead of being bridged

   - Update bridge netfilter and ovs conntrack helpers to handle IPv6
     Jumbo packets properly, i.e. fetch the packet length from
     hop-by-hop extension header. This is needed for BIT TCP support

   - The iptables 32bit compat interface isn't compiled in by default
     anymore

   - Move ip(6)tables builtin icmp matches to the udptcp one. This has
     the advantage that icmp/icmpv6 match doesn't load the
     iptables/ip6tables modules anymore when iptables-nft is used

   - Extended netlink error report for netdevice in flowtables and
     netdev/chains. Allow for incrementally add/delete devices to netdev
     basechain. Allow to create netdev chain without device

  Driver API:

   - Remove redundant Device Control Error Reporting Enable, as PCI core
     has already error reporting enabled at enumeration time

   - Move Multicast DB netlink handlers to core, allowing devices other
     then bridge to use them

   - Allow the page_pool to directly recycle the pages from safely
     localized NAPI

   - Implement lockless TX queue stop/wake combo macros, allowing for
     further code de-duplication and sanitization

   - Add YNL support for user headers and struct attrs

   - Add partial YNL specification for devlink

   - Add partial YNL specification for ethtool

   - Add tc-mqprio and tc-taprio support for preemptible traffic classes

   - Add tx push buf len param to ethtool, specifies the maximum number
     of bytes of a transmitted packet a driver can push directly to the
     underlying device

   - Add basic LED support for switch/phy

   - Add NAPI documentation, stop relaying on external links

   - Convert dsa_master_ioctl() to netdev notifier. This is a
     preparatory work to make the hardware timestamping layer selectable
     by user space

   - Add transceiver support and improve the error messages for CAN-FD
     controllers

  New hardware / drivers:

   - Ethernet:
      - AMD/Pensando core device support
      - MediaTek MT7981 SoC
      - MediaTek MT7988 SoC
      - Broadcom BCM53134 embedded switch
      - Texas Instruments CPSW9G ethernet switch
      - Qualcomm EMAC3 DWMAC ethernet
      - StarFive JH7110 SoC
      - NXP CBTX ethernet PHY

   - WiFi:
      - Apple M1 Pro/Max devices
      - RealTek rtl8710bu/rtl8188gu
      - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset

   - Bluetooth:
      - Realtek RTL8821CS, RTL8851B, RTL8852BS
      - Mediatek MT7663, MT7922
      - NXP w8997
      - Actions Semi ATS2851
      - QTI WCN6855
      - Marvell 88W8997

   - Can:
      - STMicroelectronics bxcan stm32f429

  Drivers:

   - Ethernet NICs:
      - Intel (1G, icg):
         - add tracking and reporting of QBV config errors
         - add support for configuring max SDU for each Tx queue
      - Intel (100G, ice):
         - refactor mailbox overflow detection to support Scalable IOV
         - GNSS interface optimization
      - Intel (i40e):
         - support XDP multi-buffer
      - nVidia/Mellanox:
         - add the support for linux bridge multicast offload
         - enable TC offload for egress and engress MACVLAN over bond
         - add support for VxLAN GBP encap/decap flows offload
         - extend packet offload to fully support libreswan
         - support tunnel mode in mlx5 IPsec packet offload
         - extend XDP multi-buffer support
         - support MACsec VLAN offload
         - add support for dynamic msix vectors allocation
         - drop RX page_cache and fully use page_pool
         - implement thermal zone to report NIC temperature
      - Netronome/Corigine:
         - add support for multi-zone conntrack offload
      - Solarflare/Xilinx:
         - support offloading TC VLAN push/pop actions to the MAE
         - support TC decap rules
         - support unicast PTP

   - Other NICs:
      - Broadcom (bnxt): enforce software based freq adjustments only on
        shared PHC NIC
      - RealTek (r8169): refactor to addess ASPM issues during NAPI poll
      - Micrel (lan8841): add support for PTP_PF_PEROUT
      - Cadence (macb): enable PTP unicast
      - Engleder (tsnep): add XDP socket zero-copy support
      - virtio-net: implement exact header length guest feature
      - veth: add page_pool support for page recycling
      - vxlan: add MDB data path support
      - gve: add XDP support for GQI-QPL format
      - geneve: accept every ethertype
      - macvlan: allow some packets to bypass broadcast queue
      - mana: add support for jumbo frame

   - Ethernet high-speed switches:
      - Microchip (sparx5): Add support for TC flower templates

   - Ethernet embedded switches:
      - Broadcom (b54):
         - configure 6318 and 63268 RGMII ports
      - Marvell (mv88e6xxx):
         - faster C45 bus scan
      - Microchip:
         - lan966x:
            - add support for IS1 VCAP
            - better TX/RX from/to CPU performances
         - ksz9477: add ETS Qdisc support
         - ksz8: enhance static MAC table operations and error handling
         - sama7g5: add PTP capability
      - NXP (ocelot):
         - add support for external ports
         - add support for preemptible traffic classes
      - Texas Instruments:
         - add CPSWxG SGMII support for J7200 and J721E

   - Intel WiFi (iwlwifi):
      - preparation for Wi-Fi 7 EHT and multi-link support
      - EHT (Wi-Fi 7) sniffer support
      - hardware timestamping support for some devices/firwmares
      - TX beacon protection on newer hardware

   - Qualcomm 802.11ax WiFi (ath11k):
      - MU-MIMO parameters support
      - ack signal support for management packets

   - RealTek WiFi (rtw88):
      - SDIO bus support
      - better support for some SDIO devices (e.g. MAC address from
        efuse)

   - RealTek WiFi (rtw89):
      - HW scan support for 8852b
      - better support for 6 GHz scanning
      - support for various newer firmware APIs
      - framework firmware backwards compatibility

   - MediaTek WiFi (mt76):
      - P2P support
      - mesh A-MSDU support
      - EHT (Wi-Fi 7) support
      - coredump support"

* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
  net: phy: hide the PHYLIB_LEDS knob
  net: phy: marvell-88x2222: remove unnecessary (void*) conversions
  tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
  net: amd: Fix link leak when verifying config failed
  net: phy: marvell: Fix inconsistent indenting in led_blink_set
  lan966x: Don't use xdp_frame when action is XDP_TX
  tsnep: Add XDP socket zero-copy TX support
  tsnep: Add XDP socket zero-copy RX support
  tsnep: Move skb receive action to separate function
  tsnep: Add functions for queue enable/disable
  tsnep: Rework TX/RX queue initialization
  tsnep: Replace modulo operation with mask
  net: phy: dp83867: Add led_brightness_set support
  net: phy: Fix reading LED reg property
  drivers: nfc: nfcsim: remove return value check of `dev_dir`
  net: phy: dp83867: Remove unnecessary (void*) conversions
  net: ethtool: coalesce: try to make user settings stick twice
  net: mana: Check if netdev/napi_alloc_frag returns single page
  net: mana: Rename mana_refill_rxoob and remove some empty lines
  net: veth: add page_pool stats
  ...
</content>
</entry>
<entry>
<title>sched: Fix performance regression introduced by mm_cid</title>
<updated>2023-04-21T11:24:20+00:00</updated>
<author>
<name>Mathieu Desnoyers</name>
<email>mathieu.desnoyers@efficios.com</email>
</author>
<published>2023-04-20T14:55:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=223baf9d17f25e2608dbdff7232c095c1e612268'/>
<id>urn:sha1:223baf9d17f25e2608dbdff7232c095c1e612268</id>
<content type='text'>
Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
sysbench regression reported by Aaron Lu.

Keep track of the currently allocated mm_cid for each mm/cpu rather than
freeing them immediately on context switch. This eliminates most atomic
operations when context switching back and forth between threads
belonging to different memory spaces in multi-threaded scenarios (many
processes, each with many threads). The per-mm/per-cpu mm_cid values are
serialized by their respective runqueue locks.

Thread migration is handled by introducing invocation to
sched_mm_cid_migrate_to() (with destination runqueue lock held) in
activate_task() for migrating tasks. If the destination cpu's mm_cid is
unset, and if the source runqueue is not actively using its mm_cid, then
the source cpu's mm_cid is moved to the destination cpu on migration.

Introduce a task-work executed periodically, similarly to NUMA work,
which delays reclaim of cid values when they are unused for a period of
time.

Keep track of the allocation time for each per-cpu cid, and let the task
work clear them when they are observed to be older than
SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
mm_cids which are greater or equal to the Hamming weight of the mm
cidmask to keep concurrency ids compact.

Because we want to ensure the mm_cid converges towards the smaller
values as migrations happen, the prior optimization that was done when
context switching between threads belonging to the same mm is removed,
because it could delay the lazy release of the destination runqueue
mm_cid after it has been replaced by a migration. Removing this prior
optimization is not an issue performance-wise because the introduced
per-mm/per-cpu mm_cid tracking also covers this more specific case.

Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
Reported-by: Aaron Lu &lt;aaron.lu@intel.com&gt;
Signed-off-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Aaron Lu &lt;aaron.lu@intel.com&gt;
Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
</content>
</entry>
<entry>
<title>tracing/user_events: Track fork/exec/exit for mm lifetime</title>
<updated>2023-03-29T10:52:08+00:00</updated>
<author>
<name>Beau Belgrave</name>
<email>beaub@linux.microsoft.com</email>
</author>
<published>2023-03-28T23:52:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fd593511cdfc0b0e38af2eb21c99f5154a1d7acf'/>
<id>urn:sha1:fd593511cdfc0b0e38af2eb21c99f5154a1d7acf</id>
<content type='text'>
During tracefs discussions it was decided instead of requiring a mapping
within a user-process to track the lifetime of memory descriptors we
should hook the appropriate calls. Do this by adding the minimal stubs
required for task fork, exec, and exit. Currently this is just a NOP.
Future patches will implement these calls fully.

Link: https://lkml.kernel.org/r/20230328235219.203-3-beaub@linux.microsoft.com

Suggested-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Signed-off-by: Beau Belgrave &lt;beaub@linux.microsoft.com&gt;
Signed-off-by: Steven Rostedt (Google) &lt;rostedt@goodmis.org&gt;
</content>
</entry>
<entry>
<title>livepatch,sched: Add livepatch task switching to cond_resched()</title>
<updated>2023-03-22T16:09:28+00:00</updated>
<author>
<name>Josh Poimboeuf</name>
<email>jpoimboe@kernel.org</email>
</author>
<published>2023-02-24T16:50:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8'/>
<id>urn:sha1:e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8</id>
<content type='text'>
There have been reports [1][2] of live patches failing to complete
within a reasonable amount of time due to CPU-bound kthreads.

Fix it by patching tasks in cond_resched().

There are four different flavors of cond_resched(), depending on the
kernel configuration.  Hook into all of them.

A more elegant solution might be to use a preempt notifier.  However,
non-ORC unwinders can't unwind a preempted task reliably.

[1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
[2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org

Signed-off-by: Josh Poimboeuf &lt;jpoimboe@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Petr Mladek &lt;pmladek@suse.com&gt;
Tested-by: Seth Forshee (DigitalOcean) &lt;sforshee@kernel.org&gt;
Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org
</content>
</entry>
<entry>
<title>tasks: Extract rcu_users out of union</title>
<updated>2023-03-13T19:42:24+00:00</updated>
<author>
<name>David Vernet</name>
<email>void@manifault.com</email>
</author>
<published>2023-02-15T23:30:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=22df776a9a866713d9decfb92b633bcfdb571954'/>
<id>urn:sha1:22df776a9a866713d9decfb92b633bcfdb571954</id>
<content type='text'>
In commit 3fbd7ee285b2b ("tasks: Add a count of task RCU users"), a
count on the number of RCU users was added to struct task_struct. This
was done so as to enable the removal of task_rcu_dereference(), and
allow tasks to be protected by RCU even after exiting and being removed
from the runqueue. In this commit, the 'refcount_t rcu_users' field that
keeps track of this refcount was put into a union co-located with
'struct rcu_head rcu', so as to avoid taking up any extra space in
task_struct. This was possible to do safely, because the field was only
ever decremented by a static set of specific callers, and then never
incremented again.

While this restriction of there only being a small, static set of users
of this field has worked fine, it prevents us from leveraging the field
to use RCU to protect tasks in other contexts.

During tracing, for example, it would be useful to be able to collect
some tasks that performed a certain operation, put them in a map, and
then periodically summarize who they are, which cgroup they're in, how
much CPU time they've utilized, etc. While this can currently be done
with 'usage', it becomes tricky when a task is already in a map, or if a
reference should only be taken if a task is valid and will not soon be
reaped. Ideally, we could do something like pass a reference to a map
value, and then try to acquire a reference to the task in an RCU read
region by using refcount_inc_not_zero().

Similarly, in sched_ext, schedulers are using integer pids to remember
tasks, and then looking them up with find_task_by_pid_ns(). This is
slow, error prone, and adds complexity. It would be more convenient and
performant if BPF schedulers could instead store tasks directly in maps,
and then leverage RCU to ensure they can be safely accessed with low
overhead.

Finally, overloading fields like this is error prone. Someone that wants
to use 'rcu_users' could easily overlook the fact that once the rcu
callback is scheduled, the refcount will go back to being nonzero, thus
precluding the use of refcount_inc_not_zero(). Furthermore, as described
below, it's possible to extract the fields of the union without changing
the size of task_struct.

There are several possible ways to enable this:

1. The lightest touch approach is likely the one proposed in this patch,
   which is to simply extract 'rcu_users' and 'rcu' from the union, so
   that scheduling the 'rcu' callback doesn't overwrite the 'rcu_users'
   refcount. If we have a trusted task pointer, this would allow us to
   use refcnt_inc_not_zero() inside of an RCU region to determine if we
   can safely acquire a reference to the task and store it in a map. As
   mentioned below, this can be done without changing the size of
   task_struct, by moving the location of the union to another location
   that has padding gaps we can fill in.

2. Removing 'refcount_t rcu_users', and instead having the entire task
   be freed in an rcu callback. This is likely the most sound overall
   design, though it changes the behavioral semantics exposed to
   callers, who currently expect that a task that's successfully looked
   up in e.g. the pid_list with find_task_by_pid_ns(), can always have a
   'usage' reference acquired on them, as it's guaranteed to be &gt;
   0 until after the next gp. In order for this approach to work, we'd
   have to audit all callers. This approach also slightly changes
   behavior observed by user space by not invoking
   trace_sched_process_free() until the whole task_struct is actually being
   freed, rather than just after it's exited. It also may change
   timings, as memory will be freed in an RCU callback rather than
   immediately when the final 'usage' refcount drops to 0. This also is
   arguably a benefit, as it provides more predictable performance to
   callers who are refcounting tasks.

3. There may be other solutions as well that don't require changing the
   layout of task_struct. For example, we could possibly do something
   complex from the BPF side, such as listen for task exit and remove a
   task from a map when the task is exiting. This would likely require
   significant custom handling for task_struct in the verifier, so a
   more generalizable solution is likely warranted.

As mentioned above, this patch proposes the lightest-touch approach
which allows callers elsewhere in the kernel to use 'rcu_users' to
ensure the lifetime of a task, by extracting 'rcu_users' and 'rcu' from
the union. There is no size change in task_struct with this patch.

Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Signed-off-by: David Vernet &lt;void@manifault.com&gt;
Acked-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Link: https://lore.kernel.org/r/20230215233033.889644-1-void@manifault.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>fork/vm: Move common PF_IO_WORKER behavior to new flag</title>
<updated>2023-03-12T09:54:43+00:00</updated>
<author>
<name>Mike Christie</name>
<email>michael.christie@oracle.com</email>
</author>
<published>2023-03-10T22:03:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=54e6842d0775ba76db65cbe21311c3ca466e663d'/>
<id>urn:sha1:54e6842d0775ba76db65cbe21311c3ca466e663d</id>
<content type='text'>
This adds a new flag, PF_USER_WORKER, that's used for behavior common to
to both PF_IO_WORKER and users like vhost which will use a new helper
instead of create_io_thread because they require different behavior for
operations like signal handling.

The common behavior PF_USER_WORKER covers is the vm reclaim handling.

Signed-off-by: Mike Christie &lt;michael.christie@oracle.com&gt;
Acked-by: Michael S. Tsirkin &lt;mst@redhat.com&gt;
Signed-off-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2023-02-21T01:41:08+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-02-21T01:41:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1f2d9ffc7a5f916935749ffc6e93fb33bfe94d2f'/>
<id>urn:sha1:1f2d9ffc7a5f916935749ffc6e93fb33bfe94d2f</id>
<content type='text'>
Pull scheduler updates from Ingo Molnar:

 - Improve the scalability of the CFS bandwidth unthrottling logic with
   large number of CPUs.

 - Fix &amp; rework various cpuidle routines, simplify interaction with the
   generic scheduler code. Add __cpuidle methods as noinstr to objtool's
   noinstr detection and fix boatloads of cpuidle bugs &amp; quirks.

 - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
   previously issued registrations.

 - Limit scheduler slice duration to the sysctl_sched_latency period, to
   improve scheduling granularity with a large number of SCHED_IDLE
   tasks.

 - Debuggability enhancement on sys_exit(): warn about disabled IRQs,
   but also enable them to prevent a cascade of followup problems and
   repeat warnings.

 - Fix the rescheduling logic in prio_changed_dl().

 - Micro-optimize cpufreq and sched-util methods.

 - Micro-optimize ttwu_runnable()

 - Micro-optimize the idle-scanning in update_numa_stats(),
   select_idle_capacity() and steal_cookie_task().

 - Update the RSEQ code &amp; self-tests

 - Constify various scheduler methods

 - Remove unused methods

 - Refine __init tags

 - Documentation updates

 - Misc other cleanups, fixes

* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  sched/rt: pick_next_rt_entity(): check list_entry
  sched/deadline: Add more reschedule cases to prio_changed_dl()
  sched/fair: sanitize vruntime of entity being placed
  sched/fair: Remove capacity inversion detection
  sched/fair: unlink misfit task from cpu overutilized
  objtool: mem*() are not uaccess safe
  cpuidle: Fix poll_idle() noinstr annotation
  sched/clock: Make local_clock() noinstr
  sched/clock/x86: Mark sched_clock() noinstr
  x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
  x86/atomics: Always inline arch_atomic64*()
  cpuidle: tracing, preempt: Squash _rcuidle tracing
  cpuidle: tracing: Warn about !rcu_is_watching()
  cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
  cpuidle: drivers: firmware: psci: Dont instrument suspend code
  KVM: selftests: Fix build of rseq test
  exit: Detect and fix irq disabled state in oops
  cpuidle, arm64: Fix the ARM64 cpuidle logic
  cpuidle: mvebu: Fix duplicate flags assignment
  sched/fair: Limit sched slice duration
  ...
</content>
</entry>
<entry>
<title>blk-cgroup: store a gendisk to throttle in struct task_struct</title>
<updated>2023-02-03T15:20:05+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2023-02-03T15:03:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f05837ed73d0c73e950b2d9f2612febb0d3d451e'/>
<id>urn:sha1:f05837ed73d0c73e950b2d9f2612febb0d3d451e</id>
<content type='text'>
Switch from a request_queue pointer and reference to a gendisk once
for the throttle information in struct task_struct.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Andreas Herrmann &lt;aherrmann@suse.de&gt;
Link: https://lore.kernel.org/r/20230203150400.3199230-8-hch@lst.de
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
</feed>
