<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/kernfs, branch v6.6.131</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.131</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.131'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2025-09-19T14:32:05+00:00</updated>
<entry>
<title>kernfs: Fix UAF in polling when open file is released</title>
<updated>2025-09-19T14:32:05+00:00</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2025-08-22T07:07:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=854baafc00c433cccbe0ab4231b77aeb9b637b77'/>
<id>urn:sha1:854baafc00c433cccbe0ab4231b77aeb9b637b77</id>
<content type='text'>
commit 3c9ba2777d6c86025e1ba4186dc5cd930e40ec5f upstream.

A use-after-free (UAF) vulnerability was identified in the PSI (Pressure
Stall Information) monitoring mechanism:

BUG: KASAN: slab-use-after-free in psi_trigger_poll+0x3c/0x140
Read of size 8 at addr ffff3de3d50bd308 by task systemd/1

psi_trigger_poll+0x3c/0x140
cgroup_pressure_poll+0x70/0xa0
cgroup_file_poll+0x8c/0x100
kernfs_fop_poll+0x11c/0x1c0
ep_item_poll.isra.0+0x188/0x2c0

Allocated by task 1:
cgroup_file_open+0x88/0x388
kernfs_fop_open+0x73c/0xaf0
do_dentry_open+0x5fc/0x1200
vfs_open+0xa0/0x3f0
do_open+0x7e8/0xd08
path_openat+0x2fc/0x6b0
do_filp_open+0x174/0x368

Freed by task 8462:
cgroup_file_release+0x130/0x1f8
kernfs_drain_open_files+0x17c/0x440
kernfs_drain+0x2dc/0x360
kernfs_show+0x1b8/0x288
cgroup_file_show+0x150/0x268
cgroup_pressure_write+0x1dc/0x340
cgroup_file_write+0x274/0x548

Reproduction Steps:
1. Open test/cpu.pressure and establish epoll monitoring
2. Disable monitoring: echo 0 &gt; test/cgroup.pressure
3. Re-enable monitoring: echo 1 &gt; test/cgroup.pressure

The race condition occurs because:
1. When cgroup.pressure is disabled (echo 0 &gt; cgroup.pressure), it:
   - Releases PSI triggers via cgroup_file_release()
   - Frees of-&gt;priv through kernfs_drain_open_files()
2. While epoll still holds reference to the file and continues polling
3. Re-enabling (echo 1 &gt; cgroup.pressure) accesses freed of-&gt;priv

epolling			disable/enable cgroup.pressure
fd=open(cpu.pressure)
while(1)
...
epoll_wait
kernfs_fop_poll
kernfs_get_active = true	echo 0 &gt; cgroup.pressure
...				cgroup_file_show
				kernfs_show
				// inactive kn
				kernfs_drain_open_files
				cft-&gt;release(of);
				kfree(ctx);
				...
kernfs_get_active = false
				echo 1 &gt; cgroup.pressure
				kernfs_show
				kernfs_activate_one(kn);
kernfs_fop_poll
kernfs_get_active = true
cgroup_file_poll
psi_trigger_poll
// UAF
...
end: close(fd)

To address this issue, introduce kernfs_get_active_of() for kernfs open
files to obtain active references. This function will fail if the open file
has been released. Replace kernfs_get_active() with kernfs_get_active_of()
to prevent further operations on released file descriptors.

Fixes: 34f26a15611a ("sched/psi: Per-cgroup PSI accounting disable/re-enable interface")
Cc: stable &lt;stable@kernel.org&gt;
Reported-by: Zhang Zhaotian &lt;zhangzhaotian@huawei.com&gt;
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Link: https://lore.kernel.org/r/20250822070715.1565236-2-chenridong@huaweicloud.com
[ Drop llseek bits ]
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>kernfs: Relax constraint in draining guard</title>
<updated>2025-06-19T13:28:16+00:00</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2025-05-05T12:12:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6c81f1c7812c61f187bed1b938f1d2e391d503ab'/>
<id>urn:sha1:6c81f1c7812c61f187bed1b938f1d2e391d503ab</id>
<content type='text'>
[ Upstream commit 071d8e4c2a3b0999a9b822e2eb8854784a350f8a ]

The active reference lifecycle provides the break/unbreak mechanism but
the active reference is not truly active after unbreak -- callers don't
use it afterwards but it's important for proper pairing of kn-&gt;active
counting. Assuming this mechanism is in place, the WARN check in
kernfs_should_drain_open_files() is too sensitive -- it may transiently
catch those (rightful) callers between
kernfs_unbreak_active_protection() and kernfs_put_active() as found out by Chen
Ridong:

	kernfs_remove_by_name_ns	kernfs_get_active // active=1
	__kernfs_remove					  // active=0x80000002
	kernfs_drain			...
	wait_event
	//waiting (active == 0x80000001)
					kernfs_break_active_protection
					// active = 0x80000001
	// continue
					kernfs_unbreak_active_protection
					// active = 0x80000002
	...
	kernfs_should_drain_open_files
	// warning occurs
					kernfs_put_active

To avoid the false positives (mind panic_on_warn) remove the check altogether.
(This is meant as quick fix, I think active reference break/unbreak may be
simplified with larger rework.)

Fixes: bdb2fd7fc56e1 ("kernfs: Skip kernfs_drain_open_files() more aggressively")
Link: https://lore.kernel.org/r/kmmrseckjctb4gxcx2rdminrjnq2b4ipf7562nvfd432ld5v5m@2byj5eedkb2o/

Cc: Chen Ridong &lt;chenridong@huawei.com&gt;
Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Link: https://lore.kernel.org/r/20250505121201.879823-1-mkoutny@suse.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>kernfs: fix false-positive WARN(nr_mmapped) in kernfs_drain_open_files</title>
<updated>2024-08-29T15:33:33+00:00</updated>
<author>
<name>Neel Natu</name>
<email>neelnatu@google.com</email>
</author>
<published>2024-01-27T23:46:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1fb112cefadb23b42fb24fcbb46a9e59e122ccec'/>
<id>urn:sha1:1fb112cefadb23b42fb24fcbb46a9e59e122ccec</id>
<content type='text'>
[ Upstream commit 05d8f255867e3196565bb31a911a437697fab094 ]

Prior to this change 'on-&gt;nr_mmapped' tracked the total number of
mmaps across all of its associated open files via kernfs_fop_mmap().
Thus if the file descriptor associated with a kernfs_open_file was
mmapped 10 times then we would have: 'of-&gt;mmapped = true' and
'of_on(of)-&gt;nr_mmapped = 10'.

The problem is that closing or draining a 'of-&gt;mmapped' file would
only decrement one from the 'of_on(of)-&gt;nr_mmapped' counter.

For e.g. we have this from kernfs_unlink_open_file():
        if (of-&gt;mmapped)
                on-&gt;nr_mmapped--;

The WARN_ON_ONCE(on-&gt;nr_mmapped) in kernfs_drain_open_files() is
easy to reproduce by:
1. opening a (mmap-able) kernfs file.
2. mmap-ing that file more than once (mapping just once masks the issue).
3. trigger a drain of that kernfs file.

Modulo out-of-tree patches I was able to trigger this reliably by
identifying pci device nodes in sysfs that have resource regions
that are mmap-able and that don't have any driver attached to them
(steps 1 and 2). For step 3 we can "echo 1 &gt; remove" to trigger a
kernfs_drain.

Signed-off-by: Neel Natu &lt;neelnatu@google.com&gt;
Link: https://lore.kernel.org/r/20240127234636.609265-1-neelnatu@google.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()</title>
<updated>2024-08-03T06:53:21+00:00</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2023-12-12T21:17:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=aea95c68b745952faa93b8a2c1955a2c3aefe5dd'/>
<id>urn:sha1:aea95c68b745952faa93b8a2c1955a2c3aefe5dd</id>
<content type='text'>
[ Upstream commit ff6d413b0b59466e5acf2e42f294b1842ae130a1 ]

One of the last remaining users of strlcpy() in the kernel is
kernfs_path_from_node_locked(), which passes back the problematic "length
we _would_ have copied" return value to indicate truncation.  Convert the
chain of all callers to use the negative return value (some of which
already doing this explicitly). All callers were already also checking
for negative return values, so the risk to missed checks looks very low.

In this analysis, it was found that cgroup1_release_agent() actually
didn't handle the "too large" condition, so this is technically also a
bug fix. :)

Here's the chain of callers, and resolution identifying each one as now
handling the correct return value:

kernfs_path_from_node_locked()
        kernfs_path_from_node()
                pr_cont_kernfs_path()
                        returns void
                kernfs_path()
                        sysfs_warn_dup()
                                return value ignored
                        cgroup_path()
                                blkg_path()
                                        bfq_bic_update_cgroup()
                                                return value ignored
                                TRACE_IOCG_PATH()
                                        return value ignored
                                TRACE_CGROUP_PATH()
                                        return value ignored
                                perf_event_cgroup()
                                        return value ignored
                                task_group_path()
                                        return value ignored
                                damon_sysfs_memcg_path_eq()
                                        return value ignored
                                get_mm_memcg_path()
                                        return value ignored
                                lru_gen_seq_show()
                                        return value ignored
                        cgroup_path_from_kernfs_id()
                                return value ignored
                cgroup_show_path()
                        already converted "too large" error to negative value
                cgroup_path_ns_locked()
                        cgroup_path_ns()
                                bpf_iter_cgroup_show_fdinfo()
                                        return value ignored
                                cgroup1_release_agent()
                                        wasn't checking "too large" error
                        proc_cgroup_show()
                                already converted "too large" to negative value

Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Zefan Li &lt;lizefan.x@bytedance.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Waiman Long &lt;longman@redhat.com&gt;
Cc: &lt;cgroups@vger.kernel.org&gt;
Co-developed-by: Azeem Shaikh &lt;azeemshaikh38@gmail.com&gt;
Signed-off-by: Azeem Shaikh &lt;azeemshaikh38@gmail.com&gt;
Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Stable-dep-of: 1be59c97c83c ("cgroup/cpuset: Prevent UAF in proc_cpuset_show()")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>kernfs: RCU protect kernfs_nodes and avoid kernfs_idr_lock in kernfs_find_and_get_node_by_id()</title>
<updated>2024-04-13T11:07:38+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2024-01-09T21:48:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4b1f991bad566de14ae79bca952d201189cada52'/>
<id>urn:sha1:4b1f991bad566de14ae79bca952d201189cada52</id>
<content type='text'>
[ Upstream commit 4207b556e62f0a8915afc5da4c5d5ad915a253a5 ]

The BPF helper bpf_cgroup_from_id() calls kernfs_find_and_get_node_by_id()
which acquires kernfs_idr_lock, which is an non-raw non-IRQ-safe lock. This
can lead to deadlocks as bpf_cgroup_from_id() can be called from any BPF
programs including e.g. the ones that attach to functions which are holding
the scheduler rq lock.

Consider the following BPF program:

  SEC("fentry/__set_cpus_allowed_ptr_locked")
  int BPF_PROG(__set_cpus_allowed_ptr_locked, struct task_struct *p,
	       struct affinity_context *affn_ctx, struct rq *rq, struct rq_flags *rf)
  {
	  struct cgroup *cgrp = bpf_cgroup_from_id(p-&gt;cgroups-&gt;dfl_cgrp-&gt;kn-&gt;id);

	  if (cgrp) {
		  bpf_printk("%d[%s] in %s", p-&gt;pid, p-&gt;comm, cgrp-&gt;kn-&gt;name);
		  bpf_cgroup_release(cgrp);
	  }
	  return 0;
  }

__set_cpus_allowed_ptr_locked() is called with rq lock held and the above
BPF program calls bpf_cgroup_from_id() within leading to the following
lockdep warning:

  =====================================================
  WARNING: HARDIRQ-safe -&gt; HARDIRQ-unsafe lock order detected
  6.7.0-rc3-work-00053-g07124366a1d7-dirty #147 Not tainted
  -----------------------------------------------------
  repro/1620 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
  ffffffff833b3688 (kernfs_idr_lock){+.+.}-{2:2}, at: kernfs_find_and_get_node_by_id+0x1e/0x70

		and this task is already holding:
  ffff888237ced698 (&amp;rq-&gt;__lock){-.-.}-{2:2}, at: task_rq_lock+0x4e/0xf0
  which would create a new lock dependency:
   (&amp;rq-&gt;__lock){-.-.}-{2:2} -&gt; (kernfs_idr_lock){+.+.}-{2:2}
  ...
   Possible interrupt unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(kernfs_idr_lock);
				 local_irq_disable();
				 lock(&amp;rq-&gt;__lock);
				 lock(kernfs_idr_lock);
    &lt;Interrupt&gt;
      lock(&amp;rq-&gt;__lock);

		 *** DEADLOCK ***
  ...
  Call Trace:
   dump_stack_lvl+0x55/0x70
   dump_stack+0x10/0x20
   __lock_acquire+0x781/0x2a40
   lock_acquire+0xbf/0x1f0
   _raw_spin_lock+0x2f/0x40
   kernfs_find_and_get_node_by_id+0x1e/0x70
   cgroup_get_from_id+0x21/0x240
   bpf_cgroup_from_id+0xe/0x20
   bpf_prog_98652316e9337a5a___set_cpus_allowed_ptr_locked+0x96/0x11a
   bpf_trampoline_6442545632+0x4f/0x1000
   __set_cpus_allowed_ptr_locked+0x5/0x5a0
   sched_setaffinity+0x1b3/0x290
   __x64_sys_sched_setaffinity+0x4f/0x60
   do_syscall_64+0x40/0xe0
   entry_SYSCALL_64_after_hwframe+0x46/0x4e

Let's fix it by protecting kernfs_node and kernfs_root with RCU and making
kernfs_find_and_get_node_by_id() acquire rcu_read_lock() instead of
kernfs_idr_lock.

This adds an rcu_head to kernfs_node making it larger by 16 bytes on 64bit.
Combined with the preceding rearrange patch, the net increase is 8 bytes.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Andrea Righi &lt;andrea.righi@canonical.com&gt;
Cc: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;
Link: https://lore.kernel.org/r/20240109214828.252092-4-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs/kernfs/dir: obey S_ISGID</title>
<updated>2024-02-05T20:14:32+00:00</updated>
<author>
<name>Max Kellermann</name>
<email>max.kellermann@ionos.com</email>
</author>
<published>2023-12-08T09:33:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=952d0cbd1f68374203c8abd3fcee5f533caf2f78'/>
<id>urn:sha1:952d0cbd1f68374203c8abd3fcee5f533caf2f78</id>
<content type='text'>
[ Upstream commit 5133bee62f0ea5d4c316d503cc0040cac5637601 ]

Handling of S_ISGID is usually done by inode_init_owner() in all other
filesystems, but kernfs doesn't use that function.  In kernfs, struct
kernfs_node is the primary data structure, and struct inode is only
created from it on demand.  Therefore, inode_init_owner() can't be
used and we need to imitate its behavior.

S_ISGID support is useful for the cgroup filesystem; it allows
subtrees managed by an unprivileged process to retain a certain owner
gid, which then enables sharing access to the subtree with another
unprivileged process.

--
v1 -&gt; v2: minor coding style fix (comment)

Signed-off-by: Max Kellermann &lt;max.kellermann@ionos.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Link: https://lore.kernel.org/r/20231208093310.297233-2-max.kellermann@ionos.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"</title>
<updated>2024-01-25T23:35:41+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2024-01-09T21:48:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5f93225dc925c43bb95c0a8a1e24f1a28d5370da'/>
<id>urn:sha1:5f93225dc925c43bb95c0a8a1e24f1a28d5370da</id>
<content type='text'>
commit e3977e0609a07d86406029fceea0fd40d7849368 upstream.

This reverts commit dad3fb67ca1cbef87ce700e83a55835e5921ce8a.

The commit converted kernfs_idr_lock to an IRQ-safe raw_spinlock because it
could be acquired while holding an rq lock through bpf_cgroup_from_id().
However, kernfs_idr_lock is held while doing GPF_NOWAIT allocations which
involves acquiring an non-IRQ-safe and non-raw lock leading to the following
lockdep warning:

  =============================
  [ BUG: Invalid wait context ]
  6.7.0-rc5-kzm9g-00251-g655022a45b1c #578 Not tainted
  -----------------------------
  swapper/0/0 is trying to lock:
  dfbcd488 (&amp;c-&gt;lock){....}-{3:3}, at: local_lock_acquire+0x0/0xa4
  other info that might help us debug this:
  context-{5:5}
  2 locks held by swapper/0/0:
   #0: dfbc9c60 (lock){+.+.}-{3:3}, at: local_lock_acquire+0x0/0xa4
   #1: c0c012a8 (kernfs_idr_lock){....}-{2:2}, at: __kernfs_new_node.constprop.0+0x68/0x258
  stack backtrace:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc5-kzm9g-00251-g655022a45b1c #578
  Hardware name: Generic SH73A0 (Flattened Device Tree)
   unwind_backtrace from show_stack+0x10/0x14
   show_stack from dump_stack_lvl+0x68/0x90
   dump_stack_lvl from __lock_acquire+0x3cc/0x168c
   __lock_acquire from lock_acquire+0x274/0x30c
   lock_acquire from local_lock_acquire+0x28/0xa4
   local_lock_acquire from ___slab_alloc+0x234/0x8a8
   ___slab_alloc from __slab_alloc.constprop.0+0x30/0x44
   __slab_alloc.constprop.0 from kmem_cache_alloc+0x7c/0x148
   kmem_cache_alloc from radix_tree_node_alloc.constprop.0+0x44/0xdc
   radix_tree_node_alloc.constprop.0 from idr_get_free+0x110/0x2b8
   idr_get_free from idr_alloc_u32+0x9c/0x108
   idr_alloc_u32 from idr_alloc_cyclic+0x50/0xb8
   idr_alloc_cyclic from __kernfs_new_node.constprop.0+0x88/0x258
   __kernfs_new_node.constprop.0 from kernfs_create_root+0xbc/0x154
   kernfs_create_root from sysfs_init+0x18/0x5c
   sysfs_init from mnt_init+0xc4/0x220
   mnt_init from vfs_caches_init+0x6c/0x88
   vfs_caches_init from start_kernel+0x474/0x528
   start_kernel from 0x0

Let's rever the commit. It's undesirable to spread out raw spinlock usage
anyway and the problem can be solved by protecting the lookup path with RCU
instead.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Andrea Righi &lt;andrea.righi@canonical.com&gt;
Reported-by: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;
Link: http://lkml.kernel.org/r/CAMuHMdV=AKt+mwY7svEq5gFPx41LoSQZ_USME5_MEdWQze13ww@mail.gmail.com
Link: https://lore.kernel.org/r/20240109214828.252092-2-tj@kernel.org
Tested-by: Andrea Righi &lt;andrea.righi@canonical.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>kernfs: convert kernfs_idr_lock to an irq safe raw spinlock</title>
<updated>2024-01-25T23:35:41+00:00</updated>
<author>
<name>Andrea Righi</name>
<email>andrea.righi@canonical.com</email>
</author>
<published>2023-12-29T07:49:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3058183333a574c1de69aa893b9cd5d5dad91acd'/>
<id>urn:sha1:3058183333a574c1de69aa893b9cd5d5dad91acd</id>
<content type='text'>
commit c312828c37a72fe2d033a961c47c227b0767e9f8 upstream.

bpf_cgroup_from_id() is basically a wrapper to cgroup_get_from_id(),
that is relying on kernfs to determine the right cgroup associated to
the target id.

As a kfunc, it has the potential to be attached to any function through
BPF, particularly in contexts where certain locks are held.

However, kernfs is not using an irq safe spinlock for kernfs_idr_lock,
that means any kernfs function that is acquiring this lock can be
interrupted and potentially hit bpf_cgroup_from_id() in the process,
triggering a deadlock.

For example, it is really easy to trigger a lockdep splat between
kernfs_idr_lock and rq-&gt;_lock, attaching a small BPF program to
__set_cpus_allowed_ptr_locked() that just calls bpf_cgroup_from_id():

 =====================================================
 WARNING: HARDIRQ-safe -&gt; HARDIRQ-unsafe lock order detected
 6.7.0-rc7-virtme #5 Not tainted
 -----------------------------------------------------
 repro/131 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 ffffffffb2dc4578 (kernfs_idr_lock){+.+.}-{2:2}, at: kernfs_find_and_get_node_by_id+0x1d/0x80

 and this task is already holding:
 ffff911cbecaf218 (&amp;rq-&gt;__lock){-.-.}-{2:2}, at: task_rq_lock+0x50/0xc0
 which would create a new lock dependency:
  (&amp;rq-&gt;__lock){-.-.}-{2:2} -&gt; (kernfs_idr_lock){+.+.}-{2:2}

 but this new dependency connects a HARDIRQ-irq-safe lock:
  (&amp;rq-&gt;__lock){-.-.}-{2:2}

 ... which became HARDIRQ-irq-safe at:
   lock_acquire+0xbf/0x2b0
   _raw_spin_lock_nested+0x2e/0x40
   scheduler_tick+0x5d/0x170
   update_process_times+0x9c/0xb0
   tick_periodic+0x27/0xe0
   tick_handle_periodic+0x24/0x70
   __sysvec_apic_timer_interrupt+0x64/0x1a0
   sysvec_apic_timer_interrupt+0x6f/0x80
   asm_sysvec_apic_timer_interrupt+0x1a/0x20
   memcpy+0xc/0x20
   arch_dup_task_struct+0x15/0x30
   copy_process+0x1ce/0x1eb0
   kernel_clone+0xac/0x390
   kernel_thread+0x6f/0xa0
   kthreadd+0x199/0x230
   ret_from_fork+0x31/0x50
   ret_from_fork_asm+0x1b/0x30

 to a HARDIRQ-irq-unsafe lock:
  (kernfs_idr_lock){+.+.}-{2:2}

 ... which became HARDIRQ-irq-unsafe at:
 ...
   lock_acquire+0xbf/0x2b0
   _raw_spin_lock+0x30/0x40
   __kernfs_new_node.isra.0+0x83/0x280
   kernfs_create_root+0xf6/0x1d0
   sysfs_init+0x1b/0x70
   mnt_init+0xd9/0x2a0
   vfs_caches_init+0xcf/0xe0
   start_kernel+0x58a/0x6a0
   x86_64_start_reservations+0x18/0x30
   x86_64_start_kernel+0xc5/0xe0
   secondary_startup_64_no_verify+0x178/0x17b

 other info that might help us debug this:

  Possible interrupt unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(kernfs_idr_lock);
                                local_irq_disable();
                                lock(&amp;rq-&gt;__lock);
                                lock(kernfs_idr_lock);
   &lt;Interrupt&gt;
     lock(&amp;rq-&gt;__lock);

  *** DEADLOCK ***

Prevent this deadlock condition converting kernfs_idr_lock to a raw irq
safe spinlock.

The performance impact of this change should be negligible and it also
helps to prevent similar deadlock conditions with any other subsystems
that may depend on kernfs.

Fixes: 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc")
Cc: stable &lt;stable@kernel.org&gt;
Signed-off-by: Andrea Righi &lt;andrea.righi@canonical.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Link: https://lore.kernel.org/r/20231229074916.53547-1-andrea.righi@canonical.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'driver-core-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core</title>
<updated>2023-09-01T16:43:18+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-09-01T16:43:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=28a4f91f5f251689c69155bc6a0b1afc9916c874'/>
<id>urn:sha1:28a4f91f5f251689c69155bc6a0b1afc9916c874</id>
<content type='text'>
Pull driver core updates from Greg KH:
 "Here is a small set of driver core updates and additions for 6.6-rc1.

  Included in here are:

   - stable kernel documentation updates

   - class structure const work from Ivan on various subsystems

   - kernfs tweaks

   - driver core tests!

   - kobject sanity cleanups

   - kobject structure reordering to save space

   - driver core error code handling fixups

   - other minor driver core cleanups

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'driver-core-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits)
  driver core: Call in reversed order in device_platform_notify_remove()
  driver core: Return proper error code when dev_set_name() fails
  kobject: Remove redundant checks for whether ktype is NULL
  kobject: Add sanity check for kset-&gt;kobj.ktype in kset_register()
  drivers: base: test: Add missing MODULE_* macros to root device tests
  drivers: base: test: Add missing MODULE_* macros for platform devices tests
  drivers: base: Free devm resources when unregistering a device
  drivers: base: Add basic devm tests for platform devices
  drivers: base: Add basic devm tests for root devices
  kernfs: fix missing kernfs_iattr_rwsem locking
  docs: stable-kernel-rules: mention that regressions must be prevented
  docs: stable-kernel-rules: fine-tune various details
  docs: stable-kernel-rules: make the examples for option 1 a proper list
  docs: stable-kernel-rules: move text around to improve flow
  docs: stable-kernel-rules: improve structure by changing headlines
  base/node: Remove duplicated include
  kernfs: attach uuid for every kernfs and report it in fsid
  kernfs: add stub helper for kernfs_generic_poll()
  x86/resctrl: make pseudo_lock_class a static const structure
  x86/MSR: make msr_class a static const structure
  ...
</content>
</entry>
<entry>
<title>Merge tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2023-08-28T16:55:25+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-08-28T16:55:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ecd7db20474c3859d4d01f34aaabf41bd28c7d84'/>
<id>urn:sha1:ecd7db20474c3859d4d01f34aaabf41bd28c7d84</id>
<content type='text'>
Pull libfs and tmpfs updates from Christian Brauner:
 "This cycle saw a lot of work for tmpfs that required changes to the
  vfs layer. Andrew, Hugh, and I decided to take tmpfs through vfs this
  cycle. Things will go back to mm next cycle.

  Features
  ========

   - By far the biggest work is the quota support for tmpfs. New tmpfs
     quota infrastructure is added to support it and a new QFMT_SHMEM
     uapi option is exposed.

     This offers user and group quotas to tmpfs (project quotas will be
     added later). Similar to other filesystems tmpfs quota are not
     supported within user namespaces yet.

   - Add support for user xattrs. While tmpfs already supports security
     xattrs (security.*) and POSIX ACLs for a long time it lacked
     support for user xattrs (user.*). With this pull request tmpfs will
     be able to support a limited number of user xattrs.

     This is accompanied by a fix (see below) to limit persistent simple
     xattr allocations.

   - Add support for stable directory offsets. Currently tmpfs relies on
     the libfs provided cursor-based mechanism for readdir. This causes
     issues when a tmpfs filesystem is exported via NFS.

     NFS clients do not open directories. Instead, each server-side
     readdir operation opens the directory, reads it, and then closes
     it. Since the cursor state for that directory is associated with
     the opened file it is discarded after each readdir operation. Such
     directory offsets are not just cached by NFS clients but also
     various userspace libraries based on these clients.

     As it stands there is no way to invalidate the caches when
     directory offsets have changed and the whole application depends on
     unchanging directory offsets.

     At LSFMM we discussed how to solve this problem and decided to
     support stable directory offsets. libfs now allows filesystems like
     tmpfs to use an xarrary to map a directory offset to a dentry. This
     mechanism is currently only used by tmpfs but can be supported by
     others as well.

  Fixes
  =====

   - Change persistent simple xattrs allocations in libfs from
     GFP_KERNEL to GPF_KERNEL_ACCOUNT so they're subject to memory
     cgroup limits. Since this is a change to libfs it affects both
     tmpfs and kernfs.

   - Correctly verify {g,u}id mount options.

     A new filesystem context is created via fsopen() which records the
     namespace that becomes the owning namespace of the superblock when
     fsconfig(FSCONFIG_CMD_CREATE) is called for filesystems that are
     mountable in namespaces. However, fsconfig() calls can occur in a
     namespace different from the namespace where fsopen() has been
     called.

     Currently, when fsconfig() is called to set {g,u}id mount options
     the requested {g,u}id is mapped into a k{g,u}id according to the
     namespace where fsconfig() was called from. The resulting k{g,u}id
     is not guaranteed to be resolvable in the namespace of the
     filesystem (the one that fsopen() was called in).

     This means it's possible for an unprivileged user to create files
     owned by any group in a tmpfs mount since it's possible to set the
     setid bits on the tmpfs directory.

     The contract for {g,u}id mount options and {g,u}id values in
     general set from userspace has always been that they are translated
     according to the caller's idmapping. In so far, tmpfs has been
     doing the correct thing. But since tmpfs is mountable in
     unprivileged contexts it is also necessary to verify that the
     resulting {k,g}uid is representable in the namespace of the
     superblock to avoid such bugs.

     The new mount api's cross-namespace delegation abilities are
     already widely used. Having talked to a bunch of userspace this is
     the most faithful solution with minimal regression risks"

* tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs
  mm: invalidation check mapping before folio_contains
  tmpfs: trivial support for direct IO
  tmpfs,xattr: enable limited user extended attributes
  tmpfs: track free_ispace instead of free_inodes
  xattr: simple_xattr_set() return old_xattr to be freed
  tmpfs: verify {g,u}id mount options correctly
  shmem: move spinlock into shmem_recalc_inode() to fix quota support
  libfs: Remove parent dentry locking in offset_iterate_dir()
  libfs: Add a lock class for the offset map's xa_lock
  shmem: stable directory offsets
  shmem: Refactor shmem_symlink()
  libfs: Add directory operations for stable offsets
  shmem: fix quota lock nesting in huge hole handling
  shmem: Add default quota limit mount options
  shmem: quota support
  shmem: prepare shmem quota infrastructure
  quota: Check presence of quota operation structures instead of -&gt;quota_read and -&gt;quota_write callbacks
  shmem: make shmem_get_inode() return ERR_PTR instead of NULL
  shmem: make shmem_inode_acct_block() return error
</content>
</entry>
</feed>
