<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/btrfs/inode.c, branch linux-7.1.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-7.1.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-7.1.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-05-16T01:06:56+00:00</updated>
<entry>
<title>btrfs: always drop root-&gt;inodes lock before cond_resched()</title>
<updated>2026-05-16T01:06:56+00:00</updated>
<author>
<name>Boris Burkov</name>
<email>boris@bur.io</email>
</author>
<published>2026-05-08T20:11:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=975e63c7a8074d83e195577b7f76dadc9a3d14b7'/>
<id>urn:sha1:975e63c7a8074d83e195577b7f76dadc9a3d14b7</id>
<content type='text'>
find_first_inode() and find_first_inode_to_shrink() lock root-&gt;inodes,
then loop over them, occasionally skipping some inodes. When they skip
an inode, they attempt to share the cpu/lock with cond_resched_lock().

However, that has a subtle problem associated with it.
cond_resched_lock() only drops the lock if it needs to actually call
schedule(). With CONFIG_PREEMPT_NONE, this means the full timeslice as
detected at ticks. With 8+ cpus and default tunables, this is 2.8ms. So
regardless of HZ, we will run for at least 2.8ms in this loop without
dropping the lock, assuming it finds no suitable inodes. If HZ is
small enough, it might be even worse as the tick granularity becomes
bigger than the timeslice.

The knock-on effect of this is that callers to
btrfs_del_inode_from_root() like kswapd trying to shrink the inode slab
or userspace threads calling evict() will spin on xa_lock(&amp;root-&gt;inodes)
for 2.8ms, so the extent map shrinker dominates the lock even though
ostensibly it is intending to share it. This produces memory pressure as
there is only one kswapd and it runs sequentially so it can get stuck in
the inode slab shrinking.

To fix it, simply replace cond_resched_lock() with an open coded variant
which unconditionally does unlock/lock around cond_resched. Sharing the
lock is decoupled from sharing the CPU, and all the users of the lock
now share it fairly.

I was able to reproduce this on test systems by producing a lot of empty
files (to make a big root-&gt;inodes xarray), then producing memory
pressure by reading large files larger than ram, triggering kswapd and
the extent_map shrinker. The lock contention is visible with perf or
lockstat. This patch also relieved a user-apparent bottleneck on a
production system from the original report.

Tested-by: Rik van Riel &lt;riel@surriel.com&gt;
Reviewed-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Boris Burkov &lt;boris@bur.io&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: fix incorrect i_size after remount caused by KEEP_SIZE prealloc gap</title>
<updated>2026-05-07T22:32:08+00:00</updated>
<author>
<name>Robbie Ko</name>
<email>robbieko@synology.com</email>
</author>
<published>2026-05-01T02:41:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c562ba61fc5e11798720acc1b172862158f1fa0b'/>
<id>urn:sha1:c562ba61fc5e11798720acc1b172862158f1fa0b</id>
<content type='text'>
When fallocate() with FALLOC_FL_KEEP_SIZE preallocates an extent past the
current i_size, the file_extent_tree of the inode is updated to cover
that range. However, on the next mount, btrfs_read_locked_inode() only
re-populates file_extent_tree with [0, round_up(i_size, sectorsize)),
losing the marks that belonged to the KEEP_SIZE prealloc extent beyond
i_size.

Later, when a non-KEEP_SIZE fallocate() extends i_size into / past that
old prealloc extent, the reservation loop in btrfs_fallocate() skips
already-prealloc segments and does not call into the path that marks the
file_extent_tree, so a gap remains inside the file_extent_tree across
[old_aligned_i_size, start_of_new_alloc). Then __btrfs_prealloc_file_range()
calls btrfs_inode_safe_disk_i_size_write(), which uses
find_contiguous_extent_bit() starting at offset 0 to derive disk_i_size.
The walk stops at the gap, so disk_i_size ends up smaller than i_size and
gets persisted. After the next mount, the file shows the wrong (smaller)
size.

The following reproducer triggers the problem:

  $ cat test.sh
  MNT=/mnt/sdi
  DEV=/dev/sdi

  mkdir -p $MNT
  mkfs.btrfs -f -O ^no-holes $DEV
  mount $DEV $MNT

  touch $MNT/file1
  # KEEP_SIZE prealloc beyond i_size (i_size stays 0)
  fallocate -n -o 4M -l 4M $MNT/file1
  umount $MNT
  mount $DEV $MNT

  # non-KEEP_SIZE fallocate that overlaps the previous prealloc tail
  # and extends past it
  fallocate -o 7M -l 2M $MNT/file1
  ls -lh $MNT/file1
  umount $MNT
  mount $DEV $MNT
  ls -lh $MNT/file1
  umount $MNT

Running the reproducer gives the following result:

  $ ./test.sh
  (...)
  -rw-rw-r-- 1 root root 9.0M May  4 16:35 /mnt/sdi/file1
  -rw-rw-r-- 1 root root 7.0M May  4 16:35 /mnt/sdi/file1

The size before the second mount is correct (9M), but after the
remount it drops to 7M, i.e. the start of the gap inside file_extent_tree.

Fix this in __btrfs_prealloc_file_range() by marking the entire range
[round_down(old_i_size, sectorsize), round_up(new_i_size, sectorsize))
in file_extent_tree before updating i_size and calling
btrfs_inode_safe_disk_i_size_write(). This ensures the contiguous bit
search starting from 0 is not truncated by a stale gap left behind by a
previous KEEP_SIZE prealloc that was not restored on inode load.

The fix has no effect when the NO_HOLES feature is enabled because
btrfs_inode_safe_disk_i_size_write() and
btrfs_inode_set_file_extent_range()
both take the fast path that directly tracks disk_i_size without
consulting file_extent_tree.

Fixes: 9ddc959e802b ("btrfs: use the file extent tree infrastructure")
Reviewed-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Robbie Ko &lt;robbieko@synology.com&gt;
[ Minor updates to the change log ]
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: fix double-decrement of bytes_may_use in submit_one_async_extent()</title>
<updated>2026-04-21T02:03:08+00:00</updated>
<author>
<name>Mark Harmstone</name>
<email>mark@harmstone.com</email>
</author>
<published>2026-04-16T17:15:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=82323b1a7088b7a5c3e528a5d634bff447fa286f'/>
<id>urn:sha1:82323b1a7088b7a5c3e528a5d634bff447fa286f</id>
<content type='text'>
submit_one_async_extent() calls btrfs_reserve_extent(), which decrements
bytes_may_use. If the call btrfs_create_io_em() fails, we jump to
out_free_reserve, which calls extent_clear_unlock_delalloc().

Because we're specifying EXTENT_DO_ACCOUNTING, i.e.
EXTENT_CLEAR_META_RESV | EXTENT_CLEAR_DATA_RESV, this decreases
bytes_may_use again. This can lead to problems later on, as an initial
write can fail only for the writeback to silently ENOSPC.

Fix this by replacing EXTENT_DO_ACCOUNTING with EXTENT_CLEAR_META_RESV.
This parallels a4fe134fc1d8eb ("btrfs: fix a double release on reserved
extents in cow_one_range()"), which is the same fix in cow_one_range().

Fixes: 151a41bc46df ("Btrfs: fix what bits we clear when erroring out from delalloc")
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: Mark Harmstone &lt;mark@harmstone.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: fix missing last_unlink_trans update when removing a directory</title>
<updated>2026-04-21T02:01:48+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2026-04-09T14:46:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=999757231c49376cd1a37308d2c8c4c9932571e1'/>
<id>urn:sha1:999757231c49376cd1a37308d2c8c4c9932571e1</id>
<content type='text'>
When removing a directory we are not updating its last_unlink_trans field,
which can result in incorrect fsync behaviour in case some one fsyncs the
directory after it was removed because it's holding a file descriptor on
it.

Example scenario:

   mkdir /mnt/dir1
   mkdir /mnt/dir1/dir2
   mkdir /mnt/dir3

   sync -f /mnt

   # Do some change to the directory and fsync it.
   chmod 700 /mnt/dir1
   xfs_io -c fsync /mnt/dir1

   # Move dir2 out of dir1 so that dir1 becomes empty.
   mv /mnt/dir1/dir2 /mnt/dir3/

   open fd on /mnt/dir1
   call rmdir(2) on path "/mnt/dir1"
   fsync fd

   &lt;trigger power failure&gt;

When attempting to mount the filesystem, the log replay will fail with
an -EIO error and dmesg/syslog has the following:

   [445771.626482] BTRFS info (device dm-0): first mount of filesystem 0368bbea-6c5e-44b5-b409-09abe496e650
   [445771.626486] BTRFS info (device dm-0): using crc32c checksum algorithm
   [445771.627912] BTRFS info (device dm-0): start tree-log replay
   [445771.628335] page: refcount:2 mapcount:0 mapping:0000000061443ddc index:0x1d00 pfn:0x7072a5
   [445771.629453] memcg:ffff89f400351b00
   [445771.629892] aops:btree_aops [btrfs] ino:1
   [445771.630737] flags: 0x17fffc00000402a(uptodate|lru|private|writeback|node=0|zone=2|lastcpupid=0x1ffff)
   [445771.632359] raw: 017fffc00000402a fffff47284d950c8 fffff472907b7c08 ffff89f458e412b8
   [445771.633713] raw: 0000000000001d00 ffff89f6c51d1a90 00000002ffffffff ffff89f400351b00
   [445771.635029] page dumped because: eb page dump
   [445771.635825] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=10 ino=258, invalid nlink: has 2 expect no more than 1 for dir
   [445771.638088] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14878 owner 5
   [445771.638091] BTRFS info (device dm-0): refs 4 lock_owner 0 current 3581087
   [445771.638094] 	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
   [445771.638097] 		inode generation 3 transid 9 size 16 nbytes 16384
   [445771.638098] 		block group 0 mode 40755 links 1 uid 0 gid 0
   [445771.638100] 		rdev 0 sequence 2 flags 0x0
   [445771.638102] 		atime 1775744884.0
   [445771.660056] 		ctime 1775744885.645502983
   [445771.660058] 		mtime 1775744885.645502983
   [445771.660060] 		otime 1775744884.0
   [445771.660062] 	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
   [445771.660064] 		index 0 name_len 2
   [445771.660066] 	item 2 key (256 DIR_ITEM 1843588421) itemoff 16077 itemsize 34
   [445771.660068] 		location key (259 1 0) type 2
   [445771.660070] 		transid 9 data_len 0 name_len 4
   [445771.660075] 	item 3 key (256 DIR_ITEM 2363071922) itemoff 16043 itemsize 34
   [445771.660076] 		location key (257 1 0) type 2
   [445771.660077] 		transid 9 data_len 0 name_len 4
   [445771.660078] 	item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34
   [445771.660079] 		location key (257 1 0) type 2
   [445771.660080] 		transid 9 data_len 0 name_len 4
   [445771.660081] 	item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34
   [445771.660082] 		location key (259 1 0) type 2
   [445771.660083] 		transid 9 data_len 0 name_len 4
   [445771.660084] 	item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160
   [445771.660086] 		inode generation 9 transid 9 size 8 nbytes 0
   [445771.660087] 		block group 0 mode 40777 links 1 uid 0 gid 0
   [445771.660088] 		rdev 0 sequence 2 flags 0x0
   [445771.660089] 		atime 1775744885.641174097
   [445771.660090] 		ctime 1775744885.645502983
   [445771.660091] 		mtime 1775744885.645502983
   [445771.660105] 		otime 1775744885.641174097
   [445771.660106] 	item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14
   [445771.660107] 		index 2 name_len 4
   [445771.660108] 	item 8 key (257 DIR_ITEM 2676584006) itemoff 15767 itemsize 34
   [445771.660109] 		location key (258 1 0) type 2
   [445771.660110] 		transid 9 data_len 0 name_len 4
   [445771.660111] 	item 9 key (257 DIR_INDEX 2) itemoff 15733 itemsize 34
   [445771.660112] 		location key (258 1 0) type 2
   [445771.660113] 		transid 9 data_len 0 name_len 4
   [445771.660114] 	item 10 key (258 INODE_ITEM 0) itemoff 15573 itemsize 160
   [445771.660115] 		inode generation 9 transid 10 size 0 nbytes 0
   [445771.660116] 		block group 0 mode 40755 links 2 uid 0 gid 0
   [445771.660117] 		rdev 0 sequence 0 flags 0x0
   [445771.660118] 		atime 1775744885.645502983
   [445771.660119] 		ctime 1775744885.645502983
   [445771.660120] 		mtime 1775744885.645502983
   [445771.660121] 		otime 1775744885.645502983
   [445771.660122] 	item 11 key (258 INODE_REF 257) itemoff 15559 itemsize 14
   [445771.660123] 		index 2 name_len 4
   [445771.660124] 	item 12 key (258 INODE_REF 259) itemoff 15545 itemsize 14
   [445771.660125] 		index 2 name_len 4
   [445771.660126] 	item 13 key (259 INODE_ITEM 0) itemoff 15385 itemsize 160
   [445771.660127] 		inode generation 9 transid 10 size 8 nbytes 0
   [445771.660128] 		block group 0 mode 40755 links 1 uid 0 gid 0
   [445771.660129] 		rdev 0 sequence 1 flags 0x0
   [445771.660130] 		atime 1775744885.645502983
   [445771.660130] 		ctime 1775744885.645502983
   [445771.660131] 		mtime 1775744885.645502983
   [445771.660132] 		otime 1775744885.645502983
   [445771.660133] 	item 14 key (259 INODE_REF 256) itemoff 15371 itemsize 14
   [445771.660134] 		index 3 name_len 4
   [445771.660135] 	item 15 key (259 DIR_ITEM 2676584006) itemoff 15337 itemsize 34
   [445771.660136] 		location key (258 1 0) type 2
   [445771.660137] 		transid 10 data_len 0 name_len 4
   [445771.660138] 	item 16 key (259 DIR_INDEX 2) itemoff 15303 itemsize 34
   [445771.660139] 		location key (258 1 0) type 2
   [445771.660140] 		transid 10 data_len 0 name_len 4
   [445771.660144] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected
   [445771.661650] ------------[ cut here ]------------
   [445771.662358] WARNING: fs/btrfs/disk-io.c:326 at btree_csum_one_bio+0x217/0x230 [btrfs], CPU#8: mount/3581087
   [445771.663588] Modules linked in: btrfs f2fs xfs (...)
   [445771.671229] CPU: 8 UID: 0 PID: 3581087 Comm: mount Tainted: G        W           7.0.0-rc6-btrfs-next-230+ #2 PREEMPT(full)
   [445771.672575] Tainted: [W]=WARN
   [445771.672987] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
   [445771.674460] RIP: 0010:btree_csum_one_bio+0x217/0x230 [btrfs]
   [445771.675222] Code: 89 44 24 (...)
   [445771.677364] RSP: 0018:ffffd23882247660 EFLAGS: 00010246
   [445771.678029] RAX: 0000000000000000 RBX: ffff89f6c51d1a90 RCX: 0000000000000000
   [445771.678975] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff89f406020000
   [445771.679983] RBP: ffff89f821204000 R08: 0000000000000000 R09: 00000000ffefffff
   [445771.680905] R10: ffffd23882247448 R11: 0000000000000003 R12: ffffd23882247668
   [445771.681978] R13: ffff89f458e40fc0 R14: ffff89f737f4f500 R15: ffff89f737f4f500
   [445771.682912] FS:  00007f0447a98840(0000) GS:ffff89fb9771d000(0000) knlGS:0000000000000000
   [445771.684393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [445771.685230] CR2: 00007f0447bf1330 CR3: 000000017cb02002 CR4: 0000000000370ef0
   [445771.686273] Call Trace:
   [445771.686646]  &lt;TASK&gt;
   [445771.686969]  btrfs_submit_bbio+0x83f/0x860 [btrfs]
   [445771.687750]  ? write_one_eb+0x28f/0x340 [btrfs]
   [445771.688428]  btree_writepages+0x2e3/0x550 [btrfs]
   [445771.689180]  ? kmem_cache_alloc_noprof+0x12a/0x490
   [445771.689963]  ? alloc_extent_state+0x19/0x120 [btrfs]
   [445771.690801]  ? kmem_cache_free+0x135/0x380
   [445771.691328]  ? preempt_count_add+0x69/0xa0
   [445771.691831]  ? set_extent_bit+0x252/0x8e0 [btrfs]
   [445771.692468]  ? xas_load+0x9/0xc0
   [445771.692873]  ? xas_find+0x14d/0x1a0
   [445771.693304]  do_writepages+0xc6/0x160
   [445771.693756]  filemap_writeback+0xb8/0xe0
   [445771.694274]  btrfs_write_marked_extents+0x61/0x170 [btrfs]
   [445771.694999]  btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs]
   [445771.695818]  btrfs_commit_transaction+0x5c8/0xd10 [btrfs]
   [445771.696530]  ? kmem_cache_free+0x135/0x380
   [445771.697120]  ? release_extent_buffer+0x34/0x160 [btrfs]
   [445771.697786]  btrfs_recover_log_trees+0x7be/0x7e0 [btrfs]
   [445771.698525]  ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
   [445771.699206]  open_ctree+0x11e5/0x1810 [btrfs]
   [445771.699776]  btrfs_get_tree.cold+0xb/0x162 [btrfs]
   [445771.700463]  ? fscontext_read+0x165/0x180
   [445771.701146]  ? rw_verify_area+0x50/0x180
   [445771.701866]  vfs_get_tree+0x25/0xd0
   [445771.702491]  vfs_cmd_create+0x59/0xe0
   [445771.703125]  __do_sys_fsconfig+0x303/0x610
   [445771.703603]  do_syscall_64+0xe9/0xf20
   [445771.703974]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [445771.704700] RIP: 0033:0x7f0447cbd4aa
   [445771.705108] Code: 73 01 c3 (...)
   [445771.707263] RSP: 002b:00007ffc4e528318 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
   [445771.708107] RAX: ffffffffffffffda RBX: 00005561585d8c20 RCX: 00007f0447cbd4aa
   [445771.708931] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
   [445771.709744] RBP: 00005561585d9120 R08: 0000000000000000 R09: 0000000000000000
   [445771.710674] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [445771.711477] R13: 00007f0447e4f580 R14: 00007f0447e5126c R15: 00007f0447e36a23
   [445771.712277]  &lt;/TASK&gt;
   [445771.712541] ---[ end trace 0000000000000000 ]---
   [445771.713382] BTRFS error (device dm-0): error while writing out transaction: -5
   [445771.714679] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
   [445771.715562] BTRFS error (device dm-0 state A): Transaction aborted (error -5)
   [445771.716459] BTRFS: error (device dm-0 state A) in cleanup_transaction:2068: errno=-5 IO failure
   [445771.717936] BTRFS error (device dm-0 state EA): failed to recover log trees with error: -5
   [445771.719681] BTRFS error (device dm-0 state EA): open_ctree failed: -5

The problem is that such a fsync should have result in a fallback to a
transaction commit, but that did not happen because through the
btrfs_rmdir() we never update the directory's last_unlink_trans field.
Any inode that had a link removed must have its last_unlink_trans updated
to the ID of transaction used for the operation, otherwise fsync and log
replay will not work correctly.

btrfs_rmdir() calls btrfs_unlink_inode() and through that call chain we
never call btrfs_record_unlink_dir() in order to update last_unlink_trans.
However btrfs_unlink(), which is used for unlinking regular files, calls
btrfs_record_unlink_dir() and then calls btrfs_unlink_inode(). So fix
this by moving the call to btrfs_record_unlink_dir() from btrfs_unlink()
to btrfs_unlink_inode().

A test case for fstests will follow soon.

Reported-by: Slava0135 &lt;slava.kovalevskiy.2014@gmail.com&gt;
Link: https://lore.kernel.org/linux-btrfs/CAAJYhww5ov62Hm+n+tmhcL-e_4cBobg+OWogKjOJxVUXivC=MQ@mail.gmail.com/
CC: stable@vger.kernel.org
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: skip clearing EXTENT_DEFRAG for NOCOW ordered extents</title>
<updated>2026-04-07T17:43:22+00:00</updated>
<author>
<name>Dave Chen</name>
<email>davechen@synology.com</email>
</author>
<published>2026-03-30T03:31:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e0dfaebb8f4a1de59a8b805d600e3b662b235efc'/>
<id>urn:sha1:e0dfaebb8f4a1de59a8b805d600e3b662b235efc</id>
<content type='text'>
In btrfs_finish_one_ordered(), clear_bits is unconditionally initialized
with EXTENT_DEFRAG.  For NOCOW ordered extents this is always a no-op
because should_nocow() already forces the COW path when EXTENT_DEFRAG is
set, so a NOCOW ordered extent can never have EXTENT_DEFRAG on its range.

Although harmless, the unconditional btrfs_clear_extent_bit() call still
performs a cold rbtree lookup under the io tree spinlock on every NOCOW
write completion.  Avoid this by only adding EXTENT_DEFRAG to clear_bits
for non-NOCOW ordered extents, and skip the call entirely when there are
no bits to clear.

Signed-off-by: Dave Chen &lt;davechen@synology.com&gt;
Signed-off-by: Robbie Ko &lt;robbieko@synology.com&gt;
Reviewed-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: tag as unlikely if statements that check for fs in error state</title>
<updated>2026-04-07T17:41:42+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2026-04-01T17:51:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7801f3ea9591cf040f7f92c44f8ec91eaa0d6207'/>
<id>urn:sha1:7801f3ea9591cf040f7f92c44f8ec91eaa0d6207</id>
<content type='text'>
Having the filesystem in an error state, meaning we had a transaction
abort, is unexpected. Mark every check for the error state with the
unlikely annotation to convey that and to allow the compiler to generate
better code.

On x86_64, using gcc 14.2.0-19 from Debian, resulted in a slightly
reduced object size and better code.

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2008598	 175912	  15592	2200102	 219226	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2008450	 175912	  15592	2199954	 219192	fs/btrfs/btrfs.ko

Reviewed-by: Anand Jain &lt;asj@kernel.org&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: prevent direct reclaim during compressed readahead</title>
<updated>2026-04-07T16:56:08+00:00</updated>
<author>
<name>JP Kobryn (Meta)</name>
<email>jp.kobryn@linux.dev</email>
</author>
<published>2026-03-28T21:46:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7ae37b2c94ed30bfefece2b68c727a4474206718'/>
<id>urn:sha1:7ae37b2c94ed30bfefece2b68c727a4474206718</id>
<content type='text'>
Under memory pressure, direct reclaim can kick in during compressed
readahead. This puts the associated task into D-state. Then shrink_lruvec()
disables interrupts when acquiring the LRU lock. Under heavy pressure,
we've observed reclaim can run long enough that the CPU becomes prone to
CSD lock stalls since it cannot service incoming IPIs. Although the CSD
lock stalls are the worst case scenario, we have found many more subtle
occurrences of this latency on the order of seconds, over a minute in some
cases.

Prevent direct reclaim during compressed readahead. This is achieved by
using different GFP flags at key points when the bio is marked for
readahead.

There are two functions that allocate during compressed readahead:
btrfs_alloc_compr_folio() and add_ra_bio_pages(). Both currently use
GFP_NOFS which includes __GFP_DIRECT_RECLAIM.

For the internal API call btrfs_alloc_compr_folio(), the signature changes
to accept an additional gfp_t parameter. At the readahead call site, it
gets flags similar to GFP_NOFS but stripped of __GFP_DIRECT_RECLAIM.
__GFP_NOWARN is added since these allocations are allowed to fail. Demand
reads still use full GFP_NOFS and will enter reclaim if needed. All other
existing call sites of btrfs_alloc_compr_folio() now explicitly pass
GFP_NOFS to retain their current behavior.

add_ra_bio_pages() gains a bool parameter which allows callers to specify
if they want to allow direct reclaim or not. In either case, the
__GFP_NOWARN flag was added unconditionally since the allocations are
speculative.

There has been some previous work done on calling add_ra_bio_pages() [0].
This patch is complementary: where that patch reduces call frequency, this
patch reduces the latency associated with those calls.

[0] https://lore.kernel.org/linux-btrfs/656838ec1232314a2657716e59f4f15a8eadba64.1751492111.git.boris@bur.io/

Reviewed-by: Mark Harmstone &lt;mark@harmstone.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: JP Kobryn (Meta) &lt;jp.kobryn@linux.dev&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: fix unnecessary flush on close when truncating zero-sized files</title>
<updated>2026-04-07T16:56:06+00:00</updated>
<author>
<name>Dave Chen</name>
<email>davechen@synology.com</email>
</author>
<published>2026-03-23T03:43:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0e6a169c6487ca3a13d19a9ec6dc9673af5a1cf7'/>
<id>urn:sha1:0e6a169c6487ca3a13d19a9ec6dc9673af5a1cf7</id>
<content type='text'>
In btrfs_setsize(), when a file is truncated to size 0, the
BTRFS_INODE_FLUSH_ON_CLOSE flag is unconditionally set to ensure
pending writes get flushed on close. This flag was designed to protect
the "truncate-then-rewrite" pattern, where an application truncates a
file with existing data down to zero and writes new content, ensuring
the new data reach disk on close.

However, when a file already has a size of 0 (e.g. a newly created
file opened with O_CREAT | O_TRUNC), oldsize and newsize are both 0.
In this case, setting BTRFS_INODE_FLUSH_ON_CLOSE is unnecessary because
no "good data" was truncated away. The subsequent filemap_flush() in
btrfs_release_file() then triggers avoidable writeback that disrupts
the normal delayed writeback batching, adding I/O overhead.

This comes from a real workload. A backup service creates temporary
files via mkstemp(), closes them, and later reopens them with O_TRUNC
for writing. The O_TRUNC is defensive.  The file creation and usage is
done by a different component, so removing the unneeded truncation is
not straightforward.  This pattern repeats for a large number of files
each close() triggers an unnecessary filemap_flush().

Signed-off-by: Dave Chen &lt;davechen@synology.com&gt;
Signed-off-by: Robbie Ko &lt;robbieko@synology.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: prefer IS_ERR_OR_NULL() over manual NULL check</title>
<updated>2026-04-07T16:56:02+00:00</updated>
<author>
<name>Philipp Hahn</name>
<email>phahn-oss@avm.de</email>
</author>
<published>2026-03-10T11:48:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e5267796482fad93ee7948c7cbc37f32046244f7'/>
<id>urn:sha1:e5267796482fad93ee7948c7cbc37f32046244f7</id>
<content type='text'>
Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
check.

IS_ERR_OR_NULL() already uses likely(!ptr) internally. checkpatch does
not like nesting it:
&gt; WARNING: nested (un)?likely() calls, IS_ERR_OR_NULL already uses
&gt; unlikely() internally
Remove the explicit use of likely().

Change generated with coccinelle.

Signed-off-by: Philipp Hahn &lt;phahn-oss@avm.de&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: extract inlined creation into a dedicated delalloc helper</title>
<updated>2026-04-07T16:56:01+00:00</updated>
<author>
<name>Qu Wenruo</name>
<email>wqu@suse.com</email>
</author>
<published>2026-03-04T05:18:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3eaf5f082c4cc71dea70bee23355bf63b24df303'/>
<id>urn:sha1:3eaf5f082c4cc71dea70bee23355bf63b24df303</id>
<content type='text'>
Currently we call cow_file_range_inline() in different situations, from
regular cow_file_range() to compress_file_range().

This is because inline extent creation has different conditions based on
whether it's a compressed one or not.

But on the other hand, inline extent creation shouldn't be so
distributed, we can just have a dedicated branch in
btrfs_run_delalloc_range().

It will become more obvious for compressed inline cases, it makes no
sense to go through all the complex async extent mechanism just to
inline a single block.

So here we introduce a dedicated run_delalloc_inline() helper, and
remove all inline related handling from cow_file_range() and
compress_file_range().

There is a special update to inode_need_compress(), that a new
@check_inline parameter is introduced.
This is to allow inline specific checks to be done inside
run_delalloc_inline(), which allows single block compression, but
other call sites should always reject single block compression.

Reviewed-by: Boris Burkov &lt;boris@bur.io&gt;
Signed-off-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
</feed>
