summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2026-03-19btrfs: abort transaction on failure to update root in the received subvol ioctlFilipe Manana1-1/+2
commit 0f475ee0ebce5c9492b260027cd95270191675fa upstream. If we failed to update the root we don't abort the transaction, which is wrong since we already used the transaction to remove an item from the uuid tree. Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree") CC: stable@vger.kernel.org # 3.12+ Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19btrfs: add missing RCU unlock in error path in ↵Bart Van Assche1-0/+1
try_release_subpage_extent_buffer() commit b2840e33127ce0eea880504b7f133e780f567a9b upstream. Call rcu_read_lock() before exiting the loop in try_release_subpage_extent_buffer() because there is a rcu_read_unlock() call past the loop. This has been detected by the Clang thread-safety analyzer. Fixes: ad580dfa388f ("btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()") CC: stable@vger.kernel.org # 6.18+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19btrfs: fix transaction abort on set received ioctl due to item overflowFilipe Manana3-2/+59
commit 87f2c46003fce4d739138aab4af1942b1afdadac upstream. If the set received ioctl fails due to an item overflow when attempting to add the BTRFS_UUID_KEY_RECEIVED_SUBVOL we have to abort the transaction since we did some metadata updates before. This means that if a user calls this ioctl with the same received UUID field for a lot of subvolumes, we will hit the overflow, trigger the transaction abort and turn the filesystem into RO mode. A malicious user could exploit this, and this ioctl does not even requires that a user has admin privileges (CAP_SYS_ADMIN), only that he/she owns the subvolume. Fix this by doing an early check for item overflow before starting a transaction. This is also race safe because we are holding the subvol_sem semaphore in exclusive (write) mode. A test case for fstests will follow soon. Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree") CC: stable@vger.kernel.org # 3.12+ Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19btrfs: fix transaction abort on file creation due to name hash collisionFilipe Manana1-0/+19
commit 2d1ababdedd4ba38867c2500eb7f95af5ddeeef7 upstream. If we attempt to create several files with names that result in the same hash, we have to pack them in same dir item and that has a limit inherent to the leaf size. However if we reach that limit, we trigger a transaction abort and turns the filesystem into RO mode. This allows for a malicious user to disrupt a system, without the need to have administration privileges/capabilities. Reproducer: $ cat exploit-hash-collisions.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi # Use smallest node size to make the test faster and require fewer file # names that result in hash collision. mkfs.btrfs -f --nodesize 4K $DEV mount $DEV $MNT # List of names that result in the same crc32c hash for btrfs. declare -a names=( 'foobar' '%a8tYkxfGMLWRGr55QSeQc4PBNH9PCLIvR6jZnkDtUUru1t@RouaUe_L:@xGkbO3nCwvLNYeK9vhE628gss:T$yZjZ5l-Nbd6CbC$M=hqE-ujhJICXyIxBvYrIU9-TDC' 'AQci3EUB%shMsg-N%frgU:02ByLs=IPJU0OpgiWit5nexSyxZDncY6WB:=zKZuk5Zy0DD$Ua78%MelgBuMqaHGyKsJUFf9s=UW80PcJmKctb46KveLSiUtNmqrMiL9-Y0I_l5Fnam04CGIg=8@U:Z' 'CvVqJpJzueKcuA$wqwePfyu7VxuWNN3ho$p0zi2H8QFYK$7YlEqOhhb%:hHgjhIjW5vnqWHKNP4' 'ET:vk@rFU4tsvMB0$C_p=xQHaYZjvoF%-BTc%wkFW8yaDAPcCYoR%x$FH5O:' 'HwTon%v7SGSP4FE08jBwwiu5aot2CFKXHTeEAa@38fUcNGOWvE@Mz6WBeDH_VooaZ6AgsXPkVGwy9l@@ZbNXabUU9csiWrrOp0MWUdfi$EZ3w9GkIqtz7I_eOsByOkBOO' 'Ij%2VlFGXSuPvxJGf5UWy6O@1svxGha%b@=%wjkq:CIgE6u7eJOjmQY5qTtxE2Rjbis9@us' 'KBkjG5%9R8K9sOG8UTnAYjxLNAvBmvV5vz3IiZaPmKuLYO03-6asI9lJ_j4@6Xo$KZicaLWJ3Pv8XEwVeUPMwbHYWwbx0pYvNlGMO9F:ZhHAwyctnGy%_eujl%WPd4U2BI7qooOSr85J-C2V$LfY' 'NcRfDfuUQ2=zP8K3CCF5dFcpfiOm6mwenShsAb_F%n6GAGC7fT2JFFn:c35X-3aYwoq7jNX5$ZJ6hI3wnZs$7KgGi7wjulffhHNUxAT0fRRLF39vJ@NvaEMxsMO' 'Oj42AQAEzRoTxa5OuSKIr=A_lwGMy132v4g3Pdq1GvUG9874YseIFQ6QU' 'Ono7avN5GjC:_6dBJ_' 'WHmN2gnmaN-9dVDy4aWo:yNGFzz8qsJyJhWEWcud7$QzN2D9R0efIWWEdu5kwWr73NZm4=@CoCDxrrZnRITr-kGtU_cfW2:%2_am' 'WiFnuTEhAG9FEC6zopQmj-A-$LDQ0T3WULz%ox3UZAPybSV6v1Z$b4L_XBi4M4BMBtJZpz93r9xafpB77r:lbwvitWRyo$odnAUYlYMmU4RvgnNd--e=I5hiEjGLETTtaScWlQp8mYsBovZwM2k' 'XKyH=OsOAF3p%uziGF_ZVr$ivrvhVgD@1u%5RtrV-gl_vqAwHkK@x7YwlxX3qT6WKKQ%PR56NrUBU2dOAOAdzr2=5nJuKPM-T-$ZpQfCL7phxQbUcb:BZOTPaFExc-qK-gDRCDW2' 'd3uUR6OFEwZr%ns1XH_@tbxA@cCPmbBRLdyh7p6V45H$P2$F%w0RqrD3M0g8aGvWpoTFMiBdOTJXjD:JF7=h9a_43xBywYAP%r$SPZi%zDg%ql-KvkdUCtF9OLaQlxmd' 'ePTpbnit%hyNm@WELlpKzNZYOzOTf8EQ$sEfkMy1VOfIUu3coyvIr13-Y7Sv5v-Ivax2Go_GQRFMU1b3362nktT9WOJf3SpT%z8sZmM3gvYQBDgmKI%%RM-G7hyrhgYflOw%z::ZRcv5O:lDCFm' 'evqk743Y@dvZAiG5J05L_ROFV@$2%rVWJ2%3nxV72-W7$e$-SK3tuSHA2mBt$qloC5jwNx33GmQUjD%akhBPu=VJ5g$xhlZiaFtTrjeeM5x7dt4cHpX0cZkmfImndYzGmvwQG:$euFYmXn$_2rA9mKZ' 'gkgUtnihWXsZQTEkrMAWIxir09k3t7jk_IK25t1:cy1XWN0GGqC%FrySdcmU7M8MuPO_ppkLw3=Dfr0UuBAL4%GFk2$Ma10V1jDRGJje%Xx9EV2ERaWKtjpwiZwh0gCSJsj5UL7CR8RtW5opCVFKGGy8Cky' 'hNgsG_8lNRik3PvphqPm0yEH3P%%fYG:kQLY=6O-61Wa6nrV_WVGR6TLB09vHOv%g4VQRP8Gzx7VXUY1qvZyS' 'isA7JVzN12xCxVPJZ_qoLm-pTBuhjjHMvV7o=F:EaClfYNyFGlsfw-Kf%uxdqW-kwk1sPl2vhbjyHU1A6$hz' 'kiJ_fgcdZFDiOptjgH5PN9-PSyLO4fbk_:u5_2tz35lV_iXiJ6cx7pwjTtKy-XGaQ5IefmpJ4N_ZqGsqCsKuqOOBgf9LkUdffHet@Wu' 'lvwtxyhE9:%Q3UxeHiViUyNzJsy:fm38pg_b6s25JvdhOAT=1s0$pG25x=LZ2rlHTszj=gN6M4zHZYr_qrB49i=pA--@WqWLIuX7o1S_SfS@2FSiUZN' 'rC24cw3UBDZ=5qJBUMs9e$=S4Y94ni%Z8639vnrGp=0Hv4z3dNFL0fBLmQ40=EYIY:Z=SLc@QLMSt2zsss2ZXrP7j4=' 'uwGl2s-fFrf@GqS=DQqq2I0LJSsOmM%xzTjS:lzXguE3wChdMoHYtLRKPvfaPOZF2fER@j53evbKa7R%A7r4%YEkD=kicJe@SFiGtXHbKe4gCgPAYbnVn' 'UG37U6KKua2bgc:IHzRs7BnB6FD:2Mt5Cc5NdlsW%$1tyvnfz7S27FvNkroXwAW:mBZLA1@qa9WnDbHCDmQmfPMC9z-Eq6QT0jhhPpqyymaD:R02ghwYo%yx7SAaaq-:x33LYpei$5g8DMl3C' 'y2vjek0FE1PDJC0qpfnN:x8k2wCFZ9xiUF2ege=JnP98R%wxjKkdfEiLWvQzmnW' '8-HCSgH5B%K7P8_jaVtQhBXpBk:pE-$P7ts58U0J@iR9YZntMPl7j$s62yAJO@_9eanFPS54b=UTw$94C-t=HLxT8n6o9P=QnIxq-f1=Ne2dvhe6WbjEQtc' 'YPPh:IFt2mtR6XWSmjHptXL_hbSYu8bMw-JP8@PNyaFkdNFsk$M=xfL6LDKCDM-mSyGA_2MBwZ8Dr4=R1D%7-mCaaKGxb990jzaagRktDTyp' '9hD2ApKa_t_7x-a@GCG28kY:7$M@5udI1myQ$x5udtggvagmCQcq9QXWRC5hoB0o-_zHQUqZI5rMcz_kbMgvN5jr63LeYA4Cj-c6F5Ugmx6DgVf@2Jqm%MafecpgooqreJ53P-QTS' ) # Now create files with all those names in the same parent directory. # It should not fail since a 4K leaf has enough space for them. for name in "${names[@]}"; do touch $MNT/$name done # Now add one more file name that causes a crc32c hash collision. # This should fail, but it should not turn the filesystem into RO mode # (which could be exploited by malicious users) due to a transaction # abort. touch $MNT/'W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt' # Check that we are able to create another file, with a name that does not cause # a crc32c hash collision. echo -n "hello world" > $MNT/baz # Unmount and mount again, verify file baz exists and with the right content. umount $MNT mount $DEV $MNT echo "File baz content: $(cat $MNT/baz)" umount $MNT When running the reproducer: $ ./exploit-hash-collisions.sh (...) touch: cannot touch '/mnt/sdi/W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt': Value too large for defined data type ./exploit-hash-collisions.sh: line 57: /mnt/sdi/baz: Read-only file system cat: /mnt/sdi/baz: No such file or directory File baz content: And the transaction abort stack trace in dmesg/syslog: $ dmesg (...) [758240.509761] ------------[ cut here ]------------ [758240.510668] BTRFS: Transaction aborted (error -75) [758240.511577] WARNING: fs/btrfs/inode.c:6854 at btrfs_create_new_inode+0x805/0xb50 [btrfs], CPU#6: touch/888644 [758240.513513] Modules linked in: btrfs dm_zero (...) [758240.523221] CPU: 6 UID: 0 PID: 888644 Comm: touch Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full) [758240.524621] Tainted: [W]=WARN [758240.525037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [758240.526331] RIP: 0010:btrfs_create_new_inode+0x80b/0xb50 [btrfs] [758240.527093] Code: 0f 82 cf (...) [758240.529211] RSP: 0018:ffffce64418fbb48 EFLAGS: 00010292 [758240.529935] RAX: 00000000ffffffd3 RBX: 0000000000000000 RCX: 00000000ffffffb5 [758240.531040] RDX: 0000000d04f33e06 RSI: 00000000ffffffb5 RDI: ffffffffc0919dd0 [758240.531920] RBP: ffffce64418fbc10 R08: 0000000000000000 R09: 00000000ffffffb5 [758240.532928] R10: 0000000000000000 R11: ffff8e52c0000000 R12: ffff8e53eee7d0f0 [758240.533818] R13: ffff8e57f70932a0 R14: ffff8e5417629568 R15: 0000000000000000 [758240.534664] FS: 00007f1959a2a740(0000) GS:ffff8e5b27cae000(0000) knlGS:0000000000000000 [758240.535821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [758240.536644] CR2: 00007f1959b10ce0 CR3: 000000012a2cc005 CR4: 0000000000370ef0 [758240.537517] Call Trace: [758240.537828] <TASK> [758240.538099] btrfs_create_common+0xbf/0x140 [btrfs] [758240.538760] path_openat+0x111a/0x15b0 [758240.539252] do_filp_open+0xc2/0x170 [758240.539699] ? preempt_count_add+0x47/0xa0 [758240.540200] ? __virt_addr_valid+0xe4/0x1a0 [758240.540800] ? __check_object_size+0x1b3/0x230 [758240.541661] ? alloc_fd+0x118/0x180 [758240.542315] do_sys_openat2+0x70/0xd0 [758240.543012] __x64_sys_openat+0x50/0xa0 [758240.543723] do_syscall_64+0x50/0xf20 [758240.544462] entry_SYSCALL_64_after_hwframe+0x76/0x7e [758240.545397] RIP: 0033:0x7f1959abc687 [758240.546019] Code: 48 89 fa (...) [758240.548522] RSP: 002b:00007ffe16ff8690 EFLAGS: 00000202 ORIG_RAX: 0000000000000101 [758240.566278] RAX: ffffffffffffffda RBX: 00007f1959a2a740 RCX: 00007f1959abc687 [758240.567068] RDX: 0000000000000941 RSI: 00007ffe16ffa333 RDI: ffffffffffffff9c [758240.567860] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [758240.568707] R10: 00000000000001b6 R11: 0000000000000202 R12: 0000561eec7c4b90 [758240.569712] R13: 0000561eec7c311f R14: 00007ffe16ffa333 R15: 0000000000000000 [758240.570758] </TASK> [758240.571040] ---[ end trace 0000000000000000 ]--- [758240.571681] BTRFS: error (device sdi state A) in btrfs_create_new_inode:6854: errno=-75 unknown [758240.572899] BTRFS info (device sdi state EA): forced readonly Fix this by checking for hash collision, and if the adding a new name is possible, early in btrfs_create_new_inode() before we do any tree updates, so that we don't need to abort the transaction if we cannot add the new name due to the leaf size limit. A test case for fstests will be sent soon. Fixes: caae78e03234 ("btrfs: move common inode creation code into btrfs_create_new_inode()") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19btrfs: fix transaction abort when snapshotting received subvolumesFilipe Manana1-0/+16
commit e1b18b959025e6b5dbad668f391f65d34b39595a upstream. Currently a user can trigger a transaction abort by snapshotting a previously received snapshot a bunch of times until we reach a BTRFS_UUID_KEY_RECEIVED_SUBVOL item overflow (the maximum item size we can store in a leaf). This is very likely not common in practice, but if it happens, it turns the filesystem into RO mode. The snapshot, send and set_received_subvol and subvol_setflags (used by receive) don't require CAP_SYS_ADMIN, just inode_owner_or_capable(). A malicious user could use this to turn a filesystem into RO mode and disrupt a system. Reproducer script: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi # Use smallest node size to make the test faster. mkfs.btrfs -f --nodesize 4K $DEV mount $DEV $MNT # Create a subvolume and set it to RO so that it can be used for send. btrfs subvolume create $MNT/sv touch $MNT/sv/foo btrfs property set $MNT/sv ro true # Send and receive the subvolume into snaps/sv. mkdir $MNT/snaps btrfs send $MNT/sv | btrfs receive $MNT/snaps # Now snapshot the received subvolume, which has a received_uuid, a # lot of times to trigger the leaf overflow. total=500 for ((i = 1; i <= $total; i++)); do echo -ne "\rCreating snapshot $i/$total" btrfs subvolume snapshot -r $MNT/snaps/sv $MNT/snaps/sv_$i > /dev/null done echo umount $MNT When running the test: $ ./test.sh (...) Create subvolume '/mnt/sdi/sv' At subvol /mnt/sdi/sv At subvol sv Creating snapshot 496/500ERROR: Could not create subvolume: Value too large for defined data type Creating snapshot 497/500ERROR: Could not create subvolume: Read-only file system Creating snapshot 498/500ERROR: Could not create subvolume: Read-only file system Creating snapshot 499/500ERROR: Could not create subvolume: Read-only file system Creating snapshot 500/500ERROR: Could not create subvolume: Read-only file system And in dmesg/syslog: $ dmesg (...) [251067.627338] BTRFS warning (device sdi): insert uuid item failed -75 (0x4628b21c4ac8d898, 0x2598bee2b1515c91) type 252! [251067.629212] ------------[ cut here ]------------ [251067.630033] BTRFS: Transaction aborted (error -75) [251067.630871] WARNING: fs/btrfs/transaction.c:1907 at create_pending_snapshot.cold+0x52/0x465 [btrfs], CPU#10: btrfs/615235 [251067.632851] Modules linked in: btrfs dm_zero (...) [251067.644071] CPU: 10 UID: 0 PID: 615235 Comm: btrfs Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full) [251067.646165] Tainted: [W]=WARN [251067.646733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [251067.648735] RIP: 0010:create_pending_snapshot.cold+0x55/0x465 [btrfs] [251067.649984] Code: f0 48 0f (...) [251067.653313] RSP: 0018:ffffce644908fae8 EFLAGS: 00010292 [251067.653987] RAX: 00000000ffffff01 RBX: ffff8e5639e63a80 RCX: 00000000ffffffd3 [251067.655042] RDX: ffff8e53faa76b00 RSI: 00000000ffffffb5 RDI: ffffffffc0919750 [251067.656077] RBP: ffffce644908fbd8 R08: 0000000000000000 R09: ffffce644908f820 [251067.657068] R10: ffff8e5adc1fffa8 R11: 0000000000000003 R12: ffff8e53c0431bd0 [251067.658050] R13: ffff8e5414593600 R14: ffff8e55efafd000 R15: 00000000ffffffb5 [251067.659019] FS: 00007f2a4944b3c0(0000) GS:ffff8e5b27dae000(0000) knlGS:0000000000000000 [251067.660115] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [251067.660943] CR2: 00007ffc5aa57898 CR3: 00000005813a2003 CR4: 0000000000370ef0 [251067.661972] Call Trace: [251067.662292] <TASK> [251067.662653] create_pending_snapshots+0x97/0xc0 [btrfs] [251067.663413] btrfs_commit_transaction+0x26e/0xc00 [btrfs] [251067.664257] ? btrfs_qgroup_convert_reserved_meta+0x35/0x390 [btrfs] [251067.665238] ? _raw_spin_unlock+0x15/0x30 [251067.665837] ? record_root_in_trans+0xa2/0xd0 [btrfs] [251067.666531] btrfs_mksubvol+0x330/0x580 [btrfs] [251067.667145] btrfs_mksnapshot+0x74/0xa0 [btrfs] [251067.667827] __btrfs_ioctl_snap_create+0x194/0x1d0 [btrfs] [251067.668595] btrfs_ioctl_snap_create_v2+0x107/0x130 [btrfs] [251067.669479] btrfs_ioctl+0x1580/0x2690 [btrfs] [251067.670093] ? count_memcg_events+0x6d/0x180 [251067.670849] ? handle_mm_fault+0x1a0/0x2a0 [251067.671652] __x64_sys_ioctl+0x92/0xe0 [251067.672406] do_syscall_64+0x50/0xf20 [251067.673129] entry_SYSCALL_64_after_hwframe+0x76/0x7e [251067.674096] RIP: 0033:0x7f2a495648db [251067.674812] Code: 00 48 89 (...) [251067.678227] RSP: 002b:00007ffc5aa57840 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [251067.679691] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2a495648db [251067.681145] RDX: 00007ffc5aa588b0 RSI: 0000000050009417 RDI: 0000000000000004 [251067.682511] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 [251067.683842] R10: 000000000000000a R11: 0000000000000246 R12: 00007ffc5aa59910 [251067.685176] R13: 00007ffc5aa588b0 R14: 0000000000000004 R15: 0000000000000006 [251067.686524] </TASK> [251067.686972] ---[ end trace 0000000000000000 ]--- [251067.687890] BTRFS: error (device sdi state A) in create_pending_snapshot:1907: errno=-75 unknown [251067.689049] BTRFS info (device sdi state EA): forced readonly [251067.689054] BTRFS warning (device sdi state EA): Skipping commit of aborted transaction. [251067.690119] BTRFS: error (device sdi state EA) in cleanup_transaction:2043: errno=-75 unknown [251067.702028] BTRFS info (device sdi state EA): last unmount of filesystem 46dc3975-30a2-4a69-a18f-418b859cccda Fix this by ignoring -EOVERFLOW errors from btrfs_uuid_tree_add() in the snapshot creation code when attempting to add the BTRFS_UUID_KEY_RECEIVED_SUBVOL item. This is OK because it's not critical and we are still able to delete the snapshot, as snapshot/subvolume deletion ignores if a BTRFS_UUID_KEY_RECEIVED_SUBVOL is missing (see inode.c:btrfs_delete_subvolume()). As for send/receive, we can still do send/receive operations since it always peeks the first root ID in the existing BTRFS_UUID_KEY_RECEIVED_SUBVOL (it could peek any since all snapshots have the same content), and even if the key is missing, it falls back to searching by BTRFS_UUID_KEY_SUBVOL key. A test case for fstests will be sent soon. Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree") CC: stable@vger.kernel.org # 3.12+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19btrfs: fix chunk map leak in btrfs_map_block() after ↵Mark Harmstone1-2/+4
btrfs_chunk_map_num_copies() commit f15fb3d41543244d1179f423da4a4832a55bc050 upstream. Fix a chunk map leak in btrfs_map_block(): if we return early with -EINVAL, we're not freeing the chunk map that we've just looked up. Fixes: 0ae653fbec2b ("btrfs: reduce chunk_map lookups in btrfs_map_block()") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19btrfs: hold space_info->lock when clearing periodic reclaim readySun YangKai1-1/+4
[ Upstream commit b8883b61f2fc50dcf22938cbed40fec05020552f ] btrfs_set_periodic_reclaim_ready() requires space_info->lock to be held, as enforced by lockdep_assert_held(). However, btrfs_reclaim_sweep() was calling it after do_reclaim_sweep() returns, at which point space_info->lock is no longer held. Fix this by explicitly acquiring space_info->lock before clearing the periodic reclaim ready flag in btrfs_reclaim_sweep(). Reported-by: Chris Mason <clm@meta.com> Link: https://lore.kernel.org/linux-btrfs/20260208182556.891815-1-clm@meta.com/ Fixes: 19eff93dc738 ("btrfs: fix periodic reclaim condition") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12btrfs: zoned: fixup last alloc pointer after extent removal for RAID0/10Naohiro Aota1-15/+179
[ Upstream commit 52ee9965d09b2c56a027613db30d1fb20d623861 ] When a block group is composed of a sequential write zone and a conventional zone, we recover the (pseudo) write pointer of the conventional zone using the end of the last allocated position. However, if the last extent in a block group is removed, the last extent position will be smaller than the other real write pointer position. Then, that will cause an error due to mismatch of the write pointers. We can fixup this case by moving the alloc_offset to the corresponding write pointer position. Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12btrfs: define the AUTO_KFREE/AUTO_KVFREE helper macrosMiquel Sabaté Solà1-0/+7
[ Upstream commit d00cbce0a7d5de5fc31bf60abd59b44d36806b6e ] These are two simple macros which ensure that a pointer is initialized to NULL and with the proper cleanup attribute for it. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 52ee9965d09b ("btrfs: zoned: fixup last alloc pointer after extent removal for RAID0/10") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12btrfs: fix compat mask in error messages in btrfs_check_features()Mark Harmstone1-3/+3
[ Upstream commit 587bb33b10bda645a1028c1737ad3992b3d7cf61 ] Commit d7f67ac9a928 ("btrfs: relax block-group-tree feature dependency checks") introduced a regression when it comes to handling unsupported incompat or compat_ro flags. Beforehand we only printed the flags that we didn't recognize, afterwards we printed them all, which is less useful. Fix the error handling so it behaves like it used to. Fixes: d7f67ac9a928 ("btrfs: relax block-group-tree feature dependency checks") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12btrfs: print correct subvol num if active swapfile prevents deletionMark Harmstone1-1/+1
[ Upstream commit 1c7e9111f4e6d6d42bc47759c9af1ef91f03ac2c ] Fix the error message in btrfs_delete_subvolume() if we can't delete a subvolume because it has an active swapfile: we were printing the number of the parent rather than the target. Fixes: 60021bd754c6 ("btrfs: prevent subvol with swapfile from being deleted") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12btrfs: fix warning in scrub_verify_one_metadata()Mark Harmstone1-1/+1
[ Upstream commit 44e2fda66427a0442d8d2c0e6443256fb458ab6b ] Commit b471965fdb2d ("btrfs: fix replace/scrub failure with metadata_uuid") fixed the comparison in scrub_verify_one_metadata() to use metadata_uuid rather than fsid, but left the warning as it was. Fix it so it matches what we're doing. Fixes: b471965fdb2d ("btrfs: fix replace/scrub failure with metadata_uuid") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12btrfs: fix objectid value in error message in check_extent_data_ref()Mark Harmstone1-1/+1
[ Upstream commit a10172780526c2002e062102ad4f2aabac495889 ] Fix a copy-paste error in check_extent_data_ref(): we're printing root as in the message above, we should be printing objectid. Fixes: f333a3c7e832 ("btrfs: tree-checker: validate dref root and objectid") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12btrfs: fix incorrect key offset in error message in check_dev_extent_item()Mark Harmstone1-1/+1
[ Upstream commit 511dc8912ae3e929c1a182f5e6b2326516fd42a0 ] Fix the error message in check_dev_extent_item(), when an overlapping stripe is encountered. For dev extents, objectid is the disk number and offset the physical address, so prev_key->objectid should actually be prev_key->offset. (I can't take any credit for this one - this was discovered by Chris and his friend Claude.) Reported-by: Chris Mason <clm@fb.com> Fixes: 008e2512dc56 ("btrfs: tree-checker: add dev extent item checks") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12btrfs: fix error message order of parameters in btrfs_delete_delayed_dir_index()Mark Harmstone1-1/+1
[ Upstream commit 3cf0f35779d364cf2003c617bb7f3f3e41023372 ] Fix the error message in btrfs_delete_delayed_dir_index() if __btrfs_add_delayed_item() fails: the message says root, inode, index, error, but we're actually passing index, root, inode, error. Fixes: adc1ef55dc04 ("btrfs: add details to error messages at btrfs_delete_delayed_dir_index()") Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12btrfs: free pages on error in btrfs_uring_read_extent()Miquel Sabaté Solà1-1/+6
[ Upstream commit 3f501412f2079ca14bf68a18d80a2b7a823f1f64 ] In this function the 'pages' object is never freed in the hopes that it is picked up by btrfs_uring_read_finished() whenever that executes in the future. But that's just the happy path. Along the way previous allocations might have gone wrong, or we might not get -EIOCBQUEUED from btrfs_encoded_read_regular_fill_pages(). In all these cases, we go to a cleanup section that frees all memory allocated by this function without assuming any deferred execution, and this also needs to happen for the 'pages' allocation. Fixes: 34310c442e17 ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)") Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04btrfs: continue trimming remaining devices on failurejinbaohong1-1/+1
[ Upstream commit 912d1c6680bdb40b72b1b9204706f32b6eb842c3 ] Commit 93bba24d4b5a ("btrfs: Enhance btrfs_trim_fs function to handle error better") intended to make device trimming continue even if one device fails, tracking failures and reporting them at the end. However, it used 'break' instead of 'continue', causing the loop to exit on the first device failure. Fix this by replacing 'break' with 'continue'. Fixes: 93bba24d4b5a ("btrfs: Enhance btrfs_trim_fs function to handle error better") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04btrfs: zoned: fixup last alloc pointer after extent removal for DUPNaohiro Aota1-0/+14
[ Upstream commit e2d848649e64de39fc1b9c64002629b4daa1105d ] When a block group is composed of a sequential write zone and a conventional zone, we recover the (pseudo) write pointer of the conventional zone using the end of the last allocated position. However, if the last extent in a block group is removed, the last extent position will be smaller than the other real write pointer position. Then, that will cause an error due to mismatch of the write pointers. We can fixup this case by moving the alloc_offset to the corresponding write pointer position. Fixes: c0d90a79e8e6 ("btrfs: zoned: fix alloc_offset calculation for partly conventional block groups") CC: stable@vger.kernel.org # 6.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04btrfs: zoned: fixup last alloc pointer after extent removal for RAID1Naohiro Aota1-0/+15
[ Upstream commit dda3ec9ee6b3e120603bff1b798f25b51e54ac5d ] When a block group is composed of a sequential write zone and a conventional zone, we recover the (pseudo) write pointer of the conventional zone using the end of the last allocated position. However, if the last extent in a block group is removed, the last extent position will be smaller than the other real write pointer position. Then, that will cause an error due to mismatch of the write pointers. We can fixup this case by moving the alloc_offset to the corresponding write pointer position. Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04btrfs: fix periodic reclaim conditionSun YangKai2-11/+16
[ Upstream commit 19eff93dc738e8afaa59cb374b44bb5a162e6c2d ] Problems with current implementation: 1. reclaimable_bytes is signed while chunk_sz is unsigned, causing negative reclaimable_bytes to trigger reclaim unexpectedly 2. The "space must be freed between scans" assumption breaks the two-scan requirement: first scan marks block groups, second scan reclaims them. Without the second scan, no reclamation occurs. Instead, track actual reclaim progress: pause reclaim when block groups will be reclaimed, and resume only when progress is made. This ensures reclaim continues until no further progress can be made. And resume periodic reclaim when there's enough free space. And we take care if reclaim is making any progress now, so it's unnecessary to set periodic_reclaim_ready to false when failed to reclaim a block group. Fixes: 813d4c6422516 ("btrfs: prevent pathological periodic reclaim loops") CC: stable@vger.kernel.org # 6.12+ Suggested-by: Boris Burkov <boris@bur.io> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04btrfs: replace BUG() with error handling in __btrfs_balance()Adarsh Das1-2/+8
[ Upstream commit be6324a809dbda76d5fdb23720ad9b20e5c1905c ] We search with offset (u64)-1 which should never match exactly. Previously this was handled with BUG(). Now logs an error and return -EUCLEAN. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Adarsh Das <adarshdas950@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04btrfs: handle user interrupt properly in btrfs_trim_fs()jinbaohong1-0/+11
[ Upstream commit bfb670b9183b0e4ba660aff2e396ec1cc01d0761 ] When a fatal signal is pending or the process is freezing, btrfs_trim_block_group() and btrfs_trim_free_extents() return -ERESTARTSYS. Currently this is treated as a regular error: the loops continue to the next iteration and count it as a block group or device failure. Instead, break out of the loops immediately and return -ERESTARTSYS to userspace without counting it as a failure. Also skip the device loop entirely if the block group loop was interrupted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04btrfs: fallback to buffered IO if the data profile has duplicationQu Wenruo1-0/+12
[ Upstream commit 7c2830f00c3e086292c1ee9f27b61efaf8e76c9a ] [BACKGROUND] Inspired by a recent kernel bug report, which is related to direct IO buffer modification during writeback, that leads to contents mismatch of different RAID1 mirrors. [CAUSE AND PROBLEMS] The root cause is exactly the same explained in commit 968f19c5b1b7 ("btrfs: always fallback to buffered write if the inode requires checksum"), that we can not trust direct IO buffer which can be modified halfway during writeback. Unlike data checksum verification, if this happened on inodes without data checksum but has the data has extra mirrors, it will lead to stealth data mismatch on different mirrors. This will be way harder to detect without data checksum. Furthermore for RAID56, we can even have data without checksum and data with checksum mixed inside the same full stripe. In that case if the direct IO buffer got changed halfway for the nodatasum part, the data with checksum immediately lost its ability to recover, e.g.: " " = Good old data or parity calculated using good old data "X" = Data modified during writeback 0 32K 64K Data 1 | | Has csum Data 2 |XXXXXXXXXXXXXXXX | No csum Parity | | In above case, the parity is calculated using data 1 (has csum, from page cache, won't change during writeback), and old data 2 (has no csum, direct IO write). After parity is calculated, but before submission to the storage, direct IO buffer of data 2 is modified, causing the range [0, 32K) of data 2 has a different content. Now all data is submitted to the storage, and the fs got fully synced. Then the device of data 1 is lost, has to be rebuilt from data 2 and parity. But since the data 2 has some modified data, and the parity is calculated using old data, the recovered data is no the same for data 1, causing data checksum mismatch. [FIX] Fix the problem by checking the data allocation profile. If our data allocation profile is either RAID0 or SINGLE, we can allow true zero-copy direct IO and the end user is fully responsible for any race. However this is not going to fix all situations, as it's still possible to race with balance where the fs got a new data profile after the data allocation profile check. But this fix should still greatly reduce the window of the original bug. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171 Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: fix invalid leaf access in btrfs_quota_enable() if ref key not foundFilipe Manana1-4/+7
[ Upstream commit ecb7c2484cfc83a93658907580035a8adf1e0a92 ] If btrfs_search_slot_for_read() returns 1, it means we did not find any key greater than or equals to the key we asked for, meaning we have reached the end of the tree and therefore the path is not valid. If this happens we need to break out of the loop and stop, instead of continuing and accessing an invalid path. Fixes: 5223cc60b40a ("btrfs: drop the path before adding qgroup items when enabling qgroups") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: use the correct type to initialize block reserve for delayed refsFilipe Manana2-4/+5
[ Upstream commit 2155d0c0a761a56ce7ede83a26eb23ea0f935260 ] When initializing the delayed refs block reserve for a transaction handle we are passing a type of BTRFS_BLOCK_RSV_DELOPS, which is meant for delayed items and not for delayed refs. The correct type for delayed refs is BTRFS_BLOCK_RSV_DELREFS. On release of any excess space reserved in a local delayed refs reserve, we also should transfer that excess space to the global block reserve (it it's full, we return to the space info for general availability). By initializing a transaction's local delayed refs block reserve with a type of BTRFS_BLOCK_RSV_DELOPS, we were also causing any excess space released from the delayed block reserve (fs_info->delayed_block_rsv, used for delayed inodes and items) to be transferred to the global block reserve instead of the global delayed refs block reserve. This was an unintentional change in commit 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction"), but it's not particularly serious as things tend to cancel out each other most of the time and it's relatively rare to be anywhere near exhaustion of the global reserve. Fix this by initializing a transaction's local delayed refs reserve with a type of BTRFS_BLOCK_RSV_DELREFS and making btrfs_block_rsv_release() attempt to transfer unused space from such a reserve into the global block reserve, just as we did before that commit for when the block reserve is a delayed refs rsv. Reported-by: Alex Lyakas <alex.lyakas@zadara.com> Link: https://lore.kernel.org/linux-btrfs/CAOcd+r0FHG5LWzTSu=LknwSoqxfw+C00gFAW7fuX71+Z5AfEew@mail.gmail.com/ Fixes: 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction") Reviewed-by: Alex Lyakas <alex.lyakas@zadara.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: reset block group size class when it becomes emptyJiasheng Jiang1-0/+10
[ Upstream commit 5870ec7c8fe57a8b2c65005e5da5efc054faa3e6 ] Block group size classes are managed consistently everywhere. Currently, btrfs_use_block_group_size_class() sets a block group's size class to specialize it for a specific allocation size. However, this size class remains "stale" even if the block group becomes completely empty (both used and reserved bytes reach zero). This happens in two scenarios: 1. When space reservations are freed (e.g., due to errors or transaction aborts) via btrfs_free_reserved_bytes(). 2. When the last extent in a block group is freed via btrfs_update_block_group(). While size classes are advisory, a stale size class can cause find_free_extent to unnecessarily skip candidate block groups during initial search loops. This undermines the purpose of size classes to reduce fragmentation by keeping block groups restricted to a specific size class when they could be reused for any size. Fix this by resetting the size class to BTRFS_BG_SZ_NONE whenever a block group's used and reserved counts both reach zero. This ensures that empty block groups are fully available for any allocation size in the next cycle. Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: reduce block group critical section in btrfs_free_reserved_bytes()Filipe Manana1-6/+9
[ Upstream commit 8b6fa164ab59f9e3f24e627fe09a0234783e7a8b ] There's no need to update the space_info fields (bytes_reserved, max_extent_size, bytes_readonly, bytes_zone_unusable) while holding the block group's spinlock. So move those updates to happen after we unlock the block group (and while holding the space_info locked of course), so that all we do under the block group's critical section is to update the block group itself. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 5870ec7c8fe5 ("btrfs: reset block group size class when it becomes empty") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: remove fs_info argument from btrfs_try_granting_tickets()Filipe Manana4-13/+12
[ Upstream commit e3df6408b13a75cf73e543e53453f28261874c6f ] We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 5870ec7c8fe5 ("btrfs: reset block group size class when it becomes empty") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocationBoris Burkov1-60/+183
[ Upstream commit b14c5e04bd0f722ed631845599d52d03fcae1bc1 ] I have been observing a number of systems aborting at insert_dev_extents() in btrfs_create_pending_block_groups(). The following is a sample stack trace of such an abort coming from forced chunk allocation (typically behind CONFIG_BTRFS_EXPERIMENTAL) but this can theoretically happen to any DUP chunk allocation. [81.801] ------------[ cut here ]------------ [81.801] BTRFS: Transaction aborted (error -17) [81.801] WARNING: fs/btrfs/block-group.c:2876 at btrfs_create_pending_block_groups+0x721/0x770 [btrfs], CPU#1: bash/319 [81.802] Modules linked in: virtio_net btrfs xor zstd_compress raid6_pq null_blk [81.803] CPU: 1 UID: 0 PID: 319 Comm: bash Kdump: loaded Not tainted 6.19.0-rc6+ #319 NONE [81.803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 [81.804] RIP: 0010:btrfs_create_pending_block_groups+0x723/0x770 [btrfs] [81.806] RSP: 0018:ffffa36241a6bce8 EFLAGS: 00010282 [81.806] RAX: 000000000000000d RBX: ffff8e699921e400 RCX: 0000000000000000 [81.807] RDX: 0000000002040001 RSI: 00000000ffffffef RDI: ffffffffc0608bf0 [81.807] RBP: 00000000ffffffef R08: ffff8e69830f6000 R09: 0000000000000007 [81.808] R10: ffff8e699921e5e8 R11: 0000000000000000 R12: ffff8e6999228000 [81.808] R13: ffff8e6984d82000 R14: ffff8e69966a69c0 R15: ffff8e69aa47b000 [81.809] FS: 00007fec6bdd9740(0000) GS:ffff8e6b1b379000(0000) knlGS:0000000000000000 [81.809] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [81.810] CR2: 00005604833670f0 CR3: 0000000116679000 CR4: 00000000000006f0 [81.810] Call Trace: [81.810] <TASK> [81.810] __btrfs_end_transaction+0x3e/0x2b0 [btrfs] [81.811] btrfs_force_chunk_alloc_store+0xcd/0x140 [btrfs] [81.811] kernfs_fop_write_iter+0x15f/0x240 [81.812] vfs_write+0x264/0x500 [81.812] ksys_write+0x6c/0xe0 [81.812] do_syscall_64+0x66/0x770 [81.812] entry_SYSCALL_64_after_hwframe+0x76/0x7e [81.813] RIP: 0033:0x7fec6be66197 [81.814] RSP: 002b:00007fffb159dd30 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [81.815] RAX: ffffffffffffffda RBX: 00007fec6bdd9740 RCX: 00007fec6be66197 [81.815] RDX: 0000000000000002 RSI: 0000560483374f80 RDI: 0000000000000001 [81.816] RBP: 0000560483374f80 R08: 0000000000000000 R09: 0000000000000000 [81.816] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002 [81.817] R13: 00007fec6bfb85c0 R14: 00007fec6bfb5ee0 R15: 00005604833729c0 [81.817] </TASK> [81.817] irq event stamp: 20039 [81.818] hardirqs last enabled at (20047): [<ffffffff99a68302>] __up_console_sem+0x52/0x60 [81.818] hardirqs last disabled at (20056): [<ffffffff99a682e7>] __up_console_sem+0x37/0x60 [81.819] softirqs last enabled at (19470): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0 [81.819] softirqs last disabled at (19463): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0 [81.820] ---[ end trace 0000000000000000 ]--- [81.820] BTRFS: error (device dm-7 state A) in btrfs_create_pending_block_groups:2876: errno=-17 Object already exists Inspecting these aborts with drgn, I observed a pattern of overlapping chunk_maps. Note how stripe 1 of the first chunk overlaps in physical address with stripe 0 of the second chunk. Physical Start Physical End Length Logical Type Stripe ---------------------------------------------------------------------------------------------------- 0x0000000102500000 0x0000000142500000 1.0G 0x0000000641d00000 META|DUP 0/2 0x0000000142500000 0x0000000182500000 1.0G 0x0000000641d00000 META|DUP 1/2 0x0000000142500000 0x0000000182500000 1.0G 0x0000000601d00000 META|DUP 0/2 0x0000000182500000 0x00000001c2500000 1.0G 0x0000000601d00000 META|DUP 1/2 Now how could this possibly happen? All chunk allocation is protected by the chunk_mutex so racing allocations should see a consistent view of the CHUNK_ALLOCATED bit in the chunk allocation extent-io-tree (device->alloc_state as set by chunk_map_device_set_bits()) The tree itself is protected by a spin lock, and clearing/setting the bits is always protected by fs_info->mapping_tree_lock, so no race is apparent. It turns out that there is a subtle bug in the logic regarding chunk allocations that have happened in the current transaction, known as "pending extents". The chunk allocation as defined in find_free_dev_extent() is a loop which searches the commit root of the dev_root and looks for gaps between DEV_EXTENT items. For those gaps, it then checks alloc_state bitmap for any pending extents and adjusts the hole that it finds accordingly. However, the logic in that adjustment assumes that the first pending extent is the only one in that range. e.g., given a layout with two non-consecutive pending extents in a hole passed to dev_extent_hole_check() via *hole_start and *hole_size: |----pending A----| real hole |----pending B----| | candidate hole | *hole_start *hole_start + *hole_size the code incorrectly returns a "hole" from the end of pending extent A until the passed in hole end, failing to account for pending B. However, it is not entirely obvious that it is actually possible to produce such a layout. I was able to reproduce it, but with some contortions: I continued to use the force chunk allocation sysfs file and I introduced a long delay (10 seconds) into the start of the cleaner thread. I also prevented the unused bgs cleaning logic from ever deleting metadata bgs. These help make it easier to deterministically produce the condition but shouldn't really matter if you imagine the conditions happening by race/luck. Allocations/frees can happen concurrently with the cleaner thread preparing to process an unused extent and both create some used chunks with an unused chunk interleaved, all during one transaction. Then btrfs_delete_unused_bgs() sees the unused one and clears it, leaving a range with several pending chunk allocations and a gap in the middle. The basic idea is that the unused_bgs cleanup work happens on a worker so if we allocate 3 block groups in one transaction, then the cleaner work kicked off by the previous transaction comes through and deletes the middle one of the 3, then the commit root shows no dev extents and we have the bad pattern in the extent-io-tree. One final consideration is that the code happens to loop to the next hole if there are no more extents at all, so we need one more dev extent way past the area we are working in. Something like the following demonstrates the technique: # push the BG frontier out to 20G fallocate -l 20G $mnt/foo # allocate one more that will prevent the "no more dev extents" luck fallocate -l 1G $mnt/sticky # sync sync # clear out the allocation area rm $mnt/foo sync _cleaner # let everything quiesce sleep 20 sync # dev tree should have one bg 20G out and the rest at the beginning.. # sort of like an empty FS but with a random sticky chunk. # kick off the cleaner in the background, remember it will sleep 10s # before doing interesting work _cleaner & sleep 3 # create 3 trivial block groups, all empty, all immediately marked as unused. echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" echo 1 > "$(_btrfs_sysfs_space_info $dev data)/force_chunk_alloc" echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" # let the cleaner thread definitely finish, it will remove the data bg sleep 10 # this allocation sees the non-consecutive pending metadata chunks with # data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC! echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" As for the fix, it is not that obvious. I could not see a trivial way to do it even by adding backup loops into find_free_dev_extent(), so I opted to change the semantics of dev_extent_hole_check() to not stop looping until it finds a sufficiently big hole. For clarity, this also required changing the helper function contains_pending_extent() into two new helpers which find the first pending extent and the first suitable hole in a range. I attempted to clean up the documentation and range calculations to be as consistent and clear as possible for the future. I also looked at the zoned case and concluded that the loop there is different and not to be unified with this one. As far as I can tell, the zoned check will only further constrain the hole so looping back to find more holes is acceptable. Though given that zoned really only appends, I find it highly unlikely that it is susceptible to this bug. Fixes: 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole") Reported-by: Dimitrios Apostolou <jimis@gmx.net> Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: fix block_group_tree dirty_list corruptionBoris Burkov1-7/+0
[ Upstream commit 3a1f4264daed4b419c325a7fe35e756cada3cf82 ] When the incompat flag EXTENT_TREE_V2 is set, we unconditionally add the block group tree to the switch_commits list before calling switch_commit_roots, as we do for the tree root and the chunk root. However, the block group tree uses normal root dirty tracking and in any transaction that does an allocation and dirties a block group, the block group root will already be linked to a list by the dirty_list field and this use of list_add_tail() is invalid and corrupts the prev/next members of block_group_root->dirty_list. This is apparent on a subsequent list_del on the prev if we enable CONFIG_DEBUG_LIST: [32.1571] ------------[ cut here ]------------ [32.1572] list_del corruption. next->prev should beffff958890202538, but was ffff9588992bd538. (next=ffff958890201538) [32.1575] WARNING: lib/list_debug.c:65 at 0x0, CPU#3: sync/607 [32.1583] CPU: 3 UID: 0 PID: 607 Comm: sync Not tainted 6.18.0 #24PREEMPT(none) [32.1585] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS1.17.0-4.fc41 04/01/2014 [32.1587] RIP: 0010:__list_del_entry_valid_or_report+0x108/0x120 [32.1593] RSP: 0018:ffffaa288287fdd0 EFLAGS: 00010202 [32.1594] RAX: 0000000000000001 RBX: ffff95889326e800 RCX:ffff958890201538 [32.1596] RDX: ffff9588992bd538 RSI: ffff958890202538 RDI:ffffffff82a41e00 [32.1597] RBP: ffff958890202538 R08: ffffffff828fc1e8 R09:00000000ffffefff [32.1599] R10: ffffffff8288c200 R11: ffffffff828e4200 R12:ffff958890201538 [32.1601] R13: ffff95889326e958 R14: ffff958895c24000 R15:ffff958890202538 [32.1603] FS: 00007f0c28eb5740(0000) GS:ffff958af2bd2000(0000)knlGS:0000000000000000 [32.1605] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [32.1607] CR2: 00007f0c28e8a3cc CR3: 0000000109942005 CR4:0000000000370ef0 [32.1609] Call Trace: [32.1610] <TASK> [32.1611] switch_commit_roots+0x82/0x1d0 [btrfs] [32.1615] btrfs_commit_transaction+0x968/0x1550 [btrfs] [32.1618] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs] [32.1621] __iterate_supers+0xe8/0x190 [32.1622] ? __pfx_sync_fs_one_sb+0x10/0x10 [32.1623] ksys_sync+0x63/0xb0 [32.1624] __do_sys_sync+0xe/0x20 [32.1625] do_syscall_64+0x73/0x450 [32.1626] entry_SYSCALL_64_after_hwframe+0x76/0x7e [32.1627] RIP: 0033:0x7f0c28d05d2b [32.1632] RSP: 002b:00007ffc9d988048 EFLAGS: 00000246 ORIG_RAX:00000000000000a2 [32.1634] RAX: ffffffffffffffda RBX: 00007ffc9d988228 RCX:00007f0c28d05d2b [32.1636] RDX: 00007f0c28e02301 RSI: 00007ffc9d989b21 RDI:00007f0c28dba90d [32.1637] RBP: 0000000000000001 R08: 0000000000000001 R09:0000000000000000 [32.1639] R10: 0000000000000000 R11: 0000000000000246 R12:000055b96572cb80 [32.1641] R13: 000055b96572b19f R14: 00007f0c28dfa434 R15:000055b96572b034 [32.1643] </TASK> [32.1644] irq event stamp: 0 [32.1644] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [32.1646] hardirqs last disabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260 [32.1648] softirqs last enabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260 [32.1650] softirqs last disabled at (0): [<0000000000000000>] 0x0 [32.1652] ---[ end trace 0000000000000000 ]--- Furthermore, this list corruption eventually (when we happen to add a new block group) results in getting the switch_commits and dirty_cowonly_roots lists mixed up and attempting to call update_root on the tree root which can't be found in the tree root, resulting in a transaction abort: [87.8269] BTRFS critical (device nvme1n1): unable to find root key (1 0 0) in tree 1 [87.8272] ------------[ cut here ]------------ [87.8274] BTRFS: Transaction aborted (error -117) [87.8275] WARNING: fs/btrfs/root-tree.c:153 at 0x0, CPU#4: sync/703 [87.8285] CPU: 4 UID: 0 PID: 703 Comm: sync Not tainted 6.18.0 #25 PREEMPT(none) [87.8287] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-4.fc41 04/01/2014 [87.8289] RIP: 0010:btrfs_update_root+0x296/0x790 [btrfs] [87.8295] RSP: 0018:ffffa58d035dfd60 EFLAGS: 00010282 [87.8297] RAX: ffff9a59126ddb68 RBX: ffff9a59126dc000 RCX: 0000000000000000 [87.8299] RDX: 0000000000000000 RSI: 00000000ffffff8b RDI: ffffffffc0b28270 [87.8301] RBP: ffff9a5904aec000 R08: 0000000000000000 R09: 00000000ffffefff [87.8303] R10: ffffffff9ac8c200 R11: ffffffff9ace4200 R12: 0000000000000001 [87.8305] R13: ffff9a59041740e8 R14: ffff9a5904aec1f7 R15: ffff9a590fdefaf0 [87.8307] FS: 00007f54cde6b740(0000) GS:ffff9a5b5a81c000(0000) knlGS:0000000000000000 [87.8309] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [87.8310] CR2: 00007f54cde403cc CR3: 0000000112902004 CR4: 0000000000370ef0 [87.8312] Call Trace: [87.8313] <TASK> [87.8314] ? _raw_spin_unlock+0x23/0x40 [87.8315] commit_cowonly_roots+0x1ad/0x250 [btrfs] [87.8317] ? btrfs_commit_transaction+0x79b/0x1560 [btrfs] [87.8320] btrfs_commit_transaction+0x8aa/0x1560 [btrfs] [87.8322] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs] [87.8325] __iterate_supers+0xf1/0x170 [87.8326] ? __pfx_sync_fs_one_sb+0x10/0x10 [87.8327] ksys_sync+0x63/0xb0 [87.8328] __do_sys_sync+0xe/0x20 [87.8329] do_syscall_64+0x73/0x450 [87.8330] entry_SYSCALL_64_after_hwframe+0x76/0x7e [87.8331] RIP: 0033:0x7f54cdd05d2b [87.8336] RSP: 002b:00007fff1b58ff78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 [87.8338] RAX: ffffffffffffffda RBX: 00007fff1b590158 RCX: 00007f54cdd05d2b [87.8340] RDX: 00007f54cde02301 RSI: 00007fff1b592b66 RDI: 00007f54cddba90d [87.8342] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 [87.8344] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e07ca96b80 [87.8346] R13: 000055e07ca9519f R14: 00007f54cddfa434 R15: 000055e07ca95034 [87.8348] </TASK> [87.8348] irq event stamp: 0 [87.8349] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [87.8351] hardirqs last disabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0 [87.8353] softirqs last enabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0 [87.8355] softirqs last disabled at (0): [<0000000000000000>] 0x0 [87.8357] ---[ end trace 0000000000000000 ]--- [87.8358] BTRFS: error (device nvme1n1 state A) in btrfs_update_root:153: errno=-117 Filesystem corrupted [87.8360] BTRFS info (device nvme1n1 state EA): forced readonly [87.8362] BTRFS warning (device nvme1n1 state EA): Skipping commit of aborted transaction. [87.8364] BTRFS: error (device nvme1n1 state EA) in cleanup_transaction:2037: errno=-117 Filesystem corrupted Since the block group tree was pulled out of the extent tree and uses normal root dirty tracking, remove the offending extra list_add. This fixes the list corruption and the resulting fs corruption. Fixes: 14033b08a029 ("btrfs: don't save block group root into super block") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: qgroup: return correct error when deleting qgroup relation itemFilipe Manana1-1/+3
[ Upstream commit 51b1fcf71c88c3c89e7dcf07869c5de837b1f428 ] If we fail to delete the second qgroup relation item, we end up returning success or -ENOENT in case the first item does not exist, instead of returning the error from the second item deletion. Fixes: 73798c465b66 ("btrfs: qgroup: Try our best to delete qgroup relations") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: zoned: don't zone append to conventional zoneJohannes Thumshirn2-10/+12
[ Upstream commit b39b26e017c7889181cb84032e22bef72e81cf29 ] In case of a zoned RAID, it can happen that a data write is targeting a sequential write required zone and a conventional zone. In this case the bio will be marked as REQ_OP_ZONE_APPEND but for the conventional zone, this needs to be REQ_OP_WRITE. The setting of REQ_OP_ZONE_APPEND is deferred to the last possible time in btrfs_submit_dev_bio(), but the decision if we can use zone append is cached in btrfs_bio. CC: Naohiro Aota <naohiro.aota@wdc.com> Fixes: e9b9b911e03c ("btrfs: add raid stripe tree to features enabled with debug config") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: introduce btrfs_bio::async_csumQu Wenruo4-23/+67
[ Upstream commit dd57c78aec398717a2fa6488d87b1a6cd43c7d0d ] [ENHANCEMENT] Btrfs currently calculates data checksums then submits the bio. But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write if the inode requires checksum"), any writes with data checksum will fallback to buffered IO, meaning the content will not change during writeback. This means we're safe to calculate the data checksum and submit the bio in parallel, and only need the following new behavior: - Wait the csum generation to finish before calling btrfs_bio::end_io() Or this can lead to use-after-free for the csum generation worker. - Save the current bi_iter for csum_one_bio() As the submission part can advance btrfs_bio::bio.bi_iter, if not saved csum_one_bio() may got an empty bi_iter and do not generate any checksum. Unfortunately this means we have to increase the size of btrfs_bio for 16 bytes, but this is still acceptable. As usual, such new feature is hidden behind the experimental flag. [THEORETIC ANALYZE] Consider the following theoretic hardware performance, which should be more or less close to modern mainstream hardware: Memory bandwidth: 50GiB/s CRC32C bandwidth: 45GiB/s SSD bandwidth: 8GiB/s Then write bandwidth with data checksum before the patch is: 1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s After the patch, the bandwidth is: 1 / ( 1 / 50 + max( 1 / 45 + 1 / 8)) = 6.90 GiB/s The difference is 15.32% improvement. [REAL WORLD BENCHMARK] I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the storage is backed by a PCIe gen3 x4 NVMe. The test is a direct IO write, with 1MiB block size, write 7GiB data into a btrfs mount with data checksum. Thus the direct write will fallback to buffered one: Vanilla Datasum: 1619.97 GiB/s Patched Datasum: 1792.26 GiB/s Diff +10.6 % In my case, the bottleneck is the storage, thus the improvement is not reaching the theoretic one, but still some observable improvement. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b39b26e017c7 ("btrfs: zoned: don't zone append to conventional zone") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: make sure all btrfs_bio::end_io are called in task contextQu Wenruo1-18/+46
[ Upstream commit 4591c3ef751d861d7dd95ff4d2aadb1b5e95854e ] [BACKGROUND] Btrfs has a lot of different bi_end_io functions, to handle different raid profiles. But they introduced a lot of different contexts for btrfs_bio::end_io() calls: - Simple read bios Run in task context, backed by either endio_meta_workers or endio_workers. - Simple write bios Run in IRQ context. - RAID56 write or rebuild bios Run in task context, backed by rmw_workers. - Mirrored write bios Run in irq context. This is inconsistent, and contributes to the number of workqueues used in btrfs. [ENHANCEMENT] Make all the above bios call their btrfs_bio::end_io() in task context, backed by either endio_meta_workers for metadata, or endio_workers for data. For simple write bios, merge the handling into simple_end_io_work(). For mirrored write bios, it will be a little more complex, since both the original or the cloned bios can run the final btrfs_bio::end_io(). Here we make sure the cloned bios are using btrfs_bioset, to reuse the end_io_work, and run both original and cloned work inside the workqueue. Add extra ASSERT()s to make sure btrfs_bio_end_io() is running in task context. This not only unifies the context for btrfs_bio::end_io() functions, but also opens a new door for further btrfs_bio::end_io() related cleanups. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b39b26e017c7 ("btrfs: zoned: don't zone append to conventional zone") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inodeQu Wenruo9-81/+87
[ Upstream commit 81cea6cd7041ebd42281e0517f856d88527d3326 ] Currently there is only one caller which doesn't populate btrfs_bio::inode, and that's scrub. The idea is scrub doesn't want any automatic csum verification nor read-repair, as everything will be handled by scrub itself. However that behavior is really no different than metadata inode, thus we can reuse btree_inode as btrfs_bio::inode for scrub. The only exception is in btrfs_submit_chunk() where if a bbio is from scrub or data reloc inode, we set rst_search_commit_root to true. This means we still need a way to distinguish scrub from metadata, but that can be done by a new flag inside btrfs_bio. Now btrfs_bio::inode is a mandatory parameter, we can extract fs_info from that inode thus can remove btrfs_bio::fs_info to save 8 bytes from btrfs_bio structure. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b39b26e017c7 ("btrfs: zoned: don't zone append to conventional zone") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27btrfs: headers cleanup to remove unnecessary local includesQu Wenruo21-22/+21
[ Upstream commit c5667f9c8eb90293dfa4e52c65eb89fe39f5652d ] [BUG] When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to grab the fs_info, the header "btrfs_inode.h" is needed to access the full btrfs_inode structure. Then btrfs will fail to compile. [CAUSE] There is a recursive including chain: "bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" -> "bio.h" That recursive including is causing problems for btrfs. [ENHANCEMENT] To reduce the risk of recursive including: - Remove unnecessary local includes from btrfs headers Either the included header is pulled in by other headers, or is completely unnecessary. - Remove btrfs local includes if the header only requires a pointer In that case let the implementing C file to pull the required header. This is especially important for headers like "btrfs_inode.h" which pulls in a lot of other btrfs headers, thus it's a mine field of recursive including. - Remove unnecessary temporary structure definition Either if we have included the header defining the structure, or completely unused. Now including "btrfs_inode.h" inside "bio.h" is completely fine, although "btrfs_inode.h" still includes "extent_map.h", but that header only includes "fs.h", no more chain back to "bio.h". Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: b39b26e017c7 ("btrfs: zoned: don't zone append to conventional zone") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11btrfs: reject new transactions if the fs is fully read-onlyQu Wenruo2-0/+21
[ Upstream commit 1972f44c189c8aacde308fa9284e474c1a5cbd9f ] [BUG] There is a bug report where a heavily fuzzed fs is mounted with all rescue mount options, which leads to the following warnings during unmount: BTRFS: Transaction aborted (error -22) Modules linked in: CPU: 0 UID: 0 PID: 9758 Comm: repro.out Not tainted 6.19.0-rc5-00002-gb71e635feefc #7 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:find_free_extent_update_loop fs/btrfs/extent-tree.c:4208 [inline] RIP: 0010:find_free_extent+0x52f0/0x5d20 fs/btrfs/extent-tree.c:4611 Call Trace: <TASK> btrfs_reserve_extent+0x2cd/0x790 fs/btrfs/extent-tree.c:4705 btrfs_alloc_tree_block+0x1e1/0x10e0 fs/btrfs/extent-tree.c:5157 btrfs_force_cow_block+0x578/0x2410 fs/btrfs/ctree.c:517 btrfs_cow_block+0x3c4/0xa80 fs/btrfs/ctree.c:708 btrfs_search_slot+0xcad/0x2b50 fs/btrfs/ctree.c:2130 btrfs_truncate_inode_items+0x45d/0x2350 fs/btrfs/inode-item.c:499 btrfs_evict_inode+0x923/0xe70 fs/btrfs/inode.c:5628 evict+0x5f4/0xae0 fs/inode.c:837 __dentry_kill+0x209/0x660 fs/dcache.c:670 finish_dput+0xc9/0x480 fs/dcache.c:879 shrink_dcache_for_umount+0xa0/0x170 fs/dcache.c:1661 generic_shutdown_super+0x67/0x2c0 fs/super.c:621 kill_anon_super+0x3b/0x70 fs/super.c:1289 btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2127 deactivate_locked_super+0xbc/0x130 fs/super.c:474 cleanup_mnt+0x425/0x4c0 fs/namespace.c:1318 task_work_run+0x1d4/0x260 kernel/task_work.c:233 exit_task_work include/linux/task_work.h:40 [inline] do_exit+0x694/0x22f0 kernel/exit.c:971 do_group_exit+0x21c/0x2d0 kernel/exit.c:1112 __do_sys_exit_group kernel/exit.c:1123 [inline] __se_sys_exit_group kernel/exit.c:1121 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1121 x64_sys_call+0x2210/0x2210 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe8/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x44f639 Code: Unable to access opcode bytes at 0x44f60f. RSP: 002b:00007ffc15c4e088 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00000000004c32f0 RCX: 000000000044f639 RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001 RBP: 0000000000000001 R08: ffffffffffffffc0 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004c32f0 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 </TASK> Since rescue mount options will mark the full fs read-only, there should be no new transaction triggered. But during unmount we will evict all inodes, which can trigger a new transaction, and triggers warnings on a heavily corrupted fs. [CAUSE] Btrfs allows new transaction even on a read-only fs, this is to allow log replay happen even on read-only mounts, just like what ext4/xfs do. However with rescue mount options, the fs is fully read-only and cannot be remounted read-write, thus in that case we should also reject any new transactions. [FIX] If we find the fs has rescue mount options, we should treat the fs as error, so that no new transaction can be started. Reported-by: Jiaming Zhang <r772577952@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CANypQFYw8Nt8stgbhoycFojOoUmt+BoZ-z8WJOZVxcogDdwm=Q@mail.gmail.com/ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11btrfs: sync read disk super and set block sizeEdward Adam Davis1-0/+2
[ Upstream commit 3f29d661e5686f3aa14e6f11537ff5c49846f2e2 ] When the user performs a btrfs mount, the block device is not set correctly. The user sets the block size of the block device to 0x4000 by executing the BLKBSZSET command. Since the block size change also changes the mapping->flags value, this further affects the result of the mapping_min_folio_order() calculation. Let's analyze the following two scenarios: Scenario 1: Without executing the BLKBSZSET command, the block size is 0x1000, and mapping_min_folio_order() returns 0; Scenario 2: After executing the BLKBSZSET command, the block size is 0x4000, and mapping_min_folio_order() returns 2. do_read_cache_folio() allocates a folio before the BLKBSZSET command is executed. This results in the allocated folio having an order value of 0. Later, after BLKBSZSET is executed, the block size increases to 0x4000, and the mapping_min_folio_order() calculation result becomes 2. This leads to two undesirable consequences: 1. filemap_add_folio() triggers a VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping)) assertion. 2. The syzbot report [1] shows a null pointer dereference in create_empty_buffers() due to a buffer head allocation failure. Synchronization should be established based on the inode between the BLKBSZSET command and read cache page to prevent inconsistencies in block size or mapping flags before and after folio allocation. [1] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:create_empty_buffers+0x4d/0x480 fs/buffer.c:1694 Call Trace: folio_create_buffers+0x109/0x150 fs/buffer.c:1802 block_read_full_folio+0x14c/0x850 fs/buffer.c:2403 filemap_read_folio+0xc8/0x2a0 mm/filemap.c:2496 do_read_cache_folio+0x266/0x5c0 mm/filemap.c:4096 do_read_cache_page mm/filemap.c:4162 [inline] read_cache_page_gfp+0x29/0x120 mm/filemap.c:4195 btrfs_read_disk_super+0x192/0x500 fs/btrfs/volumes.c:1367 Reported-by: syzbot+b4a2af3000eaa84d95d5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b4a2af3000eaa84d95d5 Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11btrfs: fix Wmaybe-uninitialized warning in replay_one_buffer()Qiang Ma1-1/+1
[ Upstream commit 9c7e71c97c8cd086b148d0d3d1cd84a1deab023c ] Warning was found when compiling using loongarch64-gcc 12.3.1: $ make CFLAGS_tree-log.o=-Wmaybe-uninitialized In file included from fs/btrfs/ctree.h:21, from fs/btrfs/tree-log.c:12: fs/btrfs/accessors.h: In function 'replay_one_buffer': fs/btrfs/accessors.h:66:16: warning: 'inode_item' may be used uninitialized [-Wmaybe-uninitialized] 66 | return btrfs_get_##bits(eb, s, offsetof(type, member)); \ | ^~~~~~~~~~ fs/btrfs/tree-log.c:2803:42: note: 'inode_item' declared here 2803 | struct btrfs_inode_item *inode_item; | ^~~~~~~~~~ Initialize the inode_item to NULL, the compiler does not seem to see the relation between the first 'wc->log_key.type == BTRFS_INODE_ITEM_KEY' check and the other one that also checks the replay phase. Signed-off-by: Qiang Ma <maqianga@uniontech.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11btrfs: fix reservation leak in some error paths when inserting inline extentFilipe Manana1-6/+10
[ Upstream commit c1c050f92d8f6aac4e17f7f2230160794fceef0c ] If we fail to allocate a path or join a transaction, we return from __cow_file_range_inline() without freeing the reserved qgroup data, resulting in a leak. Fix this by ensuring we call btrfs_qgroup_free_data() in such cases. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11btrfs: do not free data reservation in fallback from inline due to -ENOSPCFilipe Manana1-1/+5
[ Upstream commit f8da41de0bff9eb1d774a7253da0c9f637c4470a ] If we fail to create an inline extent due to -ENOSPC, we will attempt to go through the normal COW path, reserve an extent, create an ordered extent, etc. However we were always freeing the reserved qgroup data, which is wrong since we will use data. Fix this by freeing the reserved qgroup data in __cow_file_range_inline() only if we are not doing the fallback (ret is <= 0). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-06btrfs: do not strictly require dirty metadata threshold for metadata writepagesQu Wenruo3-26/+2
commit 4e159150a9a56d66d247f4b5510bed46fe58aa1c upstream. [BUG] There is an internal report that over 1000 processes are waiting at the io_schedule_timeout() of balance_dirty_pages(), causing a system hang and trigger a kernel coredump. The kernel is v6.4 kernel based, but the root problem still applies to any upstream kernel before v6.18. [CAUSE] From Jan Kara for his wisdom on the dirty page balance behavior first. This cgroup dirty limit was what was actually playing the role here because the cgroup had only a small amount of memory and so the dirty limit for it was something like 16MB. Dirty throttling is responsible for enforcing that nobody can dirty (significantly) more dirty memory than there's dirty limit. Thus when a task is dirtying pages it periodically enters into balance_dirty_pages() and we let it sleep there to slow down the dirtying. When the system is over dirty limit already (either globally or within a cgroup of the running task), we will not let the task exit from balance_dirty_pages() until the number of dirty pages drops below the limit. So in this particular case, as I already mentioned, there was a cgroup with relatively small amount of memory and as a result with dirty limit set at 16MB. A task from that cgroup has dirtied about 28MB worth of pages in btrfs btree inode and these were practically the only dirty pages in that cgroup. So that means the only way to reduce the dirty pages of that cgroup is to writeback the dirty pages of btrfs btree inode, and only after that those processes can exit balance_dirty_pages(). Now back to the btrfs part, btree_writepages() is responsible for writing back dirty btree inode pages. The problem here is, there is a btrfs internal threshold that if the btree inode's dirty bytes are below the 32M threshold, it will not do any writeback. This behavior is to batch as much metadata as possible so we won't write back those tree blocks and then later re-COW them again for another modification. This internal 32MiB is higher than the existing dirty page size (28MiB), meaning no writeback will happen, causing a deadlock between btrfs and cgroup: - Btrfs doesn't want to write back btree inode until more dirty pages - Cgroup/MM doesn't want more dirty pages for btrfs btree inode Thus any process touching that btree inode is put into sleep until the number of dirty pages is reduced. Thanks Jan Kara a lot for the analysis of the root cause. [ENHANCEMENT] Since kernel commit b55102826d7d ("btrfs: set AS_KERNEL_FILE on the btree_inode"), btrfs btree inode pages will only be charged to the root cgroup which should have a much larger limit than btrfs' 32MiB threshold. So it should not affect newer kernels. But for all current LTS kernels, they are all affected by this problem, and backporting the whole AS_KERNEL_FILE may not be a good idea. Even for newer kernels I still think it's a good idea to get rid of the internal threshold at btree_writepages(), since for most cases cgroup/MM has a better view of full system memory usage than btrfs' fixed threshold. For internal callers using btrfs_btree_balance_dirty() since that function is already doing internal threshold check, we don't need to bother them. But for external callers of btree_writepages(), just respect their requests and write back whatever they want, ignoring the internal btrfs threshold to avoid such deadlock on btree inode dirty page balancing. CC: stable@vger.kernel.org CC: Jan Kara <jack@suse.cz> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06btrfs: zlib: fix the folio leak on S390 hardware accelerationQu Wenruo1-0/+1
[ Upstream commit 0d0f1314e8f86f5205f71f9e31e272a1d008e40b ] [BUG] After commit aa60fe12b4f4 ("btrfs: zlib: refactor S390x HW acceleration buffer preparation"), we no longer release the folio of the page cache of folio returned by btrfs_compress_filemap_get_folio() for S390 hardware acceleration path. [CAUSE] Before that commit, we call kumap_local() and folio_put() after handling each folio. Although the timing is not ideal (it release previous folio at the beginning of the loop, and rely on some extra cleanup out of the loop), it at least handles the folio release correctly. Meanwhile the refactored code is easier to read, it lacks the call to release the filemap folio. [FIX] Add the missing folio_put() for copy_data_into_buffer(). CC: linux-s390@vger.kernel.org # 6.18+ Fixes: aa60fe12b4f4 ("btrfs: zlib: refactor S390x HW acceleration buffer preparation") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-30btrfs: fix missing fields in superblock backup with BLOCK_GROUP_TREEMark Harmstone1-1/+1
[ Upstream commit 1d8f69f453c2e8a2d99b158e58e02ed65031fa6d ] When the BLOCK_GROUP_TREE compat_ro flag is set, the extent root and csum root fields are getting missed. This is because EXTENT_TREE_V2 treated these differently, and when they were split off this special-casing was mistakenly assigned to BGT rather than the rump EXTENT_TREE_V2. There's no reason why the existence of the block group tree should mean that we don't record the details of the last commit's extent root and csum root. Fix the code in backup_super_roots() so that the correct check gets made. Fixes: 1c56ab991903 ("btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-23btrfs: fix deadlock in wait_current_trans() due to ignored transaction typeRobbie Ko1-5/+6
commit 5037b342825df7094a4906d1e2a9674baab50cb2 upstream. When wait_current_trans() is called during start_transaction(), it currently waits for a blocked transaction without considering whether the given transaction type actually needs to wait for that particular transaction state. The btrfs_blocked_trans_types[] array already defines which transaction types should wait for which transaction states, but this check was missing in wait_current_trans(). This can lead to a deadlock scenario involving two transactions and pending ordered extents: 1. Transaction A is in TRANS_STATE_COMMIT_DOING state 2. A worker processing an ordered extent calls start_transaction() with TRANS_JOIN 3. join_transaction() returns -EBUSY because Transaction A is in TRANS_STATE_COMMIT_DOING 4. Transaction A moves to TRANS_STATE_UNBLOCKED and completes 5. A new Transaction B is created (TRANS_STATE_RUNNING) 6. The ordered extent from step 2 is added to Transaction B's pending ordered extents 7. Transaction B immediately starts commit by another task and enters TRANS_STATE_COMMIT_START 8. The worker finally reaches wait_current_trans(), sees Transaction B in TRANS_STATE_COMMIT_START (a blocked state), and waits unconditionally 9. However, TRANS_JOIN should NOT wait for TRANS_STATE_COMMIT_START according to btrfs_blocked_trans_types[] 10. Transaction B is waiting for pending ordered extents to complete 11. Deadlock: Transaction B waits for ordered extent, ordered extent waits for Transaction B This can be illustrated by the following call stacks: CPU0 CPU1 btrfs_finish_ordered_io() start_transaction(TRANS_JOIN) join_transaction() # -EBUSY (Transaction A is # TRANS_STATE_COMMIT_DOING) # Transaction A completes # Transaction B created # ordered extent added to # Transaction B's pending list btrfs_commit_transaction() # Transaction B enters # TRANS_STATE_COMMIT_START # waiting for pending ordered # extents wait_current_trans() # waits for Transaction B # (should not wait!) Task bstore_kv_sync in btrfs_commit_transaction waiting for ordered extents: __schedule+0x2e7/0x8a0 schedule+0x64/0xe0 btrfs_commit_transaction+0xbf7/0xda0 [btrfs] btrfs_sync_file+0x342/0x4d0 [btrfs] __x64_sys_fdatasync+0x4b/0x80 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Task kworker in wait_current_trans waiting for transaction commit: Workqueue: btrfs-syno_nocow btrfs_work_helper [btrfs] __schedule+0x2e7/0x8a0 schedule+0x64/0xe0 wait_current_trans+0xb0/0x110 [btrfs] start_transaction+0x346/0x5b0 [btrfs] btrfs_finish_ordered_io.isra.0+0x49b/0x9c0 [btrfs] btrfs_work_helper+0xe8/0x350 [btrfs] process_one_work+0x1d3/0x3c0 worker_thread+0x4d/0x3e0 kthread+0x12d/0x150 ret_from_fork+0x1f/0x30 Fix this by passing the transaction type to wait_current_trans() and checking btrfs_blocked_trans_types[cur_trans->state] against the given type before deciding to wait. This ensures that transaction types which are allowed to join during certain blocked states will not unnecessarily wait and cause deadlocks. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Cc: Motiejus Jakštys <motiejus@jakstys.lt> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23btrfs: fix memory leaks in create_space_info() error pathsJiasheng Jiang1-2/+6
[ Upstream commit a11224a016d6d1d46a4d9b6573244448a80d4d7f ] In create_space_info(), the 'space_info' object is allocated at the beginning of the function. However, there are two error paths where the function returns an error code without freeing the allocated memory: 1. When create_space_info_sub_group() fails in zoned mode. 2. When btrfs_sysfs_add_space_info_type() fails. In both cases, 'space_info' has not yet been added to the fs_info->space_info list, resulting in a memory leak. Fix this by adding an error handling label to kfree(space_info) before returning. Fixes: 2be12ef79fe9 ("btrfs: Separate space_info create/update") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-23btrfs: send: check for inline extents in range_is_hole_in_parent()Qu Wenruo1-0/+2
[ Upstream commit 08b096c1372cd69627f4f559fb47c9fb67a52b39 ] Before accessing the disk_bytenr field of a file extent item we need to check if we are dealing with an inline extent. This is because for inline extents their data starts at the offset of the disk_bytenr field. So accessing the disk_bytenr means we are accessing inline data or in case the inline data is less than 8 bytes we can actually cause an invalid memory access if this inline extent item is the first item in the leaf or access metadata from other items. Fixes: 82bfb2e7b645 ("Btrfs: incremental send, fix unnecessary hole writes for sparse files") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-23btrfs: release path before iget_failed() in btrfs_read_locked_inode()Filipe Manana1-0/+9
[ Upstream commit 1e1f2055ad5a7a5d548789b334a4473a7665c418 ] In btrfs_read_locked_inode() if we fail to lookup the inode, we jump to the 'out' label with a path that has a read locked leaf and then we call iget_failed(). This can result in a ABBA deadlock, since iget_failed() triggers inode eviction and that causes the release of the delayed inode, which must lock the delayed inode's mutex, and a task updating a delayed inode starts by taking the node's mutex and then modifying the inode's subvolume btree. Syzbot reported the following lockdep splat for this: ====================================================== WARNING: possible circular locking dependency detected syzkaller #0 Not tainted ------------------------------------------------------ btrfs-cleaner/8725 is trying to acquire lock: ffff0000d6826a48 (&delayed_node->mutex){+.+.}-{4:4}, at: __btrfs_release_delayed_node+0xa0/0x9b0 fs/btrfs/delayed-inode.c:290 but task is already holding lock: ffff0000dbeba878 (btrfs-tree-00){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x44/0x2ec fs/btrfs/locking.c:145 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{4:4}: __lock_release kernel/locking/lockdep.c:5574 [inline] lock_release+0x198/0x39c kernel/locking/lockdep.c:5889 up_read+0x24/0x3c kernel/locking/rwsem.c:1632 btrfs_tree_read_unlock+0xdc/0x298 fs/btrfs/locking.c:169 btrfs_tree_unlock_rw fs/btrfs/locking.h:218 [inline] btrfs_search_slot+0xa6c/0x223c fs/btrfs/ctree.c:2133 btrfs_lookup_inode+0xd8/0x38c fs/btrfs/inode-item.c:395 __btrfs_update_delayed_inode+0x124/0xed0 fs/btrfs/delayed-inode.c:1032 btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1118 [inline] __btrfs_commit_inode_delayed_items+0x15f8/0x1748 fs/btrfs/delayed-inode.c:1141 __btrfs_run_delayed_items+0x1ac/0x514 fs/btrfs/delayed-inode.c:1176 btrfs_run_delayed_items_nr+0x28/0x38 fs/btrfs/delayed-inode.c:1219 flush_space+0x26c/0xb68 fs/btrfs/space-info.c:828 do_async_reclaim_metadata_space+0x110/0x364 fs/btrfs/space-info.c:1158 btrfs_async_reclaim_metadata_space+0x90/0xd8 fs/btrfs/space-info.c:1226 process_one_work+0x7e8/0x155c kernel/workqueue.c:3263 process_scheduled_works kernel/workqueue.c:3346 [inline] worker_thread+0x958/0xed8 kernel/workqueue.c:3427 kthread+0x5fc/0x75c kernel/kthread.c:463 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:844 -> #0 (&delayed_node->mutex){+.+.}-{4:4}: check_prev_add kernel/locking/lockdep.c:3165 [inline] check_prevs_add kernel/locking/lockdep.c:3284 [inline] validate_chain kernel/locking/lockdep.c:3908 [inline] __lock_acquire+0x1774/0x30a4 kernel/locking/lockdep.c:5237 lock_acquire+0x14c/0x2e0 kernel/locking/lockdep.c:5868 __mutex_lock_common+0x1d0/0x2678 kernel/locking/mutex.c:598 __mutex_lock kernel/locking/mutex.c:760 [inline] mutex_lock_nested+0x2c/0x38 kernel/locking/mutex.c:812 __btrfs_release_delayed_node+0xa0/0x9b0 fs/btrfs/delayed-inode.c:290 btrfs_release_delayed_node fs/btrfs/delayed-inode.c:315 [inline] btrfs_remove_delayed_node+0x68/0x84 fs/btrfs/delayed-inode.c:1326 btrfs_evict_inode+0x578/0xe28 fs/btrfs/inode.c:5587 evict+0x414/0x928 fs/inode.c:810 iput_final fs/inode.c:1914 [inline] iput+0x95c/0xad4 fs/inode.c:1966 iget_failed+0xec/0x134 fs/bad_inode.c:248 btrfs_read_locked_inode+0xe1c/0x1234 fs/btrfs/inode.c:4101 btrfs_iget+0x1b0/0x264 fs/btrfs/inode.c:5837 btrfs_run_defrag_inode fs/btrfs/defrag.c:237 [inline] btrfs_run_defrag_inodes+0x520/0xdc4 fs/btrfs/defrag.c:309 cleaner_kthread+0x21c/0x418 fs/btrfs/disk-io.c:1516 kthread+0x5fc/0x75c kernel/kthread.c:463 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:844 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(btrfs-tree-00); lock(&delayed_node->mutex); lock(btrfs-tree-00); lock(&delayed_node->mutex); *** DEADLOCK *** 1 lock held by btrfs-cleaner/8725: #0: ffff0000dbeba878 (btrfs-tree-00){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x44/0x2ec fs/btrfs/locking.c:145 stack backtrace: CPU: 0 UID: 0 PID: 8725 Comm: btrfs-cleaner Not tainted syzkaller #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/03/2025 Call trace: show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:499 (C) __dump_stack+0x30/0x40 lib/dump_stack.c:94 dump_stack_lvl+0xd8/0x12c lib/dump_stack.c:120 dump_stack+0x1c/0x28 lib/dump_stack.c:129 print_circular_bug+0x324/0x32c kernel/locking/lockdep.c:2043 check_noncircular+0x154/0x174 kernel/locking/lockdep.c:2175 check_prev_add kernel/locking/lockdep.c:3165 [inline] check_prevs_add kernel/locking/lockdep.c:3284 [inline] validate_chain kernel/locking/lockdep.c:3908 [inline] __lock_acquire+0x1774/0x30a4 kernel/locking/lockdep.c:5237 lock_acquire+0x14c/0x2e0 kernel/locking/lockdep.c:5868 __mutex_lock_common+0x1d0/0x2678 kernel/locking/mutex.c:598 __mutex_lock kernel/locking/mutex.c:760 [inline] mutex_lock_nested+0x2c/0x38 kernel/locking/mutex.c:812 __btrfs_release_delayed_node+0xa0/0x9b0 fs/btrfs/delayed-inode.c:290 btrfs_release_delayed_node fs/btrfs/delayed-inode.c:315 [inline] btrfs_remove_delayed_node+0x68/0x84 fs/btrfs/delayed-inode.c:1326 btrfs_evict_inode+0x578/0xe28 fs/btrfs/inode.c:5587 evict+0x414/0x928 fs/inode.c:810 iput_final fs/inode.c:1914 [inline] iput+0x95c/0xad4 fs/inode.c:1966 iget_failed+0xec/0x134 fs/bad_inode.c:248 btrfs_read_locked_inode+0xe1c/0x1234 fs/btrfs/inode.c:4101 btrfs_iget+0x1b0/0x264 fs/btrfs/inode.c:5837 btrfs_run_defrag_inode fs/btrfs/defrag.c:237 [inline] btrfs_run_defrag_inodes+0x520/0xdc4 fs/btrfs/defrag.c:309 cleaner_kthread+0x21c/0x418 fs/btrfs/disk-io.c:1516 kthread+0x5fc/0x75c kernel/kthread.c:463 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:844 Fix this by releasing the path before calling iget_failed(). Reported-by: syzbot+c1c6edb02bea1da754d8@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/694530c2.a70a0220.207337.010d.GAE@google.com/ Fixes: 69673992b1ae ("btrfs: push cleanup into btrfs_read_locked_inode()") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-17btrfs: fix beyond-EOF write handlingQu Wenruo1-4/+4
[ Upstream commit e9e3b22ddfa760762b696ac6417c8d6edd182e49 ] [BUG] For the following write sequence with 64K page size and 4K fs block size, it will lead to file extent items to be inserted without any data checksum: mkfs.btrfs -s 4k -f $dev > /dev/null mount $dev $mnt xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \ -c "truncate 16k" $mnt/foobar umount $mnt This will result the following 2 file extent items to be inserted (extra trace point added to insert_ordered_extent_file_extent()): btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0 btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384 Note for file offset 60K, we're inserting a file extent without any data checksum. Also note that range [32K, 36K) didn't reach insert_ordered_extent_file_extent(), which is the correct behavior as that OE is fully truncated, should not result any file extent. Although file extent at 60K will be later dropped by btrfs_truncate(), if the transaction got committed after file extent inserted but before the file extent dropping, we will have a small window where we have a file extent beyond EOF and without any data checksum. That will cause "btrfs check" to report error. [CAUSE] The sequence happens like this: - Buffered write dirtied the page cache and updated isize Now the inode size is 64K, with the following page cache layout: 0 16K 32K 48K 64K |/////////////| |//| |//| - Truncate the inode to 16K Which will trigger writeback through: btrfs_setsize() |- truncate_setsize() | Now the inode size is set to 16K | |- btrfs_truncate() |- btrfs_wait_ordered_range() for [16K, u64(-1)] |- btrfs_fdatawrite_range() for [16K, u64(-1)} |- extent_writepage() for folio 0 |- writepage_delalloc() | Generated OE for [0, 16K), [32K, 36K] and [60K, 64K) | |- extent_writepage_io() Then inside extent_writepage_io(), the dirty fs blocks are handled differently: - Submit write for range [0, 16K) As they are still inside the inode size (16K). - Mark OE [32K, 36K) as truncated Since we only call btrfs_lookup_first_ordered_range() once, which returned the first OE after file offset 16K. - Mark all OEs inside range [16K, 64K) as finished Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished. For OE [32K, 36K) since it's already marked as truncated, and its truncated length is 0, no file extent will be inserted. For OE [60K, 64K) it has never been submitted thus has no data checksum, and we insert the file extent as usual. This is the root cause of file extent at 60K to be inserted without any data checksum. - Clear dirty flags for range [16K, 64K) It is the function btrfs_folio_clear_dirty() which searches and clears any dirty blocks inside that range. [FIX] The bug itself was introduced a long time ago, way before subpage and large folio support. At that time, fs block size must match page size, thus the range [cur, end) is just one fs block. But later with subpage and large folios, the same range [cur, end) can have multiple blocks and ordered extents. Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping writeback past i_size") was fixing a bug related to subpage/large folios, but it's still utilizing the old range [cur, end), meaning only the first OE will be marked as truncated. The proper fix here is to make EOF handling block-by-block, not trying to handle the whole range to @end. By this we always locate and truncate the OE for every dirty block. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-17btrfs: use variable for end offset in extent_writepage_io()Filipe Manana1-5/+5
[ Upstream commit 46a23908598f4b8e61483f04ea9f471b2affc58a ] Instead of repeating the expression "start + len" multiple times, store it in a variable and use it where needed. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>