path: root/fs
Age | Commit message | Author | Files | Lines

2023-02-21 | Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu | Linus Torvalds | 4 | -28/+0

Pull RCU updates from Paul McKenney:

 - Documentation updates
 - Miscellaneous fixes, perhaps most notably:
    - Throttling callback invocation based on the number of callbacks that are now ready to invoke instead of on the total number of callbacks
    - Several patches that suppress false-positive boot-time diagnostics, for example, due to lockdep not yet being initialized
    - Make expedited RCU CPU stall warnings dump stacks of any tasks that are blocking the stalled grace period. (Normal RCU CPU stall warnings have done this for many years)
    - Lazy-callback fixes to avoid delays during boot, suspend, and resume. (Note that lazy callbacks must be explicitly enabled, so this should not (yet) affect production use cases)
 - Make kfree_rcu() and friends take advantage of polled grace periods, thus reducing memory footprint by almost two orders of magnitude, admittedly on a microbenchmark

   This also begins the transition from kfree_rcu(p) to kfree_rcu_mightsleep(p). This transition was motivated by bugs where kfree_rcu(p), which can block, was typed instead of the intended kfree_rcu(p, rh)
 - SRCU updates, perhaps most notably fixing a bug that causes SRCU to fail when booted on a system with a non-zero boot CPU. This surprising situation actually happens for kdump kernels on the powerpc architecture

   This also adds an srcu_down_read() and srcu_up_read(), which act like srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side critical section to be handed off from one task to another
 - Clean up the now-useless SRCU Kconfig option

   There are a few more commits that are not yet acked or pulled into maintainer trees, and these will be in a pull request for a later merge window
 - RCU-tasks updates, perhaps most notably these fixes:
    - A strange interaction between PID-namespace unshare and the RCU-tasks grace period that results in a low-probability but very real hang
    - A race between an RCU tasks rude grace period on a single-CPU system and CPU-hotplug addition of the second CPU that can result in a too-short grace period
    - A race between shrinking RCU tasks down to a single callback list and queuing a new callback to some other CPU, but where that queuing is delayed for more than an RCU grace period. This can result in that callback being stranded on the non-boot CPU
 - Torture-test updates and fixes
 - Torture-test scripting updates and fixes
 - Provide additional RCU CPU stall-warning information in kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute timeout limit for expedited RCU CPU stall warnings

* tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
  rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()
  kernel/notifier: Remove CONFIG_SRCU
  init: Remove "select SRCU"
  fs/quota: Remove "select SRCU"
  fs/notify: Remove "select SRCU"
  fs/btrfs: Remove "select SRCU"
  fs: Remove CONFIG_SRCU
  drivers/pci/controller: Remove "select SRCU"
  drivers/net: Remove "select SRCU"
  drivers/md: Remove "select SRCU"
  drivers/hwtracing/stm: Remove "select SRCU"
  drivers/dax: Remove "select SRCU"
  drivers/base: Remove CONFIG_SRCU
  rcu: Disable laziness if lazy-tracking says so
  rcu: Track laziness during boot and suspend
  rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
  rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
  rcu: Align the output of RCU CPU stall warning messages
  rcu: Add RCU stall diagnosis information
  sched: Add helper nr_context_switches_cpu()
  ...

2023-02-21 | Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 2 | -0/+9

Pull scheduler updates from Ingo Molnar:

 - Improve the scalability of the CFS bandwidth unthrottling logic with large number of CPUs.
 - Fix & rework various cpuidle routines, simplify interaction with the generic scheduler code. Add __cpuidle methods as noinstr to objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks.
 - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query previously issued registrations.
 - Limit scheduler slice duration to the sysctl_sched_latency period, to improve scheduling granularity with a large number of SCHED_IDLE tasks.
 - Debuggability enhancement on sys_exit(): warn about disabled IRQs, but also enable them to prevent a cascade of followup problems and repeat warnings.
 - Fix the rescheduling logic in prio_changed_dl().
 - Micro-optimize cpufreq and sched-util methods.
 - Micro-optimize ttwu_runnable()
 - Micro-optimize the idle-scanning in update_numa_stats(), select_idle_capacity() and steal_cookie_task().
 - Update the RSEQ code & self-tests
 - Constify various scheduler methods
 - Remove unused methods
 - Refine __init tags
 - Documentation updates
 - Misc other cleanups, fixes

* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  sched/rt: pick_next_rt_entity(): check list_entry
  sched/deadline: Add more reschedule cases to prio_changed_dl()
  sched/fair: sanitize vruntime of entity being placed
  sched/fair: Remove capacity inversion detection
  sched/fair: unlink misfit task from cpu overutilized
  objtool: mem*() are not uaccess safe
  cpuidle: Fix poll_idle() noinstr annotation
  sched/clock: Make local_clock() noinstr
  sched/clock/x86: Mark sched_clock() noinstr
  x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
  x86/atomics: Always inline arch_atomic64*()
  cpuidle: tracing, preempt: Squash _rcuidle tracing
  cpuidle: tracing: Warn about !rcu_is_watching()
  cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
  cpuidle: drivers: firmware: psci: Dont instrument suspend code
  KVM: selftests: Fix build of rseq test
  exit: Detect and fix irq disabled state in oops
  cpuidle, arm64: Fix the ARM64 cpuidle logic
  cpuidle: mvebu: Fix duplicate flags assignment
  sched/fair: Limit sched slice duration
  ...

2023-02-21 | Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux | Linus Torvalds | 10 | -64/+38

Pull block updates from Jens Axboe:

 - NVMe updates via Christoph:
    - Small improvements to the logging functionality (Amit Engel)
    - Authentication cleanups (Hannes Reinecke)
    - Cleanup and optimize the DMA mapping code in the PCIe driver (Keith Busch)
    - Work around the command effects for Format NVM (Keith Busch)
    - Misc cleanups (Keith Busch, Christoph Hellwig)
    - Fix and cleanup freeing single sgl (Keith Busch)
 - MD updates via Song:
    - Fix a rare crash during the takeover process
    - Don't update recovery_cp when curr_resync is ACTIVE
    - Free writes_pending in md_stop
    - Change active_io to percpu
 - Updates to drbd, inching us closer to unifying the out-of-tree driver with the in-tree one (Andreas, Christoph, Lars, Robert)
 - BFQ update adding support for multi-actuator drives (Paolo, Federico, Davide)
 - Make brd compliant with REQ_NOWAIT (me)
 - Fix for IOPOLL and queue entering, fixing stalled IO waiting on timeouts (me)
 - Fix for REQ_NOWAIT with multiple bios (me)
 - Fix memory leak in blktrace cleanup (Greg)
 - Clean up sbitmap and fix a potential hang (Kemeng)
 - Clean up some bits in BFQ, and fix a bug in the request injection (Kemeng)
 - Clean up the request allocation and issue code, and fix some bugs related to that (Kemeng)
 - ublk updates and fixes:
    - Add support for unprivileged ublk (Ming)
    - Improve device deletion handling (Ming)
    - Misc (Liu, Ziyang)
 - s390 dasd fixes (Alexander, Qiheng)
 - Improve utility of request caching and fixes (Anuj, Xiao)
 - zoned cleanups (Pankaj)
 - More constification for kobjs (Thomas)
 - blk-iocost cleanups (Yu)
 - Remove bio splitting from drivers that don't need it (Christoph)
 - Switch blk-cgroups to use struct gendisk. Some of this is now incomplete as select late reverts were done. (Christoph)
 - Add bvec initialization helpers, and convert callers to use that rather than open-coding it (Christoph)
 - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin, Matthew, Ulf, Zhong)

* tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
  brd: use radix_tree_maybe_preload instead of radix_tree_preload
  block: use proper return value from bio_failfast()
  block: bio-integrity: Copy flags when bio_integrity_payload is cloned
  block: Fix io statistics for cgroup in throttle path
  brd: mark as nowait compatible
  brd: check for REQ_NOWAIT and set correct page allocation mask
  brd: return 0/-error from brd_insert_page()
  block: sync mixed merged request's failfast with 1st bio's
  Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
  Revert "blk-cgroup: pass a gendisk to blkg_lookup"
  Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
  Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
  Revert "blk-cgroup: move the cgroup information to struct gendisk"
  nvme-pci: remove iod use_sgls
  nvme-pci: fix freeing single sgl
  block: ublk: check IO buffer based on flag need_get_data
  s390/dasd: Fix potential memleak in dasd_eckd_init()
  s390/dasd: sort out physical vs virtual pointers usage
  block: Remove the ALLOC_CACHE_SLACK constant
  block: make kobj_type structures constant
  ...

2023-02-21 | Merge tag 'for-6.3/dio-2023-02-16' of git://git.kernel.dk/linux | Linus Torvalds | 17 | -28/+43

Pull legacy dio update from Jens Axboe:
 "We only have a few file systems that use the old dio code, make them select it rather than build it unconditionally"

* tag 'for-6.3/dio-2023-02-16' of git://git.kernel.dk/linux:
  fs: build the legacy direct I/O code conditionally
  fs: move sb_init_dio_done_wq out of direct-io.c

2023-02-21 | Merge tag 'dlm-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm | Linus Torvalds | 6 | -97/+136

Pull dlm updates from David Teigland:
 "This fixes some races in the lowcomms startup and shutdown code that were found by targeted stress testing that quickly and repeatedly joins and leaves lockspaces"

* tag 'dlm-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  fs: dlm: remove unnecessary waker_up() calls
  fs: dlm: move state change into else branch
  fs: dlm: remove newline in log_print
  fs: dlm: reduce the shutdown timeout to 5 secs
  fs: dlm: make dlm sequence id more robust
  fs: dlm: wait until all midcomms nodes detect version
  fs: dlm: ignore unexpected non dlm opts msgs
  fs: dlm: bring back previous shutdown handling
  fs: dlm: send FIN ack back in right cases
  fs: dlm: move sending fin message into state change handling
  fs: dlm: don't set stop rx flag after node reset
  fs: dlm: fix race setting stop tx flag
  fs: dlm: be sure to call dlm_send_queue_flush()
  fs: dlm: fix use after free in midcomms commit
  fs: dlm: start midcomms before scand
  fs/dlm: Remove "select SRCU"
  fs: dlm: fix return value check in dlm_memory_init()

2023-02-20 | Merge tag 'for-6.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds | 57 | -2872/+2346

Pull btrfs updates from David Sterba:
 "The usual mix of performance improvements and new features. The core change is reworking how checksums are processed, with followup cleanups and simplifications. There are two minor changes in block layer and iomap code.

  Features:
   - block group allocation class heuristics:
      - pack files by size (up to 128k, up to 8M, more) to avoid fragmentation in block groups, assuming that file size and life time is correlated, in particular this may help during balance
      - with tracepoints and extensible in the future

  Performance:
   - send: cache directory utimes and only emit the command when necessary
      - speedup up to 10x
      - smaller final stream produced (no redundant utimes commands issued)
      - compatibility not affected
   - fiemap: skip backref checks for shared leaves
      - speedup 3x on sample filesystem with all leaves shared (e.g. on snapshots)
   - micro optimized b-tree key lookup, speedup in metadata operations (sample benchmark: fs_mark +10% of files/sec)

  Core changes:
   - change where checksumming is done in the io path:
      - checksum and read repair does verification at lower layer
      - cascaded cleanups and simplifications
   - raid56 refactoring and cleanups

  Fixes:
   - sysfs: make sure that a run-time change of a feature is correctly tracked by the feature files
   - scrub: better reporting of tree block errors

  Other:
   - locally enable -Wmaybe-uninitialized after fixing all warnings
   - misc cleanups, spelling fixes

  Other code:
   - block: export bio_split_rw
   - iomap: remove IOMAP_F_ZONE_APPEND"

* tag 'for-6.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (109 commits)
  btrfs: make kobj_type structures constant
  btrfs: remove the bdev argument to btrfs_rmap_block
  btrfs: don't rely on unchanging ->bi_bdev for zone append remaps
  btrfs: never return true for reads in btrfs_use_zone_append
  btrfs: pass a btrfs_bio to btrfs_use_append
  btrfs: set bbio->file_offset in alloc_new_bio
  btrfs: use file_offset to limit bios size in calc_bio_boundaries
  btrfs: do unsigned integer division in the extent buffer binary search loop
  btrfs: eliminate extra call when doing binary search on extent buffer
  btrfs: raid56: handle endio in scrub_rbio
  btrfs: raid56: handle endio in recover_rbio
  btrfs: raid56: handle endio in rmw_rbio
  btrfs: raid56: submit the read bios from scrub_assemble_read_bios
  btrfs: raid56: fold rmw_read_wait_recover into rmw_read_bios
  btrfs: raid56: fold recover_assemble_read_bios into recover_rbio
  btrfs: raid56: add a bio_list_put helper
  btrfs: raid56: wait for I/O completion in submit_read_bios
  btrfs: raid56: simplify code flow in rmw_rbio
  btrfs: raid56: simplify error handling and code flow in raid56_parity_write
  btrfs: replace btrfs_wait_tree_block_writeback by wait_on_extent_buffer_writeback
  ...

2023-02-20 | Merge tag 'fixes_for_v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs | Linus Torvalds | 20 | -1553/+1362

Pull UDF and ext2 fixes from Jan Kara:

 - Rewrite of udf directory iteration code to address multiple syzbot reports
 - Fixes to udf extent handling and block mapping code to address several syzbot reports and filesystem corruption issues uncovered by fsx & fsstress
 - Convert udf to kmap_local()
 - Add sanity checks when loading udf bitmaps
 - Drop old VARCONV support which I've never seen used and which was broken for quite some years without anybody noticing
 - Finish conversion of ext2 to kmap_local()
 - One fix to mpage_writepages() on which other udf fixes depend

* tag 'fixes_for_v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (78 commits)
  udf: Avoid directory type conversion failure due to ENOMEM
  udf: Use unsigned variables for size calculations
  udf: remove reporting loc in debug output
  udf: Check consistency of Space Bitmap Descriptor
  udf: Fix file counting in LVID
  udf: Limit file size to 4TB
  udf: Don't return bh from udf_expand_dir_adinicb()
  udf: Convert udf_expand_file_adinicb() to avoid kmap_atomic()
  udf: Convert udf_adinicb_writepage() to memcpy_to_page()
  udf: Switch udf_adinicb_readpage() to kmap_local_page()
  udf: Move udf_adinicb_readpage() to inode.c
  udf: Mark aops implementation static
  udf: Switch to single address_space_operations
  udf: Add handling of in-ICB files to udf_bmap()
  udf: Convert all file types to use udf_write_end()
  udf: Convert in-ICB files to use udf_write_begin()
  udf: Convert in-ICB files to use udf_direct_IO()
  udf: Convert in-ICB files to use udf_writepages()
  udf: Unify .read_folio for normal and in-ICB files
  udf: Fix off-by-one error when discarding preallocation
  ...

2023-02-20 | Merge tag 'fsnotify_for_v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs | Linus Torvalds | 3 | -25/+77

Pull fsnotify updates from Jan Kara:
 "Support for auditing decisions regarding fanotify permission events"

* tag 'fsnotify_for_v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify,audit: Allow audit to use the full permission event response
  fanotify: define struct members to hold response decision context
  fanotify: Ensure consistent variable type for response

2023-02-20 | Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux | Linus Torvalds | 18 | -428/+569

Pull fsverity updates from Eric Biggers:
 "Fix the longstanding implementation limitation that fsverity was only supported when the Merkle tree block size, filesystem block size, and PAGE_SIZE were all equal.

  Specifically, add support for Merkle tree block sizes less than PAGE_SIZE, and make ext4 support fsverity on filesystems where the filesystem block size is less than PAGE_SIZE.

  Effectively, this means that fsverity can now be used on systems with non-4K pages, at least on ext4. These changes have been tested using the verity group of xfstests, newly updated to cover the new code paths.

  Also update fs/verity/ to support verifying data from large folios.

  There's also a similar patch for fs/crypto/, to support decrypting data from large folios, which I'm including in here to avoid a merge conflict between the fscrypt and fsverity branches"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fscrypt: support decrypting data from large folios
  fsverity: support verifying data from large folios
  fsverity.rst: update git repo URL for fsverity-utils
  ext4: allow verity with fs block size < PAGE_SIZE
  fs/buffer.c: support fsverity in block_read_full_folio()
  f2fs: simplify f2fs_readpage_limit()
  ext4: simplify ext4_readpage_limit()
  fsverity: support enabling with tree block size < PAGE_SIZE
  fsverity: support verification with tree block size < PAGE_SIZE
  fsverity: replace fsverity_hash_page() with fsverity_hash_block()
  fsverity: use EFBIG for file too large to enable verity
  fsverity: store log2(digest_size) precomputed
  fsverity: simplify Merkle tree readahead size calculation
  fsverity: use unsigned long for level_start
  fsverity: remove debug messages and CONFIG_FS_VERITY_DEBUG
  fsverity: pass pos and size to ->write_merkle_tree_block
  fsverity: optimize fsverity_cleanup_inode() on non-verity files
  fsverity: optimize fsverity_prepare_setattr() on non-verity files
  fsverity: optimize fsverity_file_open() on non-verity files

2023-02-20 | Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux | Linus Torvalds | 7 | -42/+34

Pull fscrypt updates from Eric Biggers:
 "Simplify the implementation of the test_dummy_encryption mount option by adding the 'test dummy key' on-demand"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
  fscrypt: clean up fscrypt_add_test_dummy_key()
  fs/super.c: stop calling fscrypt_destroy_keyring() from __put_super()
  f2fs: stop calling fscrypt_add_test_dummy_key()
  ext4: stop calling fscrypt_add_test_dummy_key()
  fscrypt: add the test dummy encryption key on-demand

2023-02-20 | Merge tag 'erofs-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs | Linus Torvalds | 14 | -748/+653

Pull erofs updates from Gao Xiang:
 "The most noticeable feature for this cycle is per-CPU kthread decompression since Android use cases need low-latency I/O handling in order to ensure the app runtime performance, currently unbounded workqueue latencies are not quite good for production on many aarch64 hardwares and thus we need to introduce a deterministic expectation for these. Decompression is CPU-intensive and it is sleepable for EROFS, so other alternatives like decompression under softirq contexts are not considered. More details are in the corresponding commit message.

  Others are random cleanups around the whole codebase and we will continue to clean up further in the next few months.

  Due to Lunar New Year holidays, some other new features were not completely reviewed and solidified as expected and we may delay them into the next version.

  Summary:
   - Add per-cpu kthreads for low-latency decompression for Android use cases
   - Get rid of tagged pointer helpers since they are rarely used now
   - Several code cleanups to reduce codebase
   - Documentation and MAINTAINERS updates"

* tag 'erofs-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (21 commits)
  erofs: fix an error code in z_erofs_init_zip_subsystem()
  erofs: unify anonymous inodes for blob
  erofs: relinquish volume with mutex held
  erofs: maintain cookies of share domain in self-contained list
  erofs: remove unused device mapping in meta routine
  MAINTAINERS: erofs: Add Documentation/ABI/testing/sysfs-fs-erofs
  Documentation/ABI: sysfs-fs-erofs: update supported features
  erofs: remove unused EROFS_GET_BLOCKS_RAW flag
  erofs: update print symbols for various flags in trace
  erofs: make kobj_type structures constant
  erofs: add per-cpu threads for decompression as an option
  erofs: tidy up internal.h
  erofs: get rid of z_erofs_do_map_blocks() forward declaration
  erofs: move zdata.h into zdata.c
  erofs: remove tagged pointer helpers
  erofs: avoid tagged pointers to mark sync decompression
  erofs: get rid of erofs_inode_datablocks()
  erofs: simplify iloc()
  erofs: get rid of debug_one_dentry()
  erofs: remove linux/buffer_head.h dependency
  ...

2023-02-20 | Merge tag 'fs.acl.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping | Linus Torvalds | 1 | -1/+1

Pull vfs acl update from Christian Brauner:
 "This contains a single update to the internal get acl method and replaces an open-coded cmpxchg() comparison with try_cmpxchg(). It's clearer and also beneficial on some architectures"

* tag 'fs.acl.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  posix_acl: Use try_cmpxchg in get_acl

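The difference between an open-coded cmpxchg() comparison and the try_cmpxchg() form can be sketched in user space with C11 atomics. This is only an analogy: `cmpxchg_like`, `install_open_coded` and `install_try` are hypothetical names, and `atomic_compare_exchange_strong` stands in for the kernel primitives. It is not the code from get_acl.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space stand-in for a cmpxchg()-style primitive: it returns the
 * value that was stored in *p before the operation. */
uintptr_t cmpxchg_like(_Atomic uintptr_t *p, uintptr_t old, uintptr_t new_val)
{
	atomic_compare_exchange_strong(p, &old, new_val);
	return old; /* unchanged on success, the observed value on failure */
}

/* Open-coded pattern: the caller compares the returned value itself. */
bool install_open_coded(_Atomic uintptr_t *slot, uintptr_t expected,
			uintptr_t val)
{
	return cmpxchg_like(slot, expected, val) == expected;
}

/* try_cmpxchg()-style pattern: the primitive reports success directly
 * and, on failure, writes the observed value back into *expected. */
bool install_try(_Atomic uintptr_t *slot, uintptr_t *expected, uintptr_t val)
{
	return atomic_compare_exchange_strong(slot, expected, val);
}
```

In the second form the caller neither repeats the comparison nor reloads the current value after a failed attempt, which is the readability point the pull message makes.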
2023-02-20 | Merge tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping | Linus Torvalds | 2 | -6/+20

Pull vfs hardening update from Christian Brauner:
 "Jan pointed out that during shutdown both filp_close() and super block destruction will use basic printk logging when bugs are detected. This causes issues in a few scenarios:

   - Tools like syzkaller cannot figure out that the logged message indicates a bug.
   - Users that explicitly opt in to have the kernel bug on data corruption by selecting CONFIG_BUG_ON_DATA_CORRUPTION should see the kernel crash when they did actually select that option.
   - When there are busy inodes after the superblock is shut down, later access to such busy inodes walks through freed memory. It would be better to cleanly crash instead.

  All of this can be addressed by using the already existing CHECK_DATA_CORRUPTION() macro in these places when kernel bugs are detected. Its logging improvement is useful for all users.

  Otherwise this only has a meaningful behavioral effect when users do select CONFIG_BUG_ON_DATA_CORRUPTION which means this is backward compatible for regular users"

* tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected

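The behaviour described above (warn by default, crash only when the user opted in) can be illustrated with a small user-space sketch. Everything here is hypothetical: `SKETCH_BUG_ON_DATA_CORRUPTION` stands in for the Kconfig option, and `check_data_corruption_sketch` and `close_file_sketch` are invented for illustration. It is not the kernel macro or the patched call sites.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for CONFIG_BUG_ON_DATA_CORRUPTION; 0 = warn only. */
#ifndef SKETCH_BUG_ON_DATA_CORRUPTION
#define SKETCH_BUG_ON_DATA_CORRUPTION 0
#endif

/* Returns true when corruption was detected so the caller can bail out.
 * With the opt-in flag set, it aborts instead of returning. */
bool check_data_corruption_sketch(bool corrupted, const char *what)
{
	if (!corrupted)
		return false;
	if (SKETCH_BUG_ON_DATA_CORRUPTION) {
		fprintf(stderr, "BUG: %s\n", what);
		abort(); /* crash cleanly rather than keep running */
	}
	fprintf(stderr, "WARNING: %s\n", what);
	return true;
}

/* Hypothetical caller: refuses to close a file whose count is already 0.
 * Returns 1 if the close proceeded, 0 if it was skipped. */
int close_file_sketch(long refcount)
{
	if (check_data_corruption_sketch(refcount == 0,
					 "close of a file with zero count"))
		return 0;
	return 1;
}
```

With the flag left at 0 the caller only logs a warning and skips the operation, matching the "backward compatible for regular users" remark. Building with the flag set to 1 turns the same condition into an immediate abort.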
2023-02-20 | Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping | Linus Torvalds | 293 | -2000/+2154

Pull vfs idmapping updates from Christian Brauner:

 - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap.

   Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs.

   This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap.

   Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments.

   Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably.

   Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings.

   We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements.

   In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs.

 - Enable idmapped mounts for tmpfs and fulfill a longstanding request.

   A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this.

   However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up.

   As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests.

* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
  shmem: support idmapped mounts for tmpfs
  fs: move mnt_idmap
  fs: port vfs{g,u}id helpers to mnt_idmap
  fs: port fs{g,u}id helpers to mnt_idmap
  fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
  fs: port i_{g,u}id_{needs_}update() to mnt_idmap
  quota: port to mnt_idmap
  fs: port privilege checking helpers to mnt_idmap
  fs: port inode_owner_or_capable() to mnt_idmap
  fs: port inode_init_owner() to mnt_idmap
  fs: port acl to mnt_idmap
  fs: port xattr to mnt_idmap
  fs: port ->permission() to pass mnt_idmap
  fs: port ->fileattr_set() to pass mnt_idmap
  fs: port ->set_acl() to pass mnt_idmap
  fs: port ->get_acl() to pass mnt_idmap
  fs: port ->tmpfile() to pass mnt_idmap
  fs: port ->rename() to pass mnt_idmap
  fs: port ->mknod() to pass mnt_idmap
  fs: port ->mkdir() to pass mnt_idmap
  ...

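The point about distinct types can be shown with a deliberately simplified sketch. All names ending in `_sketch` are hypothetical, and the "mapping" is a plain offset rather than the range-based mapping the kernel uses. The example shows only the type-safety idea: a helper that takes a mount idmap cannot silently be handed a filesystem-level namespace.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
struct user_ns_sketch { uint32_t fs_shift; };    /* filesystem-level mapping */
struct mnt_idmap_sketch { uint32_t mnt_shift; }; /* mount-level mapping */

/* Distinct wrapper types so ids from different domains cannot be mixed. */
typedef struct { uint32_t val; } kuid_sketch_t;
typedef struct { uint32_t val; } vfsuid_sketch_t;

/* Accepts only a mount idmap. Passing a struct user_ns_sketch pointer
 * here is an incompatible-pointer-type diagnostic at compile time. */
vfsuid_sketch_t make_vfsuid_sketch(const struct mnt_idmap_sketch *idmap,
				   kuid_sketch_t kuid)
{
	vfsuid_sketch_t v = { kuid.val + idmap->mnt_shift };
	return v;
}
```

A call such as `make_vfsuid_sketch(&some_user_ns, kuid)` is flagged by the compiler, whereas an interface that took the same namespace pointer type for both roles would accept the mix-up silently.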
2023-02-20 | Merge tag 'iversion-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux | Linus Torvalds | 9 | -48/+126

Pull i_version updates from Jeff Layton:
 "This overhauls how we handle i_version queries from nfsd.

  Instead of having special routines and grabbing the i_version field directly out of the inode in some cases, we've moved most of the handling into the various filesystems' getattr operations. As a bonus, this makes ceph's change attribute usable by knfsd as well.

  This should pave the way for future work to make this value queryable by userland, and to make it more resilient against rolling back on a crash"

* tag 'iversion-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  nfsd: remove fetch_iversion export operation
  nfsd: use the getattr operation to fetch i_version
  nfsd: move nfsd4_change_attribute to nfsfh.c
  ceph: report the inode version in getattr if requested
  nfs: report the inode version in getattr if requested
  vfs: plumb i_version handling into struct kstat
  fs: clarify when the i_version counter must be updated
  fs: uninline inode_query_iversion

2023-02-20 | Merge tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux | Linus Torvalds | 41 | -26/+64

Pull file locking updates from Jeff Layton:
 "The main change here is that I've broken out most of the file locking definitions into a new header file. I also went ahead and completed the removal of locks_inode function"

* tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: remove locks_inode
  filelock: move file locking definitions to separate header file

2023-02-18 | Merge tag 'mm-hotfixes-stable-2023-02-17-15-16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 3 | -1/+23

Pull misc fixes from Andrew Morton:
 "Six hotfixes. Five are cc:stable: four for MM, one for nilfs2. Also a MAINTAINERS update"

* tag 'mm-hotfixes-stable-2023-02-17-15-16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  nilfs2: fix underflow in second superblock position calculations
  hugetlb: check for undefined shift on 32 bit architectures
  mm/migrate: fix wrongly apply write bit after mkdirty on sparc64
  MAINTAINERS: update FPU EMULATOR web page
  mm/MADV_COLLAPSE: set EAGAIN on unexpected page refcount
  mm/filemap: fix page end in filemap_get_read_batch

2023-02-18 | nilfs2: fix underflow in second superblock position calculations | Ryusuke Konishi | 3 | -1/+23

Macro NILFS_SB2_OFFSET_BYTES, which computes the position of the second superblock, underflows when the argument device size is less than 4096 bytes. Therefore, when using this macro, it is necessary to check in advance that the device size is not less than a lower limit, or at least that underflow does not occur.

The current nilfs2 implementation lacks this check, causing out-of-bound block access when mounting devices smaller than 4096 bytes:

  I/O error, dev loop0, sector 36028797018963960 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
  NILFS (loop0): unable to read secondary superblock (blocksize = 1024)

In addition, when trying to resize the filesystem to a size below 4096 bytes, this underflow occurs in nilfs_resize_fs(), passing a huge number of segments to nilfs_sufile_resize(), corrupting parameters such as the number of segments in superblocks. This causes excessive loop iterations in nilfs_sufile_resize() during a subsequent resize ioctl, causing semaphore ns_segctor_sem to block for a long time and hang the writer thread:

  INFO: task segctord:5067 blocked for more than 143 seconds.
        Not tainted 6.2.0-rc8-syzkaller-00015-gf6feea56f66d #0
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:segctord state:D stack:23456 pid:5067 ppid:2 flags:0x00004000
  Call Trace:
   <TASK>
   context_switch kernel/sched/core.c:5293 [inline]
   __schedule+0x1409/0x43f0 kernel/sched/core.c:6606
   schedule+0xc3/0x190 kernel/sched/core.c:6682
   rwsem_down_write_slowpath+0xfcf/0x14a0 kernel/locking/rwsem.c:1190
   nilfs_transaction_lock+0x25c/0x4f0 fs/nilfs2/segment.c:357
   nilfs_segctor_thread_construct fs/nilfs2/segment.c:2486 [inline]
   nilfs_segctor_thread+0x52f/0x1140 fs/nilfs2/segment.c:2570
   kthread+0x270/0x300 kernel/kthread.c:376
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
   </TASK>
  ...
  Call Trace:
   <TASK>
   folio_mark_accessed+0x51c/0xf00 mm/swap.c:515
   __nilfs_get_page_block fs/nilfs2/page.c:42 [inline]
   nilfs_grab_buffer+0x3d3/0x540 fs/nilfs2/page.c:61
   nilfs_mdt_submit_block+0xd7/0x8f0 fs/nilfs2/mdt.c:121
   nilfs_mdt_read_block+0xeb/0x430 fs/nilfs2/mdt.c:176
   nilfs_mdt_get_block+0x12d/0xbb0 fs/nilfs2/mdt.c:251
   nilfs_sufile_get_segment_usage_block fs/nilfs2/sufile.c:92 [inline]
   nilfs_sufile_truncate_range fs/nilfs2/sufile.c:679 [inline]
   nilfs_sufile_resize+0x7a3/0x12b0 fs/nilfs2/sufile.c:777
   nilfs_resize_fs+0x20c/0xed0 fs/nilfs2/super.c:422
   nilfs_ioctl_resize fs/nilfs2/ioctl.c:1033 [inline]
   nilfs_ioctl+0x137c/0x2440 fs/nilfs2/ioctl.c:1301
   ...

This fixes these issues by inserting appropriate minimum device size checks or anti-underflow checks, depending on where the macro is used.

Link: https://lkml.kernel.org/r/0000000000004e1dfa05f4a48e6b@google.com
Link: https://lkml.kernel.org/r/20230214224043.24141-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: <syzbot+f0c4082ce5ebebdac63b@syzkaller.appspotmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

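The underflow described in this commit message can be reproduced with plain unsigned arithmetic in user space. The sketch below assumes the macro has the shape `(((devsize) >> 12) - 1) << 12`, which is not quoted in the message itself. `SB2_MIN_DEVSIZE` and `sb2_offset_checked` are hypothetical names used for illustration and are not the checks the patch adds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed shape of NILFS_SB2_OFFSET_BYTES: round the device size down to
 * a 4 KiB boundary, then step back one 4 KiB block. */
#define SB2_OFFSET_BYTES(devsize) ((((devsize) >> 12) - 1) << 12)

/* Hypothetical lower bound: below this, (devsize >> 12) is 0 and the
 * subtraction wraps around. */
#define SB2_MIN_DEVSIZE UINT64_C(4096)

/* Unchecked: wraps to a huge offset when devsize < 4096. */
uint64_t sb2_offset_unchecked(uint64_t devsize)
{
	return SB2_OFFSET_BYTES(devsize);
}

/* Checked: refuse to compute the offset for a too-small device. */
bool sb2_offset_checked(uint64_t devsize, uint64_t *offset)
{
	if (devsize < SB2_MIN_DEVSIZE)
		return false;
	*offset = SB2_OFFSET_BYTES(devsize);
	return true;
}
```

For a 1024-byte device the unchecked result is 2^64 - 4096 bytes. Divided by a 512-byte sector size that is 36028797018963960, the sector number in the I/O error quoted above, which suggests the assumed macro shape matches the reported failure.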
2023-02-17Revert "NFSv4.2: Change the default KConfig value for READ_PLUS"Anna Schumaker1-4/+4
This reverts commit 7fd461c47c6cfab4ca4d003790ec276209e52978. Unfortunately, it has come to our attention that there is still a bug somewhere in the READ_PLUS code that can cause nfsroot systems on ARM to crash during boot. Let's do the right thing and revert this change so we don't break people's nfsroot setups. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-02-16erofs: fix an error code in z_erofs_init_zip_subsystem()Dan Carpenter1-1/+3
Return -ENOMEM if alloc_workqueue() fails. Don't return success. Fixes: d8a650adf429 ("erofs: add per-cpu threads for decompression as an option") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/Y+4d0FRsUq8jPoOu@kili Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
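The error-code pattern can be sketched in user space as follows. Everything here is a hypothetical stand-in (struct workqueue, init_subsystem() and the two allocator hooks are not erofs code); the only point is that a failed allocation must produce a negative errno rather than fall through with a zero return value.

```c
#include <errno.h>
#include <stddef.h>

struct workqueue { int unused; };	/* hypothetical stand-in type */

static struct workqueue the_wq;
static struct workqueue *wq;

/* Allocator hooks so that failure can be simulated in this sketch. */
static struct workqueue *alloc_ok(void)   { return &the_wq; }
static struct workqueue *alloc_fail(void) { return NULL; }

/* Returns 0 on success and -ENOMEM when the allocation fails. */
static int init_subsystem(struct workqueue *(*alloc)(void))
{
	wq = alloc();
	if (!wq)
		return -ENOMEM;	/* do not report success on failure */
	return 0;
}
```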
2023-02-15Merge tag 'nfsd-6.2-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds1-1/+1
Pull nfsd fix from Chuck Lever: - Fix a teardown bug in the new nfs4_file hashtable * tag 'nfsd-6.2-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: don't destroy global nfs4_file table in per-net shutdown
2023-02-15btrfs: make kobj_type structures constantThomas Weißschuh1-6/+6
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: remove the bdev argument to btrfs_rmap_blockChristoph Hellwig3-10/+4
The only user in the zoned remap code is gone now, so remove the argument. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: don't rely on unchanging ->bi_bdev for zone append remapsChristoph Hellwig3-26/+26
btrfs_record_physical_zoned relies on a bio->bi_bdev sampled in the bio_end_io handler to find the reverse map for remapping the zone append write, but stacked block device drivers can and usually do change bi_bdev when sending on the bio to a lower device. This can happen e.g. with the nvme-multipath driver when an NVMe SSD sets the shared namespace bit. But there is no real need for the bdev in btrfs_record_physical_zoned, as it is only passed to btrfs_rmap_block, which uses it to pick the mapping to report if there are multiple reverse mappings. As zone writes can only do simple non-mirror writes right now, and anything more complex will use the stripe tree, there is no chance of the multiple mappings case actually happening. Instead open code the subset of btrfs_rmap_block in btrfs_record_physical_zoned, which also removes a memory allocation, and remove the bdev field in the ordered extent. Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: never return true for reads in btrfs_use_zone_appendChristoph Hellwig1-0/+3
Using Zone Append only makes sense for writes to the device, so check that in btrfs_use_zone_append. This avoids the possibility of artificially limited read size on zoned file systems. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: pass a btrfs_bio to btrfs_use_appendChristoph Hellwig4-6/+7
struct btrfs_bio has all the information needed for btrfs_use_append, so pass that instead of a btrfs_inode and file_offset. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: set bbio->file_offset in alloc_new_bioChristoph Hellwig1-2/+1
Instead of digging into the bio_vec in submit_one_bio, set file_offset at bio allocation time from the provided parameter. This also ensures that the file_offset is available all the time when building up the bio payload. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: use file_offset to limit bios size in calc_bio_boundariesChristoph Hellwig1-2/+2
btrfs_ordered_extent->disk_bytenr can be rewritten by the zoned I/O completion handler, and thus in general it is not a good idea to use it to limit I/O size. But the maximum bio size calculation can easily be done using the file_offset fields in the btrfs_ordered_extent and btrfs_bio structures, so switch to that instead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: do unsigned integer division in the extent buffer binary search loopFilipe Manana2-7/+12
In the search loop of the binary search function, we are doing a division by 2 of the sum of the high and low slots. Because the slots are integers, the generated assembly code for it is the following on x86_64:

   0x00000000000141f1 <+145>:   mov    %eax,%ebx
   0x00000000000141f3 <+147>:   shr    $0x1f,%ebx
   0x00000000000141f6 <+150>:   add    %eax,%ebx
   0x00000000000141f8 <+152>:   sar    %ebx

It's a few more instructions than a simple right shift, because signed integer division needs to round towards zero. However we know that slots can never be negative (btrfs_header_nritems() returns a u32), so we can instead use unsigned types for the low and high slots and therefore use unsigned integer division, which results in a single instruction on x86_64:

   0x00000000000141f0 <+144>:   shr    %ebx

So use unsigned types for the slots and therefore unsigned division.

This is part of a small patchset comprised of the following two patches:

   btrfs: eliminate extra call when doing binary search on extent buffer
   btrfs: do unsigned integer division in the extent buffer binary search loop

The following fs_mark test was run on a non-debug kernel (Debian's default kernel config) before and after applying the patchset:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS="-O no-holes -R free-space-tree"
   FILES=100000
   THREADS=$(nproc --all)
   FILE_SIZE=0

   umount $DEV &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   OPTS="-S 0 -L 6 -n $FILES -s $FILE_SIZE -t $THREADS -k"
   for ((i = 1; i <= $THREADS; i++)); do
       OPTS="$OPTS -d $MNT/d$i"
   done

   fs_mark $OPTS

   umount $MNT

Results before applying patchset:

   FSUse%        Count         Size    Files/sec     App Overhead
        2      1200000            0     174472.0         11549868
        4      2400000            0     253503.0         11694618
        4      3600000            0     257833.1         11611508
        6      4800000            0     247089.5         11665983
        6      6000000            0     211296.1         12121244
       10      7200000            0     187330.6         12548565

Results after applying patchset:

   FSUse%        Count         Size    Files/sec     App Overhead
        2      1200000            0     207556.0         11393252
        4      2400000            0     266751.1         11347909
        4      3600000            0     274397.5         11270058
        6      4800000            0     259608.4         11442250
        6      6000000            0     238895.8         11635921
        8      7200000            0     211942.2         11873825

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
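The difference between the two midpoint computations can be sketched in plain C. The helper names mid_signed() and mid_unsigned() are hypothetical and exist only for this illustration; the instruction counts quoted in the commit message depend on the compiler and are not reproduced here.

```c
#include <stdint.h>

/* Signed midpoint: C division truncates towards zero, so the compiler
 * cannot use a bare arithmetic shift, which would round a negative odd
 * sum towards minus infinity instead. */
static int mid_signed(int low, int high)
{
	return (low + high) / 2;
}

/* Unsigned midpoint: division by 2 is exactly a logical right shift. */
static uint32_t mid_unsigned(uint32_t low, uint32_t high)
{
	return (low + high) / 2;
}
```

For non-negative slots both helpers return the same value; they only differ in what the compiler has to guard against, for example mid_signed(-3, 0) must yield -1 rather than -2.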
2023-02-15btrfs: eliminate extra call when doing binary search on extent bufferFilipe Manana2-13/+18
The function btrfs_bin_search() is just a wrapper around the function generic_bin_search(), which passes the same arguments plus a default low slot with a value of 0. This adds an unnecessary extra function call, since btrfs_bin_search() is not static. So improve on this by making btrfs_bin_search() an inline function that calls generic_bin_search(), renaming the latter to btrfs_generic_bin_search() and exporting it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
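The wrapper arrangement can be sketched as follows, assuming a plain sorted array of 64-bit keys instead of an extent buffer. generic_search() and bin_search() are hypothetical names; the real btrfs functions take different arguments.

```c
#include <stdint.h>

/* Out-of-line search over keys[first_slot..nritems). Returns 0 and the
 * matching slot if the key is found, otherwise 1 and the slot where the
 * key would have to be inserted. */
int generic_search(const uint64_t *keys, uint32_t first_slot,
		   uint32_t nritems, uint64_t key, uint32_t *slot)
{
	uint32_t low = first_slot;
	uint32_t high = nritems;

	while (low < high) {
		uint32_t mid = (low + high) / 2;	/* unsigned division */

		if (keys[mid] < key) {
			low = mid + 1;
		} else if (keys[mid] > key) {
			high = mid;
		} else {
			*slot = mid;
			return 0;
		}
	}
	*slot = low;
	return 1;
}

/* Inline wrapper: supplies the default low slot of 0, so callers do not
 * pay for a second out-of-line call. */
static inline int bin_search(const uint64_t *keys, uint32_t nritems,
			     uint64_t key, uint32_t *slot)
{
	return generic_search(keys, 0, nritems, key, slot);
}
```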
2023-02-15btrfs: raid56: handle endio in scrub_rbioChristoph Hellwig1-11/+7
The only caller of scrub_rbio calls rbio_orig_end_io right after it, move it into scrub_rbio to match the other work item helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: raid56: handle endio in recover_rbioChristoph Hellwig1-18/+9
Both callers of recover_rbio call rbio_orig_end_io right after it, so move the call into the shared function. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: raid56: handle endio in rmw_rbioChristoph Hellwig1-20/+10
Both callers of rmw_rbio call rbio_orig_end_io right after it, so move the call into the shared function. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: raid56: submit the read bios from scrub_assemble_read_biosChristoph Hellwig1-23/+13
Instead of filling in a bio_list and submitting the bios in the only caller, do that in scrub_assemble_read_bios. This removes the need to pass the bio_list, and also makes it clear that the extra bio_list cleanup in the caller is entirely pointless. Rename the function to scrub_read_bios to make it clear that the bios are not only assembled. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: raid56: fold rmw_read_wait_recover into rmw_read_biosChristoph Hellwig1-46/+23
There is very little extra code in rmw_read_bios, and a large part of it is the superfluous extra cleanup of the bio list. Merge the two functions, and only clean up the bio list after it has been added to but before it has been emptied again by submit_read_wait_bio_list. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: raid56: fold recover_assemble_read_bios into recover_rbioChristoph Hellwig1-40/+21
There is very little extra code in recover_rbio, and a large part of it is the superfluous extra cleanup of the bio list. Merge the two functions, and only clean up the bio list after it has been added to but before it has been emptied again by submit_read_wait_bio_list. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: raid56: add a bio_list_put helperChristoph Hellwig1-28/+16
Add a helper to put all bios in a list. This does not need to be added to block layer as there are no other users of such code. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
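A user-space analogue of such a helper might look like the sketch below. struct item, struct item_list, list_pop(), item_put() and item_list_put() are all hypothetical stand-ins for the bio, the bio_list and their kernel helpers; the plain integer reference count is an assumption made to keep the example short.

```c
#include <stddef.h>

struct item {
	struct item *next;
	int refs;		/* simplified, non-atomic reference count */
};

struct item_list {
	struct item *head;
};

/* Detach and return the first item, or NULL if the list is empty. */
static struct item *list_pop(struct item_list *list)
{
	struct item *it = list->head;

	if (it) {
		list->head = it->next;
		it->next = NULL;
	}
	return it;
}

/* Drop one reference on an item. */
static void item_put(struct item *it)
{
	it->refs--;
}

/* The helper itself: pop every item and drop its reference. */
static void item_list_put(struct item_list *list)
{
	struct item *it;

	while ((it = list_pop(list)))
		item_put(it);
}
```

Centralizing the loop means each error path only needs a single call, and the list is guaranteed to be empty afterwards.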
2023-02-15btrfs: raid56: wait for I/O completion in submit_read_biosChristoph Hellwig1-7/+6
In addition to setting up the end_io handler and submitting the bios in submit_read_bios, also wait for them to be completed instead of waiting for the completion manually in all three callers. Rename submit_read_bios to submit_read_wait_bio_list to make it clear it waits for the bios as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: raid56: simplify code flow in rmw_rbioChristoph Hellwig1-15/+13
Remove the write goto label by moving the data page allocation and data read into the branch. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: raid56: simplify error handling and code flow in raid56_parity_writeChristoph Hellwig1-22/+15
Handle the error return on alloc_rbio failure directly instead of using a goto and remove the queue_rbio goto label by moving the plugged check into the if branch. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: replace btrfs_wait_tree_block_writeback by wait_on_extent_buffer_writebackJosef Bacik1-9/+3
This is used in the tree-log code and is a holdover from previous iterations of extent buffer writeback. We can simply use wait_on_extent_buffer_writeback here, and remove btrfs_wait_tree_block_writeback completely as it's equivalent (waiting on page write writeback). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: combine btrfs_clear_buffer_dirty and clear_extent_buffer_dirtyJosef Bacik3-18/+19
btrfs_clear_buffer_dirty just does the test_clear_bit() and then calls clear_extent_buffer_dirty and does the dirty metadata accounting. Combine this into clear_extent_buffer_dirty and make the result btrfs_clear_buffer_dirty. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: rename btrfs_clean_tree_block to btrfs_clear_buffer_dirtyJosef Bacik8-22/+22
btrfs_clean_tree_block is a misnomer: it's just clear_extent_buffer_dirty with some extra accounting around it. Rename this to btrfs_clear_buffer_dirty to make it clearer that it belongs with its setter, btrfs_mark_buffer_dirty. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: do not increment dirty_metadata_bytes in set_btree_ioerrJosef Bacik1-7/+0
We only add if we set the extent buffer dirty, and we subtract when we clear the extent buffer dirty. If we end up in set_btree_ioerr we have already cleared the buffer dirty, and we aren't resetting dirty on the extent buffer, so this is simply wrong. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
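The accounting rule stated above (add only on the clean-to-dirty transition, subtract only on the dirty-to-clean transition) can be sketched like this. The structures and function names are hypothetical, and the sketch uses a plain bool and a plain counter where the kernel would need atomic operations.

```c
#include <stdbool.h>
#include <stdint.h>

struct fs_info {
	int64_t dirty_metadata_bytes;
};

struct buffer {
	bool dirty;
	uint32_t len;
};

/* Account the buffer only when it goes from clean to dirty. */
static void mark_dirty(struct fs_info *fs, struct buffer *b)
{
	bool was_dirty = b->dirty;

	b->dirty = true;
	if (!was_dirty)
		fs->dirty_metadata_bytes += b->len;
}

/* Unaccount the buffer only when it goes from dirty to clean. */
static void clear_dirty(struct fs_info *fs, struct buffer *b)
{
	bool was_dirty = b->dirty;

	b->dirty = false;
	if (was_dirty)
		fs->dirty_metadata_bytes -= b->len;
}
```

Under this rule an error path that runs after the dirty bit has already been cleared must not touch the counter again, otherwise the total drifts away from the set of buffers that are actually dirty.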
2023-02-15btrfs: replace clearing extent buffer dirty bit with btrfs_clean_blockJosef Bacik2-23/+20
Now that we're passing in the trans into btrfs_clean_tree_block, we can easily roll in the handling of the !trans case and replace all occurrences of

	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags))
		clear_extent_buffer_dirty(eb);

with

	btrfs_tree_lock(eb);
	btrfs_clean_tree_block(eb);
	btrfs_tree_unlock(eb);

We need the lock because if we are actually dirty we need to make sure we aren't racing with anything that's starting writeout currently. This also makes sure that we're accounting fs_info->dirty_metadata_bytes appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: add trans argument to btrfs_clean_tree_blockJosef Bacik8-27/+29
We check the header generation in the extent buffer against the current running transaction id to see if it's safe to clear DIRTY on this buffer. Generally speaking if we're clearing the buffer dirty we're holding the transaction open, but in the case of cleaning up an aborted transaction we don't, so we have extra checks in that path to check the transid. To allow for a future cleanup go ahead and pass in the trans handle so we don't have to rely on ->running_transaction being set. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: always lock the block before calling btrfs_clean_tree_blockJosef Bacik1-2/+1
We want to clean up the dirty handling for extent buffers so it's a little more consistent, so skip the check for generation == transid and simply always lock the extent buffer before calling btrfs_clean_tree_block. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15iomap: remove IOMAP_F_ZONE_APPENDChristoph Hellwig1-8/+2
No users left now that btrfs takes REQ_OP_WRITE bios from iomap and splits and converts them to REQ_OP_ZONE_APPEND internally. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: split zone append bios in btrfs_submit_bioChristoph Hellwig7-200/+64
The current btrfs zoned device support is a little cumbersome in the data I/O path as it requires the callers to not issue I/O larger than the supported ZONE_APPEND size of the underlying device. This leads to a lot of extra accounting. Instead change btrfs_submit_bio so that it can take write bios of arbitrary size and form from the upper layers, and just split them internally to the ZONE_APPEND queue limits. Then remove all the upper layer warts catering to limited write sizes on zoned devices, including the extra refcount in the compressed_bio. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
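The idea of splitting at the lowest layer, rather than asking every caller to respect the limit, can be sketched as below. struct chunk and split_write() are hypothetical, and the sketch treats the limit as a simple byte count; the real code splits bios against the full set of queue limits.

```c
#include <stddef.h>
#include <stdint.h>

struct chunk {
	uint64_t offset;
	uint64_t len;
};

/* Split [offset, offset + len) into pieces of at most max_append bytes.
 * Writes up to cap chunks into out and returns how many were produced. */
static size_t split_write(uint64_t offset, uint64_t len, uint64_t max_append,
			  struct chunk *out, size_t cap)
{
	size_t n = 0;

	if (max_append == 0)
		return 0;

	while (len > 0 && n < cap) {
		uint64_t this_len = len < max_append ? len : max_append;

		out[n].offset = offset;
		out[n].len = this_len;
		offset += this_len;
		len -= this_len;
		n++;
	}
	return n;
}
```

Callers can then submit a write of any size, and only this one function needs to know about the device limit.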
2023-02-15btrfs: calculate file system wide queue limit for zoned modeChristoph Hellwig3-28/+30
To be able to split a write into properly sized zone append commands, we need a queue_limits structure that contains the least common denominator suitable for all devices. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
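As a rough illustration, a file-system-wide limit that suits every device is the most restrictive of the per-device limits. The sketch below assumes a single numeric limit per device and a hypothetical fs_wide_limit() helper; an actual queue_limits structure carries many fields that each need to be combined.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the smallest per-device limit, i.e. the largest size that every
 * device can accept. With no devices the result stays at UINT32_MAX. */
static uint32_t fs_wide_limit(const uint32_t *per_device_max, size_t ndev)
{
	uint32_t limit = UINT32_MAX;

	for (size_t i = 0; i < ndev; i++) {
		if (per_device_max[i] < limit)
			limit = per_device_max[i];
	}
	return limit;
}
```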