summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)AuthorFilesLines
2023-04-12xfs: create traced helper to get extra perag referencesDarrick J. Wong9-20/+22
There are a few places in the XFS codebase where a caller has either an active or a passive reference to a perag structure and wants to give a passive reference to some other piece of code. Btree cursor creation and inode walks are good examples of this. Replace the open-coded logic with a helper to do this. The new function adds a few safeguards -- it checks that there's at least one reference to the perag structure passed in, and it records the refcount bump in the ftrace information. This makes it much easier to debug perag refcounting problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: give xfs_refcount_intent its own perag referenceDarrick J. Wong3-23/+50
Give the xfs_refcount_intent a passive reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. Any space being modified by a refcount intent is already allocated, so we need to be able to operate even if the AG is being shrunk or offlined. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: give xfs_rmap_intent its own perag referenceDarrick J. Wong3-21/+44
Give the xfs_rmap_intent a passive reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. The space we're (reverse) mapping is already allocated, so we need to be able to operate even if the AG is being shrunk or offlined. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: give xfs_extfree_intent its own perag referenceDarrick J. Wong3-22/+47
Give the xfs_extfree_intent an passive reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. The space being freed must already be allocated, so we need to able to run even if the AG is being offlined or shrunk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: pass per-ag references to xfs_free_extentDarrick J. Wong7-24/+28
Pass a reference to the per-AG structure to xfs_free_extent. Most callers already have one, so we can eliminate unnecessary lookups. The one exception to this is the EFI code, which the next patch will fix. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: give xfs_bmap_intent its own perag referenceDarrick J. Wong3-1/+33
Give the xfs_bmap_intent an active reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. Later, shrink will use these passive references to know if an AG is quiesced or not. The reason why we take a passive ref for a file mapping operation is simple: we're committing to some sort of action involving space in an AG, so we want to indicate our interest in that AG. The space is already allocated, so we need to be able to operate on AGs that are offline or being shrunk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-06xfs: remove xfs_filemap_map_pages() wrapperMatthew Wilcox (Oracle)1-16/+1
Patch series "Prevent ->map_pages from sleeping", v2. In preparation for a larger patch series which will handle (some, easy) page faults protected only by RCU, change the two filesystems which have sleeping locks to not take them and hold the RCU lock around calls to ->map_page to prevent other filesystems from adding sleeping locks. This patch (of 3): XFS doesn't actually need to be holding the XFS_MMAPLOCK_SHARED to do this. filemap_map_pages() cannot bring new folios into the page cache and the folio lock is taken during filemap_map_pages() which provides sufficient protection against a truncation or hole punch. Link: https://lkml.kernel.org/r/20230327174515.1811532-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230327174515.1811532-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-03fs: add FMODE_DIO_PARALLEL_WRITE flagJens Axboe1-1/+2
Some filesystems support multiple threads writing to the same file with O_DIRECT without requiring exclusive access to it. io_uring can use this hint to avoid serializing dio writes to this inode, instead allowing them to run in parallel. XFS and ext4 both fall into this category, so set the flag for both of them. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-24xfs: fix mismerged tracepointsDarrick J. Wong1-4/+4
At some point in between sending this patch to the list and merging it into for-next, the tracepoints got all mixed up because I've over-reliant on automated tools not sucking. The end result is that the tracepoints are all wrong, so fix them. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-24xfs: clear incore AGFL_RESET state if it's not neededDarrick J. Wong1-0/+2
Prior to commit 7ac2ff8bb371, when we loaded the incore perag structure with information from the AGF header, we would set or clear the pagf_agfl_reset field based on whether or not the AGFL list was misaligned within the block. IOWs, it's an incore state bit that's supposed to cache something in the ondisk metadata. Therefore, the code still needs to support clearing the incore bit if (somehow) the AGFL were to correct itself. It turns out that xfs_repair does exactly this -- phase 4 loads the AGF to scan the rmapbt for corrupt records, which can set NEEDS_AGFL_RESET. The scan unsets AGF_INIT but doesn't unset NEEDS_AGFL_RESET. Phase 5 totally rewrites the AGFL and fixes the alignment problem, didn't clear NEEDS_AGFL_RESET historically, and reloads the perag state to fix the freelist. This results in the AGFL being reset based on stale data, which then causes the new AGFL blocks to be leaked. A subsequent xfs_repair -n then complains about the leaks. One could argue that phase 5 ought to clear this bit directly when it reloads the perag AGF data after rewriting the AGFL, but libxfs used to handle this for us, so it should go back to doing that. Found by fuzzing flfirst = ones in xfs/352. Fixes: 7ac2ff8bb371 ("xfs: perags need atomic operational state") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-24xfs: pass the correct cursor to xfs_iomap_prealloc_sizeDarrick J. Wong1-1/+4
In xfs_buffered_write_iomap_begin, @icur is the iext cursor for the data fork and @ccur is the cursor for the cow fork. Pass in whichever cursor corresponds to allocfork, because otherwise the xfs_iext_prev_extent call can use the data fork cursor to walk off the end of the cow fork structure. Best case it returns the wrong results, worst case it does this: stack segment: 0000 [#1] PREEMPT SMP CPU: 2 PID: 3141909 Comm: fsstress Tainted: G W 6.3.0-rc2-xfsx #6.3.0-rc2 7bf5cc2e98997627cae5c930d890aba3aeec65dd Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014 RIP: 0010:xfs_iext_prev+0x71/0x150 [xfs] RSP: 0018:ffffc90002233aa8 EFLAGS: 00010297 RAX: 000000000000000f RBX: 000000000000000e RCX: 000000000000000c RDX: 0000000000000002 RSI: 000000000000000e RDI: ffff8883d0019ba0 RBP: 989642409af8a7a7 R08: ffffea0000000001 R09: 0000000000000002 R10: 0000000000000000 R11: 000000000000000c R12: ffffc90002233b00 R13: ffff8883d0019ba0 R14: 989642409af8a6bf R15: 000ffffffffe0000 FS: 00007fdf8115f740(0000) GS:ffff88843fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fdf8115e000 CR3: 0000000357256000 CR4: 00000000003506e0 Call Trace: <TASK> xfs_iomap_prealloc_size.constprop.0.isra.0+0x1a6/0x410 [xfs 619a268fb2406d68bd34e007a816b27e70abc22c] xfs_buffered_write_iomap_begin+0xa87/0xc60 [xfs 619a268fb2406d68bd34e007a816b27e70abc22c] iomap_iter+0x132/0x2f0 iomap_file_buffered_write+0x92/0x330 xfs_file_buffered_write+0xb1/0x330 [xfs 619a268fb2406d68bd34e007a816b27e70abc22c] vfs_write+0x2eb/0x410 ksys_write+0x65/0xe0 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Found by xfs/538 in alwayscow mode, but this doesn't seem particular to that test. Fixes: 590b16516ef3 ("xfs: refactor xfs_iomap_prealloc_size") Actually-Fixes: 66ae56a53f0e ("xfs: introduce an always_cow mode") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-19xfs: test dir/attr hash when loading moduleDarrick J. Wong4-0/+680
Back in the 6.2-rc1 days, Eric Whitney reported a fstests regression in ext4 against generic/454. The cause of this test failure was the unfortunate combination of setting an xattr name containing UTF8 encoded emoji, an xattr hash function that accepted a char pointer with no explicit signedness, signed type extension of those chars to an int, and the 6.2 build tools maintainers deciding to mandate -funsigned-char across the board. As a result, the ondisk extended attribute structure written out by 6.1 and 6.2 were not the same. This discrepancy, in fact, had been noticeable if a filesystem with such an xattr were moved between any two architectures that don't employ the same signedness of a raw "char" declaration. The only reason anyone noticed is that x86 gcc defaults to signed, and no such -funsigned-char update was made to e2fsprogs, so e2fsck immediately started reporting data corruption. After a day and a half of discussing how to handle this use case (xattrs with bit 7 set anywhere in the name) without breaking existing users, Linus merged his own patch and didn't tell the maintainer. None of the ext4 developers realized this until AUTOSEL announced that the commit had been backported to stable. In the end, this problem could have been detected much earlier if there had been any useful tests of hash function(s) in use inside ext4 to make sure that they always produce the same outputs given the same inputs. The XFS dirent/xattr name hash takes a uint8_t*, so I don't think it's vulnerable to this problem. However, let's avoid all this drama by adding our own self test to check that the da hash produces the same outputs for a static pile of inputs on various platforms. This enables us to fix any breakage that may result in a controlled fashion. The buffer and test data are identical to the patches submitted to xfsprogs. Link: https://lore.kernel.org/linux-ext4/Y8bpkm3jA3bDm3eL@debian-BULLSEYE-live-builder-AMD64/ Link: https://lore.kernel.org/linux-xfs/ZBUKCRR7xvIqPrpX@destitution/T/#md38272cc684e2c0d61494435ccbb91f022e8dee4 Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-19xfs: add tracepoints for each of the externally visible allocatorsDarrick J. Wong2-0/+24
There are now five separate space allocator interfaces exposed to the rest of XFS for five different strategies to find space. Add tracepoints for each of them so that I can tell from a trace dump exactly which ones got called and what happened underneath them. Add a sixth so it's more obvious if an allocation actually happened. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-19xfs: walk all AGs if TRYLOCK passed to xfs_alloc_vextent_iterate_agsDarrick J. Wong1-1/+5
Callers of xfs_alloc_vextent_iterate_ags that pass in the TRYLOCK flag want us to perform a non-blocking scan of the AGs for free space. There are no ordering constraints for non-blocking AGF lock acquisition, so the scan can freely start over at AG 0 even when minimum_agno > 0. This manifests fairly reliably on xfs/294 on 6.3-rc2 with the parent pointer patchset applied and the realtime volume enabled. I observed the following sequence as part of an xfs_dir_createname call: 0. Fragment the free space, then allocate nearly all the free space in all AGs except AG 0. 1. Create a directory in AG 2 and let it grow for a while. 2. Try to allocate 2 blocks to expand the dirent part of a directory. The space will be allocated out of AG 0, but the allocation will not be contiguous. This (I think) activates the LOWMODE allocator. 3. The bmapi call decides to convert from extents to bmbt format and tries to allocate 1 block. This allocation request calls xfs_alloc_vextent_start_ag with the inode number, which starts the scan at AG 2. We ignore AG 0 (with all its free space) and instead scrape AG 2 and 3 for more space. We find one block, but this now kicks t_highest_agno to 3. 4. The createname call decides it needs to split the dabtree. It tries to allocate even more space with xfs_alloc_vextent_start_ag, but now we're constrained to AG 3, and we don't find the space. The createname returns ENOSPC and the filesystem shuts down. This change fixes the problem by making the trylock scan wrap around to AG 0 if it doesn't like the AGs that it finds. Since the current transaction itself holds AGF 0, the trylock of AGF 0 will succeed, and we take space from the AG that has plenty. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-16xfs: try to idiot-proof the allocatorsDarrick J. Wong1-0/+13
In porting his development branch to 6.3-rc1, yours truly has repeatedly screwed up the args->pag being fed to the xfs_alloc_vextent* functions. Add some debugging assertions to test the preconditions required of the callers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-06fs: drop unused posix acl handlersChristian Brauner1-4/+0
Remove struct posix_acl_{access,default}_handler for all filesystems that don't depend on the xattr handler in their inode->i_op->listxattr() method in any way. There's nothing more to do than to simply remove the handler. It's been effectively unused ever since we introduced the new posix acl api. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-06xfs: fix off-by-one-block in xfs_discard_folio()Dave Chinner1-7/+14
The recent writeback corruption fixes changed the code in xfs_discard_folio() to calculate a byte range to for punching delalloc extents. A mistake was made in using round_up(pos) for the end offset, because when pos points at the first byte of a block, it does not get rounded up to point to the end byte of the block. hence the punch range is short, and this leads to unexpected behaviour in certain cases in xfs_bmap_punch_delalloc_range. e.g. pos = 0 means we call xfs_bmap_punch_delalloc_range(0,0), so there is no previous extent and it rounds up the punch to the end of the delalloc extent it found at offset 0, not the end of the range given to xfs_bmap_punch_delalloc_range(). Fix this by handling the zero block offset case correctly. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=217030 Link: https://lore.kernel.org/linux-xfs/Y+vOfaxIWX1c%2Fyy9@bfoster/ Fixes: 7348b322332d ("xfs: xfs_bmap_punch_delalloc_range() should take a byte range") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Found-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-06xfs: quotacheck failure can race with background inode inactivationDave Chinner1-14/+26
The background inode inactivation can attached dquots to inodes, but this can race with a foreground quotacheck failure that leads to disabling quotas and freeing the mp->m_quotainfo structure. The background inode inactivation then tries to allocate a quota, tries to dereference mp->m_quotainfo, and crashes like so: XFS (loop1): Quotacheck: Unsuccessful (Error -5): Disabling quotas. xfs filesystem being mounted at /root/syzkaller.qCVHXV/0/file0 supports timestamps until 2038 (0x7fffffff) BUG: kernel NULL pointer dereference, address: 00000000000002a8 .... CPU: 0 PID: 161 Comm: kworker/0:4 Not tainted 6.2.0-c9c3395d5e3d #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: xfs-inodegc/loop1 xfs_inodegc_worker RIP: 0010:xfs_dquot_alloc+0x95/0x1e0 .... Call Trace: <TASK> xfs_qm_dqread+0x46/0x440 xfs_qm_dqget_inode+0x154/0x500 xfs_qm_dqattach_one+0x142/0x3c0 xfs_qm_dqattach_locked+0x14a/0x170 xfs_qm_dqattach+0x52/0x80 xfs_inactive+0x186/0x340 xfs_inodegc_worker+0xd3/0x430 process_one_work+0x3b1/0x960 worker_thread+0x52/0x660 kthread+0x161/0x1a0 ret_from_fork+0x29/0x50 </TASK> .... Prevent this race by flushing all the queued background inode inactivations pending before purging all the cached dquots when quotacheck fails. Reported-by: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-01Merge tag 'xfs-6.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds36-1243/+1536
Pull moar xfs updates from Darrick Wong: "This contains a fix for a deadlock in the allocator. It continues the slow march towards being able to offline AGs, and it refactors the interface to the xfs allocator to be less indirection happy. Summary: - Fix a deadlock in the free space allocator due to the AG-walking algorithm forgetting to follow AG-order locking rules - Make the inode allocator prefer existing free inodes instead of failing to allocate new inode chunks when free space is low - Set minleft correctly when setting allocator parameters for bmap changes - Fix uninitialized variable access in the getfsmap code - Make a distinction between active and passive per-AG structure references. For now, active references are taken to perform some work in an AG on behalf of a high level operation; passive references are used by lower level code to finish operations started by other threads. Eventually this will become part of online shrink - Split out all the different allocator strategies into separate functions to move us away from design antipattern of filling out a huge structure for various differentish things and issuing a single function multiplexing call - Various cleanups in the filestreams allocator code, which we might very well want to deprecate instead of continuing - Fix a bug with the agi rotor code that was introduced earlier in this series" * tag 'xfs-6.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits) xfs: restore old agirotor behavior xfs: fix uninitialized variable access xfs: refactor the filestreams allocator pick functions xfs: return a referenced perag from filestreams allocator xfs: pass perag to filestreams tracing xfs: use for_each_perag_wrap in xfs_filestream_pick_ag xfs: track an active perag reference in filestreams xfs: factor out MRU hit case in xfs_filestream_select_ag xfs: remove xfs_filestream_select_ag() longest extent check xfs: merge new filestream AG selection into xfs_filestream_select_ag() xfs: merge filestream AG lookup into xfs_filestream_select_ag() xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c xfs: use xfs_bmap_longest_free_extent() in filestreams xfs: get rid of notinit from xfs_bmap_longest_free_extent xfs: factor out filestreams from xfs_bmap_btalloc_nullfb xfs: convert trim to use for_each_perag_range xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker xfs: move the minimum agno checks into xfs_alloc_vextent_check_args xfs: fold xfs_alloc_ag_vextent() into callers xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno() ...
2023-02-27xfs: restore old agirotor behaviorDarrick J. Wong1-1/+2
Prior to the removal of xfs_ialloc_next_ag, we would increment the agi rotor and return the *old* value. atomic_inc_return returns the new value, which causes mkfs to allocate the root directory in AG 1. Put back the old behavior (at least for mkfs) by subtracting 1 here. Fixes: 20a5eab49d35 ("xfs: convert xfs_ialloc_next_ag() to an atomic") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-02-24Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-23Merge tag 'xfs-6.3-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds18-410/+375
Pull xfs updates from Darrick Wong: "There's a couple of bug fixes, some cleanups for inconsistent variable names and reduction of struct boxing and unboxing in the logging code. More work is pending, which will begin reworking allocation group lifetimes and finally replace confusing indirect calls to the allocator with actual ... function calls. But I want to let that experience another week of testing. Summary: - Eliminate repeated boxing and unboxing of log item parameters - Clean up some confusing variable names in the log item code - Fix a deadlock when doing unwritten extent conversion that causes a bmbt split when there are sustained memory shortages and the worker pool runs out of worker threads - Fix the panic_mask debug knob not being able to trigger on verifier errors - Constify kobj_type objects" * tag 'xfs-6.3-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: revert commit 8954c44ff477 xfs: make kobj_type structures constant xfs: allow setting full range of panic tags xfs: don't use BMBT btree split workers for IO completion xfs: fix confusing variable names in xfs_refcount_item.c xfs: pass refcount intent directly through the log intent code xfs: fix confusing variable names in xfs_rmap_item.c xfs: pass rmap space mapping directly through the log intent code xfs: fix confusing xfs_extent_item variable names xfs: pass xfs_extent_free_item directly through the log intent code xfs: fix confusing variable names in xfs_bmap_item.c xfs: pass the xfs_bmbt_irec directly through the log intent code xfs: use strscpy() to instead of strncpy()
2023-02-23Merge tag 'iomap-6.3-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-2/+2
Pull iomap updates from Darrick Wong: "This is mostly rearranging things to make life easier for gfs2, nothing all that mindblowing for this release. - Change when the iomap page_done function is called so that we still have a locked folio in the success case. This fixes a writeback race in gfs2 - Change when the iomap page_prepare function is called so that gfs2 can recover from OOM scenarios more gracefully - Rename the iomap page_ops to folio_ops, since they operate on folios now" * tag 'iomap-6.3-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: Rename page_ops to folio_ops iomap: Rename page_prepare handler to get_folio iomap: Add __iomap_get_folio helper iomap/gfs2: Get page in page_prepare handler iomap: Add iomap_get_folio helper iomap: Rename page_done handler to put_folio iomap/gfs2: Unlock and put folio in page_done handler iomap: Add __iomap_put_folio helper
2023-02-20Merge tag 'fs.idmapped.v6.3' of ↵Linus Torvalds17-90/+89
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs idmapping updates from Christian Brauner: - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastucture in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs. This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap. Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments. Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings. We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements. In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs. - Enable idmapped mounts for tmpfs and fulfill a longstanding request. A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this. However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up. As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests. * tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits) shmem: support idmapped mounts for tmpfs fs: move mnt_idmap fs: port vfs{g,u}id helpers to mnt_idmap fs: port fs{g,u}id helpers to mnt_idmap fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap fs: port i_{g,u}id_{needs_}update() to mnt_idmap quota: port to mnt_idmap fs: port privilege checking helpers to mnt_idmap fs: port inode_owner_or_capable() to mnt_idmap fs: port inode_init_owner() to mnt_idmap fs: port acl to mnt_idmap fs: port xattr to mnt_idmap fs: port ->permission() to pass mnt_idmap fs: port ->fileattr_set() to pass mnt_idmap fs: port ->set_acl() to pass mnt_idmap fs: port ->get_acl() to pass mnt_idmap fs: port ->tmpfile() to pass mnt_idmap fs: port ->rename() to pass mnt_idmap fs: port ->mknod() to pass mnt_idmap fs: port ->mkdir() to pass mnt_idmap ...
2023-02-15xfs: fix uninitialized variable accessDarrick J. Wong1-0/+1
If the end position of a GETFSMAP query overlaps an allocated space and we're using the free space info to generate fsmap info, the akeys information gets fed into the fsmap formatter with bad results. Zero-init the space. Reported-by: syzbot+090ae72d552e6bd93cfe@syzkaller.appspotmail.com Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: refactor the filestreams allocator pick functionsDave Chinner2-132/+145
Now that the filestreams allocator is largely rewritten, restructure the main entry point and pick function to seperate out the different operations cleanly. The MRU lookup function should not handle the start AG selection on MRU lookup failure, and nor should the pick function handle building the association that is inserted into the MRU. This leaves the filestreams allocator fairly clean and easy to understand, returning to the caller with an active perag reference and a target block to allocate at. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: return a referenced perag from filestreams allocatorDave Chinner2-47/+85
Now that the filestreams AG selection tracks active perags, we need to return an active perag to the core allocator code. This is because the file allocation the filestreams code will run are AG specific allocations and so need to pin the AG until the allocations complete. We cannot rely on the filestreams item reference to do this - the filestreams association can be torn down at any time, hence we need to have a separate reference for the allocation process to pin the AG after it has been selected. This means there is some perag juggling in allocation failure fallback paths as they will do all AG scans in the case the AG specific allocation fails. Hence we need to track the perag reference that the filestream allocator returned to make sure we don't leak it on repeated allocation failure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: pass perag to filestreams tracingDave Chinner3-42/+25
Pass perags instead of raw ag numbers, avoiding the need for the special peek function for the tracing code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use for_each_perag_wrap in xfs_filestream_pick_agDave Chinner1-60/+41
xfs_filestream_pick_ag() is now ready to rework to use for_each_perag_wrap() for iterating the perags during the AG selection scan. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: track an active perag reference in filestreamsDave Chinner1-57/+43
Rather than just track the agno of the reference, track a referenced perag pointer instead. This will allow active filestreams to prevent AGs from going away until the filestreams have been torn down. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: factor out MRU hit case in xfs_filestream_select_agDave Chinner1-50/+83
Because it now stands out like a sore thumb. Factoring out this case starts the process of simplifying xfs_filestream_select_ag() again. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: remove xfs_filestream_select_ag() longest extent checkDave Chinner1-17/+1
Picking a new AG checks the longest free extent in the AG is valid, so there's no need to repeat the check in xfs_filestream_select_ag(). Remove it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: merge new filestream AG selection into xfs_filestream_select_ag()Dave Chinner1-72/+40
This is largely a wrapper around xfs_filestream_pick_ag() that repeats a lot of the lookups that we just merged back into xfs_filestream_select_ag() from the lookup code. Merge the xfs_filestream_new_ag() code back into _select_ag() to get rid of all the unnecessary logic. Indeed, this makes it obvious that if we have no parent inode, the filestreams allocator always selects AG 0 regardless of whether it is fit for purpose or not. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: merge filestream AG lookup into xfs_filestream_select_ag()Dave Chinner1-114/+70
The lookup currently either returns the cached filestream AG or it calls xfs_filestreams_select_lengths() to looks up a new AG. This has verify the AG that is selected, so we end up doing "select a new AG loop in a couple of places when only one really is needed. Merge the initial lookup functionality with the length selection so that we only need to do a single pick loop on lookup or verification failure. This undoes a lot of the factoring that enabled the selection to be moved over to the filestreams code. It makes xfs_filestream_select_ag() an awful messier, but it has to be made worse before it can get better in future patches... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.cDave Chinner4-87/+115
xfs_bmap_btalloc_filestreams() calls two filestreams functions to select the AG to allocate from. Both those functions end up in the same selection function that iterates all AGs multiple times. Worst case, xfs_bmap_btalloc_filestreams() can iterate all AGs 4 times just to select the initial AG to allocate in. Move the AG selection to fs/xfs/xfs_filestreams.c as a single interface so that the inefficient AG interation is contained entirely within the filestreams code. This will allow the implementation to be simplified and made more efficient in future patches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_bmap_longest_free_extent() in filestreamsDave Chinner3-15/+11
The code in xfs_bmap_longest_free_extent() is open coded in xfs_filestream_pick_ag(). Export xfs_bmap_longest_free_extent and call it from the filestreams code instead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: get rid of notinit from xfs_bmap_longest_free_extentDave Chinner1-48/+37
It is only set if reading the AGF gets a EAGAIN error. Just return the EAGAIN error and handle that error in the callers. This means we can remove the not_init parameter from xfs_bmap_select_minlen(), too, because the use of not_init there is pessimistic. If we can't read the agf, it won't increase blen. The only time we actually care whether we checked all the AGFs for contiguous free space is when the best length is less than the minimum allocation length. If not_init is set, then we ignore blen and set the minimum alloc length to the absolute minimum, not the best length we know already is present. However, if blen is less than the minimum we're going to ignore it anyway, regardless of whether we scanned all the AGFs or not. Hence not_init can go away, because we only use if blen is good from the scanned AGs otherwise we ignore it altogether and use minlen. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: factor out filestreams from xfs_bmap_btalloc_nullfbDave Chinner1-71/+96
There's many if (filestreams) {} else {} branches in this function. Split it out into a filestreams specific function so that we can then work directly on cleaning up the filestreams code without impacting the rest of the allocation algorithms. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: convert trim to use for_each_perag_rangeDave Chinner1-27/+23
To convert it to using active perag references and hence make it shrink safe. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walkerDave Chinner2-61/+57
Now that the AG iteration code in the core allocation code has been cleaned up, we can easily convert it to use a for_each_perag..() variant to use active references and skip AGs that it can't get active references on. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: move the minimum agno checks into xfs_alloc_vextent_check_argsDave Chinner1-55/+33
All of the allocation functions now extract the minimum allowed AG from the transaction and then use it in some way. The allocation functions that are restricted to a single AG all check if the AG requested can be allocated from and return an error if so. These all set args->agno appropriately. All the allocation functions that iterate AGs use it to calculate the scan start AG. args->agno is not set until the iterator starts walking AGs. Hence we can easily set up a conditional check against the minimum AG allowed in xfs_alloc_vextent_check_args() based on whether args->agno contains NULLAGNUMBER or not and move all the repeated setup code to xfs_alloc_vextent_check_args(), further simplifying the allocation functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: fold xfs_alloc_ag_vextent() into callersDave Chinner4-106/+29
We don't need the multiplexing xfs_alloc_ag_vextent() provided anymore - we can just call the exact/near/size variants directly. This allows us to remove args->type completely and stop using args->fsbno as an input to the allocator algorithms. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()Dave Chinner1-59/+63
Move it from xfs_alloc_ag_vextent() so we can get rid of that layer. Rename xfs_alloc_vextent_set_fsbno() to xfs_alloc_vextent_finish() to indicate that it's function is finishing off the allocation that we've run now that it contains much more functionality. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_alloc_vextent_prepare()Dave Chinner1-44/+76
Now that we have wrapper functions for each type of allocation we can ask for, we can start unravelling xfs_alloc_ag_vextent(). That is essentially just a prepare stage, the allocation multiplexer and a post-allocation accounting step is the allocation proceeded. The current xfs_alloc_vextent*() wrappers all have a prepare stage, the allocation operation and a post-allocation accounting step. We can consolidate this by moving the AG alloc prep code into the wrapper functions, the accounting code in the wrapper accounting functions, and cut out the multiplexer layer entirely. This patch consolidates the AG preparation stage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_alloc_vextent_exact_bno()Dave Chinner6-26/+72
Two of the callers to xfs_alloc_vextent_this_ag() actually want exact block number allocation, not anywhere-in-ag allocation. Split this out from _this_ag() as a first class citizen so no external extent allocation code needs to care about args->type anymore. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_alloc_vextent_near_bno()Dave Chinner6-54/+55
The remaining callers of xfs_alloc_vextent() are all doing NEAR_BNO allocations. We can replace that function with a new xfs_alloc_vextent_near_bno() function that does this explicitly. We also multiplex NEAR_BNO allocations through xfs_alloc_vextent_this_ag via args->type. Replace all of these with direct calls to xfs_alloc_vextent_near_bno(), too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_alloc_vextent_start_bno() where appropriateDave Chinner4-38/+51
Change obvious callers of single AG allocation to use xfs_alloc_vextent_start_bno(). Callers no long need to specify XFS_ALLOCTYPE_START_BNO, and so the type can be driven inward and removed. While doing this, also pass the allocation target fsb as a parameter rather than encoding it in args->fsbno. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_alloc_vextent_first_ag() where appropriateDave Chinner3-32/+42
Change obvious callers of single AG allocation to use xfs_alloc_vextent_first_ag(). This gets rid of XFS_ALLOCTYPE_FIRST_AG as the type used within xfs_alloc_vextent_first_ag() during iteration is _THIS_AG. Hence we can remove the setting of args->type from all the callers of _first_ag() and remove the alloctype. While doing this, pass the allocation target fsb as a parameter rather than encoding it in args->fsbno. This starts the process of making args->fsbno an output only variable rather than input/output. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: factor xfs_bmap_btalloc()Dave Chinner1-137/+196
There are several different contexts xfs_bmap_btalloc() handles, and large chunks of the code execute independent allocation contexts. Try to untangle this mess a bit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_alloc_vextent_this_ag() where appropriateDave Chinner9-53/+74
Change obvious callers of single AG allocation to use xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the callers, too, so that callers with active references don't need to do new lookups just for an allocation in a context that already has a perag reference. The only remaining caller that does single AG allocation through xfs_alloc_vextent() is xfs_bmap_btalloc() with XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before it can be converted cleanly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>