summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_inode_item.c
AgeCommit message (Collapse)AuthorFilesLines
2024-07-04xfs: skip flushing log items during pushDave Chinner1-1/+5
The AIL pushing code spends a huge amount of time skipping over items that are already marked as flushing. It is not uncommon to see hundreds of thousands of items skipped every second due to inode clustering marking all the inodes in a cluster as flushing when the first one is flushed. However, to discover an item is already flushing and should be skipped we have to call the iop_push() method for it to try to flush the item. For inodes (where this matters most), we have to first check that inode is flushable first. We can optimise this overhead away by tracking whether the log item is flushing internally. This allows xfsaild_push() to check the log item directly for flushing state and immediately skip the log item. Whilst this doesn't remove the CPU cache misses for loading the log item, it does avoid the overhead of an indirect function call and the cache misses involved in accessing inode and backing cluster buffer structures to determine flushing state. When trying to flush hundreds of thousands of inodes each second, this CPU overhead saving adds up quickly. It's so noticeable that the biggest issue with pushing on the AIL on fast storage becomes the 10ms back-off wait when we hit enough pinned buffers to break out of the push loop but not enough for the AIL pushing to be considered stuck. This limits the xfsaild to about 70% total CPU usage, and on fast storage this isn't enough to keep the storage 100% busy. The xfsaild will block on IO submission on slow storage and so is self throttling - it does not need a backoff in the case where we are really just breaking out of the walk to submit the IO we have gathered. Further with no backoff we don't need to gather huge delwri lists to mitigate the impact of backoffs, so we can submit IO more frequently and reduce the time log items spend in flushing state by breaking out of the item push loop once we've gathered enough IO to batch submission effectively. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-02xfs: verify buffer, inode, and dquot items every tx commitDarrick J. Wong1-0/+32
generic/388 has an annoying tendency to fail like this during log recovery: XFS (sda4): Unmounting Filesystem 435fe39b-82b6-46ef-be56-819499585130 XFS (sda4): Mounting V5 Filesystem 435fe39b-82b6-46ef-be56-819499585130 XFS (sda4): Starting recovery (logdev: internal) 00000000: 49 4e 81 b6 03 02 00 00 00 00 00 07 00 00 00 07 IN.............. 00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 10 ................ 00000020: 35 9a 8b c1 3e 6e 81 00 35 9a 8b c1 3f dc b7 00 5...>n..5...?... 00000030: 35 9a 8b c1 3f dc b7 00 00 00 00 00 00 3c 86 4f 5...?........<.O 00000040: 00 00 00 00 00 00 02 f3 00 00 00 00 00 00 00 00 ................ 00000050: 00 00 1f 01 00 00 00 00 00 00 00 02 b2 74 c9 0b .............t.. 00000060: ff ff ff ff d7 45 73 10 00 00 00 00 00 00 00 2d .....Es........- 00000070: 00 00 07 92 00 01 fe 30 00 00 00 00 00 00 00 1a .......0........ 00000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000090: 35 9a 8b c1 3b 55 0c 00 00 00 00 00 04 27 b2 d1 5...;U.......'.. 000000a0: 43 5f e3 9b 82 b6 46 ef be 56 81 94 99 58 51 30 C_....F..V...XQ0 XFS (sda4): Internal error Bad dinode after recovery at line 539 of file fs/xfs/xfs_inode_item_recover.c. Caller xlog_recover_items_pass2+0x4e/0xc0 [xfs] CPU: 0 PID: 2189311 Comm: mount Not tainted 6.9.0-rc4-djwx #rc4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4f/0x60 xfs_corruption_error+0x90/0xa0 xlog_recover_inode_commit_pass2+0x5f1/0xb00 xlog_recover_items_pass2+0x4e/0xc0 xlog_recover_commit_trans+0x2db/0x350 xlog_recovery_process_trans+0xab/0xe0 xlog_recover_process_data+0xa7/0x130 xlog_do_recovery_pass+0x398/0x840 xlog_do_log_recovery+0x62/0xc0 xlog_do_recover+0x34/0x1d0 xlog_recover+0xe9/0x1a0 xfs_log_mount+0xff/0x260 xfs_mountfs+0x5d9/0xb60 xfs_fs_fill_super+0x76b/0xa30 get_tree_bdev+0x124/0x1d0 vfs_get_tree+0x17/0xa0 path_mount+0x72b/0xa90 __x64_sys_mount+0x112/0x150 do_syscall_64+0x49/0x100 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> XFS (sda4): Corruption detected. Unmount and run xfs_repair XFS (sda4): Metadata corruption detected at xfs_dinode_verify.part.0+0x739/0x920 [xfs], inode 0x427b2d1 XFS (sda4): Filesystem has been shut down due to log error (0x2). XFS (sda4): Please unmount the filesystem and rectify the problem(s). XFS (sda4): log mount/recovery failed: error -117 XFS (sda4): log mount failed This inode log item recovery failing the dinode verifier after replaying the contents of the inode log item into the ondisk inode. Looking back into what the kernel was doing at the time of the fs shutdown, a thread was in the middle of running a series of transactions, each of which committed changes to the inode. At some point in the middle of that chain, an invalid (at least according to the verifier) change was committed. Had the filesystem not shut down in the middle of the chain, a subsequent transaction would have corrected the invalid state and nobody would have noticed. But that's not what happened here. Instead, the invalid inode state was committed to the ondisk log, so log recovery tripped over it. The actual defect here was an overzealous inode verifier, which was fixed in a separate patch. This patch adds some transaction precommit functions for CONFIG_XFS_DEBUG=y mode so that we can detect these kinds of transient errors at transaction commit time, where it's much easier to find the root cause. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-19xfs: Replace xfs_isilocked with xfs_assert_ilockedMatthew Wilcox (Oracle)1-2/+2
To use the new rwsem_assert_held()/rwsem_assert_held_write(), we can't use the existing ASSERT macro. Add a new xfs_assert_ilocked() and convert all the callers. Fix an apparent bug in xfs_isilocked(): If the caller specifies XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL, xfs_assert_ilocked() will check both the IOLOCK and the ILOCK are held for write. xfs_isilocked() only checked that the ILOCK was held for write. xfs_assert_ilocked() is always on, even if DEBUG or XFS_WARN aren't defined. It's a cheap check, so I don't think it's worth defining it away. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: convert kmem_free() for kvmalloc users to kvfree()Dave Chinner1-1/+1
Start getting rid of kmem_free() by converting all the cases where memory can come from vmalloc interfaces to calling kvfree() directly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-29xfs: make if_data a void pointerChristoph Hellwig1-6/+4
The xfs_ifork structure currently has a union of the if_root void pointer and the if_data char pointer. In either case it is an opaque pointer that depends on the fork format. Replace the union with a single if_data void pointer as that is what almost all callers want. Only the symlink NULL termination code in xfs_init_local_fork actually needs a new local variable now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-15xfs: initialise di_crc in xfs_log_dinodeDave Chinner1-0/+3
Alexander Potapenko report that KMSAN was issuing these warnings: kmalloc-ed xlog buffer of size 512 : ffff88802fc26200 kmalloc-ed xlog buffer of size 368 : ffff88802fc24a00 kmalloc-ed xlog buffer of size 648 : ffff88802b631000 kmalloc-ed xlog buffer of size 648 : ffff88802b632800 kmalloc-ed xlog buffer of size 648 : ffff88802b631c00 xlog_write_iovec: copying 12 bytes from ffff888017ddbbd8 to ffff88802c300400 xlog_write_iovec: copying 28 bytes from ffff888017ddbbe4 to ffff88802c30040c xlog_write_iovec: copying 68 bytes from ffff88802fc26274 to ffff88802c300428 xlog_write_iovec: copying 188 bytes from ffff88802fc262bc to ffff88802c30046c ===================================================== BUG: KMSAN: uninit-value in xlog_write_iovec fs/xfs/xfs_log.c:2227 BUG: KMSAN: uninit-value in xlog_write_full fs/xfs/xfs_log.c:2263 BUG: KMSAN: uninit-value in xlog_write+0x1fac/0x2600 fs/xfs/xfs_log.c:2532 xlog_write_iovec fs/xfs/xfs_log.c:2227 xlog_write_full fs/xfs/xfs_log.c:2263 xlog_write+0x1fac/0x2600 fs/xfs/xfs_log.c:2532 xlog_cil_write_chain fs/xfs/xfs_log_cil.c:918 xlog_cil_push_work+0x30f2/0x44e0 fs/xfs/xfs_log_cil.c:1263 process_one_work kernel/workqueue.c:2630 process_scheduled_works+0x1188/0x1e30 kernel/workqueue.c:2703 worker_thread+0xee5/0x14f0 kernel/workqueue.c:2784 kthread+0x391/0x500 kernel/kthread.c:388 ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242 Uninit was created at: slab_post_alloc_hook+0x101/0xac0 mm/slab.h:768 slab_alloc_node mm/slub.c:3482 __kmem_cache_alloc_node+0x612/0xae0 mm/slub.c:3521 __do_kmalloc_node mm/slab_common.c:1006 __kmalloc+0x11a/0x410 mm/slab_common.c:1020 kmalloc ./include/linux/slab.h:604 xlog_kvmalloc fs/xfs/xfs_log_priv.h:704 xlog_cil_alloc_shadow_bufs fs/xfs/xfs_log_cil.c:343 xlog_cil_commit+0x487/0x4dc0 fs/xfs/xfs_log_cil.c:1574 __xfs_trans_commit+0x8df/0x1930 fs/xfs/xfs_trans.c:1017 xfs_trans_commit+0x30/0x40 fs/xfs/xfs_trans.c:1061 xfs_create+0x15af/0x2150 fs/xfs/xfs_inode.c:1076 xfs_generic_create+0x4cd/0x1550 fs/xfs/xfs_iops.c:199 xfs_vn_create+0x4a/0x60 fs/xfs/xfs_iops.c:275 lookup_open fs/namei.c:3477 open_last_lookups fs/namei.c:3546 path_openat+0x29ac/0x6180 fs/namei.c:3776 do_filp_open+0x24d/0x680 fs/namei.c:3809 do_sys_openat2+0x1bc/0x330 fs/open.c:1440 do_sys_open fs/open.c:1455 __do_sys_openat fs/open.c:1471 __se_sys_openat fs/open.c:1466 __x64_sys_openat+0x253/0x330 fs/open.c:1466 do_syscall_x64 arch/x86/entry/common.c:51 do_syscall_64+0x4f/0x140 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b arch/x86/entry/entry_64.S:120 Bytes 112-115 of 188 are uninitialized Memory access of size 188 starts at ffff88802fc262bc This is caused by the struct xfs_log_dinode not having the di_crc field initialised. Log recovery never uses this field (it is only present these days for on-disk format compatibility reasons) and so it's value is never checked so nothing in XFS has caught this. Further, none of the uninitialised memory access warning tools have caught this (despite catching other uninit memory accesses in the struct xfs_log_dinode back in 2017!) until recently. Alexander annotated the XFS code to get the dump of the actual bytes that were detected as uninitialised, and from that report it took me about 30s to realise what the issue was. The issue was introduced back in 2016 and every inode that is logged fails to initialise this field. This is no actual bad behaviour caused by this issue - I find it hard to even classify it as a bug... Reported-and-tested-by: Alexander Potapenko <glider@google.com> Fixes: f8d55aa0523a ("xfs: introduce inode log format object") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-11-09Merge tag 'xfs-6.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-1/+2
Pull xfs updates from Chandan Babu: - Realtime device subsystem: - Cleanup usage of xfs_rtblock_t and xfs_fsblock_t data types - Replace open coded conversions between rt blocks and rt extents with calls to static inline helpers - Replace open coded realtime geometry compuation and macros with helper functions - CPU usage optimizations for realtime allocator - Misc bug fixes associated with Realtime device - Allow read operations to execute while an FICLONE ioctl is being serviced - Misc bug fixes: - Alert user when xfs_droplink() encounters an inode with a link count of zero - Handle the case where the allocator could return zero extents when servicing an fallocate request * tag 'xfs-6.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (40 commits) xfs: allow read IO and FICLONE to run concurrently xfs: handle nimaps=0 from xfs_bmapi_write in xfs_alloc_file_space xfs: introduce protection for drop nlink xfs: don't look for end of extent further than necessary in xfs_rtallocate_extent_near() xfs: don't try redundant allocations in xfs_rtallocate_extent_near() xfs: limit maxlen based on available space in xfs_rtallocate_extent_near() xfs: return maximum free size from xfs_rtany_summary() xfs: invert the realtime summary cache xfs: simplify rt bitmap/summary block accessor functions xfs: simplify xfs_rtbuf_get calling conventions xfs: cache last bitmap block in realtime allocator xfs: use accessor functions for summary info words xfs: consolidate realtime allocation arguments xfs: create helpers for rtsummary block/wordcount computations xfs: use accessor functions for bitmap words xfs: create helpers for rtbitmap block/wordcount computations xfs: create a helper to handle logging parts of rt bitmap/summary blocks xfs: convert rt summary macros to helpers xfs: convert open-coded xfs_rtword_t pointer accesses to helper xfs: remove XFS_BLOCKWSIZE and XFS_BLOCKWMASK macros ...
2023-10-18xfs: convert to new timestamp accessorsJeff Layton1-2/+2
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-75-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18xfs: create a helper to compute leftovers of realtime extentsDarrick J. Wong1-1/+2
Create a helper to compute the misalignment between a file extent (xfs_extlen_t) and a realtime extent. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-07-24xfs: convert to ctime accessor functionsJeff Layton1-1/+1
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-80-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-04xfs: fix AGF vs inode cluster buffer deadlockDave Chinner1-0/+149
Lock order in XFS is AGI -> AGF, hence for operations involving inode unlinked list operations we always lock the AGI first. Inode unlinked list operations operate on the inode cluster buffer, so the lock order there is AGI -> inode cluster buffer. For O_TMPFILE operations, this now means the lock order set down in xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the unlinked ops are done before the directory modifications that may allocate space and lock the AGF. Unfortunately, we also now lock the inode cluster buffer when logging an inode so that we can attach the inode to the cluster buffer and pin it in memory. This creates a lock order of AGF -> inode cluster buffer in directory operations as we have to log the inode after we've allocated new space for it. This creates a lock inversion between the AGF and the inode cluster buffer. Because the inode cluster buffer is shared across multiple inodes, the inversion is not specific to individual inodes but can occur when inodes in the same cluster buffer are accessed in different orders. To fix this we need move all the inode log item cluster buffer interactions to the end of the current transaction. Unfortunately, xfs_trans_log_inode() calls are littered throughout the transactions with no thought to ordering against other items or locking. This makes it difficult to do anything that involves changing the call sites of xfs_trans_log_inode() to change locking orders. However, we do now have a mechanism that allows is to postpone dirty item processing to just before we commit the transaction: the ->iop_precommit method. This will be called after all the modifications are done and high level objects like AGI and AGF buffers have been locked and modified, thereby providing a mechanism that guarantees we don't lock the inode cluster buffer before those high level objects are locked. This change is largely moving the guts of xfs_trans_log_inode() to xfs_inode_item_precommit() and providing an extra flag context in the inode log item to track the dirty state of the inode in the current transaction. This also means we do a lot less repeated work in xfs_trans_log_inode() by only doing it once per transaction when all the work is done. Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-09-18xfs: remove the redundant word in commentZeng Heng1-1/+1
Just remove the redundant word "being" in comment. Signed-off-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-07-12xfs: replace inode fork size macros with functionsDarrick J. Wong1-2/+2
Replace the shouty macros here with typechecked helper functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-12xfs: replace XFS_IFORK_Q with a proper predicate functionDarrick J. Wong1-2/+2
Replace this shouty macro with a real C function that has a more descriptive name. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-10xfs: make inode attribute forks a permanent part of struct xfs_inodeDarrick J. Wong1-25/+25
Syzkaller reported a UAF bug a while back: ================================================================== BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958 CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted 5.15.0-0.30.3-20220406_1406 #3 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106 print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459 xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159 xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36 __vfs_getxattr+0xdf/0x13d fs/xattr.c:399 cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300 security_inode_need_killpriv+0x4c/0x97 security/security.c:1408 dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912 dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908 do_truncate+0xc3/0x1e0 fs/open.c:56 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 RIP: 0033:0x7f7ef4bb753d Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0 RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0 </TASK> Allocated by task 2953: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:434 [inline] __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467 kasan_slab_alloc include/linux/kasan.h:254 [inline] slab_post_alloc_hook mm/slab.h:519 [inline] slab_alloc_node mm/slub.c:3213 [inline] slab_alloc mm/slub.c:3221 [inline] kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226 kmem_cache_zalloc include/linux/slab.h:711 [inline] xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287 xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098 xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_setxattr+0x11b/0x177 fs/xattr.c:180 __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214 __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275 vfs_setxattr+0x154/0x33d fs/xattr.c:301 setxattr+0x216/0x29f fs/xattr.c:575 __do_sys_fsetxattr fs/xattr.c:632 [inline] __se_sys_fsetxattr fs/xattr.c:621 [inline] __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 Freed by task 2949: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track+0x1c/0x21 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free mm/kasan/common.c:328 [inline] __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374 kasan_slab_free include/linux/kasan.h:230 [inline] slab_free_hook mm/slub.c:1700 [inline] slab_free_freelist_hook mm/slub.c:1726 [inline] slab_free mm/slub.c:3492 [inline] kmem_cache_free+0xdc/0x3ce mm/slub.c:3508 xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773 xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822 xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413 xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684 xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_removexattr+0x106/0x16a fs/xattr.c:468 cap_inode_killpriv+0x24/0x47 security/commoncap.c:324 security_inode_killpriv+0x54/0xa1 security/security.c:1414 setattr_prepare+0x1a6/0x897 fs/attr.c:146 xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682 xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065 xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093 notify_change+0xae5/0x10a1 fs/attr.c:410 do_truncate+0x134/0x1e0 fs/open.c:64 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 The buggy address belongs to the object at ffff88802cec9188 which belongs to the cache xfs_ifork of size 40 The buggy address is located 20 bytes inside of 40-byte region [ffff88802cec9188, ffff88802cec91b0) The buggy address belongs to the page: page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2cec9 flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80 raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb ^ ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb ================================================================== The root cause of this bug is the unlocked access to xfs_inode.i_afp from the getxattr code paths while trying to determine which ILOCK mode to use to stabilize the xattr data. Unfortunately, the VFS does not acquire i_rwsem when vfs_getxattr (or listxattr) call into the filesystem, which means that getxattr can race with a removexattr that's tearing down the attr fork and crash: xfs_attr_set: xfs_attr_get: xfs_attr_fork_remove: xfs_ilock_attr_map_shared: xfs_idestroy_fork(ip->i_afp); kmem_cache_free(xfs_ifork_cache, ip->i_afp); if (ip->i_afp && ip->i_afp = NULL; xfs_need_iread_extents(ip->i_afp)) <KABOOM> ip->i_forkoff = 0; Regrettably, the VFS is much more lax about i_rwsem and getxattr than is immediately obvious -- not only does it not guarantee that we hold i_rwsem, it actually doesn't guarantee that we *don't* hold it either. The getxattr system call won't acquire the lock before calling XFS, but the file capabilities code calls getxattr with and without i_rwsem held to determine if the "security.capabilities" xattr is set on the file. Fixing the VFS locking requires a treewide investigation into every code path that could touch an xattr and what i_rwsem state it expects or sets up. That could take years or even prove impossible; fortunately, we can fix this UAF problem inside XFS. An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to ensure that i_forkoff is always zeroed before i_afp is set to null and changed the read paths to use smp_rmb before accessing i_forkoff and i_afp, which avoided these UAF problems. However, the patch author was too busy dealing with other problems in the meantime, and by the time he came back to this issue, the situation had changed a bit. On a modern system with selinux, each inode will always have at least one xattr for the selinux label, so it doesn't make much sense to keep incurring the extra pointer dereference. Furthermore, Allison's upcoming parent pointer patchset will also cause nearly every inode in the filesystem to have extended attributes. Therefore, make the inode attribute fork structure part of struct xfs_inode, at a cost of 40 more bytes. This patch adds a clunky if_present field where necessary to maintain the existing logic of xattr fork null pointer testing in the existing codebase. The next patch switches the logic over to XFS_IFORK_Q and it all goes away. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-05-04xfs: hide log iovec alignment constraintsDave Chinner1-18/+7
Callers currently have to round out the size of buffers to match the aligment constraints of log iovecs and xlog_write(). They should not need to know this detail, so introduce a new function to calculate the iovec length (for use in ->iop_size implementations). Also modify xlog_finish_iovec() to round up the length to the correct alignment so the callers don't need to do this, either. Convert the only user - inode forks - of this alignment rounding to use the new interface. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-11xfs: Introduce per-inode 64-bit extent countersChandan Babu R1-4/+19
This commit introduces new fields in the on-disk inode format to support 64-bit data fork extent counters and 32-bit attribute fork extent counters. The new fields will be used only when an inode has XFS_DIFLAG2_NREXT64 flag set. Otherwise we continue to use the regular 32-bit data fork extent counters and 16-bit attribute fork extent counters. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Suggested-by: Dave Chinner <dchinner@redhat.com>
2022-03-30xfs: aborting inodes on shutdown may need buffer lockDave Chinner1-29/+133
Most buffer io list operations are run with the bp->b_lock held, but xfs_iflush_abort() can be called without the buffer lock being held resulting in inodes being removed from the buffer list while other list operations are occurring. This causes problems with corrupted bp->b_io_list inode lists during filesystem shutdown, leading to traversals that never end, double removals from the AIL, etc. Fix this by passing the buffer to xfs_iflush_abort() if we have it locked. If the inode is attached to the buffer, we're going to have to remove it from the buffer list and we'd have to get the buffer off the inode log item to do that anyway. If we don't have a buffer passed in (e.g. from xfs_reclaim_inode()) then we can determine if the inode has a log item and if it is attached to a buffer before we do anything else. If it does have an attached buffer, we can lock it safely (because the inode has a reference to it) and then perform the inode abort. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-20xfs: xfs_is_shutdown vs xlog_is_shutdown cage fightDave Chinner1-0/+12
I've been chasing a recent resurgence in generic/388 recovery failure and/or corruption events. The events have largely been uninitialised inode chunks being tripped over in log recovery such as: XFS (pmem1): User initiated shutdown received. pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096 XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500). Shutting down filesystem. XFS (pmem1): Please unmount the filesystem and rectify the problem(s) XFS (pmem1): Unmounting Filesystem XFS (pmem1): Mounting V5 Filesystem XFS (pmem1): Starting recovery (logdev: internal) XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818) XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify XFS (pmem1): Unmount and run xfs_repair XFS (pmem1): First 128 bytes of corrupted metadata buffer: 00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ 00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................ XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117 XFS (pmem1): log mount/recovery failed: error -117 XFS (pmem1): log mount failed There have been isolated random other issues, too - xfs_repair fails because it finds some corruption in symlink blocks, rmap inconsistencies, etc - but they are nowhere near as common as the uninitialised inode chunk failure. The problem has clearly happened at runtime before recovery has run; I can see the ICREATE log item in the log shortly before the actively recovered range of the log. This means the ICREATE was definitely created and written to the log, but for some reason the tail of the log has been moved past the ordered buffer log item that tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the log moving past the ICREATE log item before the inode chunk buffer is written to disk. Tracing the fsstress processes that are running when the filesystem shut down immediately pin-pointed the problem: user shutdown marks xfs_mount as shutdown godown-213341 [008] 6398.022871: console: [ 6397.915392] XFS (pmem1): User initiated shutdown received. ..... aild tries to push ordered inode cluster buffer xfsaild/pmem1-213314 [001] 6398.022974: xfs_buf_trylock: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e xfsaild/pmem1-213314 [001] 6398.022976: xfs_ilock_nowait: dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae xfs_iflush_cluster() checks xfs_is_shutdown(), returns true, calls xfs_iflush_abort() to kill writeback of the inode. Inode is removed from AIL, drops cluster buffer reference. xfsaild/pmem1-213314 [001] 6398.022977: xfs_ail_delete: dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL xfsaild/pmem1-213314 [001] 6398.022978: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7 ..... All inodes on cluster buffer are aborted, then the cluster buffer itself is aborted and removed from the AIL *without writeback*: xfsaild/pmem1-213314 [001] 6398.023011: xfs_buf_error_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33 xfsaild/pmem1-213314 [001] 6398.023012: xfs_ail_delete: dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL The inode buffer was at 7/20344 when it was removed from the AIL. xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_item_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31 xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39 ..... Userspace is still running, doing stuff. an fsstress process runs syncfs() or sync() and we end up in sync_fs_one_sb() which issues a log force. This pushes on the CIL: fsstress-213322 [001] 6398.024430: xfs_fs_sync_fs: dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26 fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82 fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x5f caller xfs_log_force+0x7c <...>-194402 [001] 6398.024467: kmem_alloc: size 176 flags 0x14 caller xlog_cil_push_work+0x9f And the CIL fills up iclogs with pending changes. This picks up the current tail from the AIL: <...>-194402 [001] 6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags caller xlog_write+0x149 <...>-194402 [001] 6398.024498: xlog_iclog_switch: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags caller xlog_state_get_iclog_space+0x37e <...>-194402 [001] 6398.024521: xlog_iclog_release: dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags caller xlog_write+0x5f9 <...>-194402 [001] 6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448 And it moves the tail of the log to 7/21000 from 7/20344. This *moves the tail of the log beyond the ICREATE transaction* that was at 7/20344 and pinned by the inode cluster buffer that was cancelled above. .... godown-213341 [008] 6398.027005: xfs_force_shutdown: dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500 godown-213341 [008] 6398.027022: console: [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096 godown-213341 [008] 6398.030551: console: [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/ And finally the log itself is now shutdown, stopping all further writes to the log. But this is too late to prevent the corruption that moving the tail of the log forwards after we start cancelling writeback causes. The fundamental problem here is that we are using the wrong shutdown checks for log items. We've long conflated mount shutdown with log shutdown state, and I started separating that recently with the atomic shutdown state changes in commit b36d4651e165 ("xfs: make forced shutdown processing atomic"). The changes in that commit series are directly responsible for being able to diagnose this issue because it clearly separated mount shutdown from log shutdown. Essentially, once we start cancelling writeback of log items and removing them from the AIL because the filesystem is shut down, we *cannot* update the journal because we may have cancelled the items that pin the tail of the log. That moves the tail of the log forwards without having written the metadata back, hence we have corrupt in memory state and writing to the journal propagates that to the on-disk state. What commit b36d4651e165 makes clear is that log item state needs to change relative to log shutdown, not mount shutdown. IOWs, anything that aborts metadata writeback needs to check log shutdown state because log items directly affect log consistency. Having them check mount shutdown state introduces the above race condition where we cancel metadata writeback before the log shuts down. To fix this, this patch works through all log items and converts shutdown checks to use xlog_is_shutdown() rather than xfs_is_shutdown(), so that we don't start aborting metadata writeback before we shut off journal writes. AFAICT, this race condition is a zero day IO error handling bug in XFS that dates back to the introduction of XLOG_IO_ERROR, XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-23xfs: rename _zone variables to _cacheDarrick J. Wong1-3/+3
Now that we've gotten rid of the kmem_zone_t typedef, rename the variables to _cache since that's what they are. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-23xfs: remove kmem_zone typedefDarrick J. Wong1-1/+1
Remove these typedefs by referencing kmem_cache directly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-08-19xfs: replace xfs_sb_version checks with feature flag checksDave Chinner1-1/+1
Convert the xfs_sb_version_hasfoo() to checks against mp->m_features. Checks of the superblock itself during disk operations (e.g. in the read/write verifiers and the to/from disk formatters) are not converted - they operate purely on the superblock state. Everything else should use the mount features. Large parts of this conversion were done with sed with commands like this: for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f done With manual cleanups for things like "xfs_has_extflgbit" and other little inconsistencies in naming. The result is ia lot less typing to check features and an XFS binary size reduced by a bit over 3kB: $ size -t fs/xfs/built-in.a text data bss dec hex filenam before 1130866 311352 484 1442702 16038e (TOTALS) after 1127727 311352 484 1439563 15f74b (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-21xfs: xfs_log_force_lsn isn't passed a LSNDave Chinner1-2/+2
In doing an investigation into AIL push stalls, I was looking at the log force code to see if an async CIL push could be done instead. This lead me to xfs_log_force_lsn() and looking at how it works. xfs_log_force_lsn() is only called from inode synchronisation contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn value as the LSN to sync the log to. This gets passed to xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the journal, and then used by xfs_log_force_lsn() to flush the iclogs to the journal. The problem is that ip->i_itemp->ili_last_lsn does not store a log sequence number. What it stores is passed to it from the ->iop_committing method, which is called by xfs_log_commit_cil(). The value this passes to the iop_committing method is the CIL context sequence number that the item was committed to. As it turns out, xlog_cil_force_lsn() converts the sequence to an actual commit LSN for the related context and returns that to xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn" variable that contained a sequence with an actual LSN and then uses that to sync the iclogs. This caused me some confusion for a while, even though I originally wrote all this code a decade ago. ->iop_committing is only used by a couple of log item types, and only inode items use the sequence number it is passed. Let's clean up the API, CIL structures and inode log item to call it a sequence number, and make it clear that the high level code is using CIL sequence numbers and not on-disk LSNs for integrity synchronisation purposes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-21xfs: Fix CIL throttle hang when CIL space used going backwardsDave Chinner1-0/+14
A hang with tasks stuck on the CIL hard throttle was reported and largely diagnosed by Donald Buczek, who discovered that it was a result of the CIL context space usage decrementing in committed transactions once the hard throttle limit had been hit and processes were already blocked. This resulted in the CIL push not waking up those waiters because the CIL context was no longer over the hard throttle limit. The surprising aspect of this was the CIL space usage going backwards regularly enough to trigger this situation. Assumptions had been made in design that the relogging process would only increase the size of the objects in the CIL, and so that space would only increase. This change and commit message fixes the issue and documents the result of an audit of the triggers that can cause the CIL space to go backwards, how large the backwards steps tend to be, the frequency in which they occur, and what the impact on the CIL accounting code is. Even though the CIL ctx->space_used can go backwards, it will only do so if the log item is already logged to the CIL and contains a space reservation for it's entire logged state. This is tracked by the shadow buffer state on the log item. If the item is not previously logged in the CIL it has no shadow buffer nor log vector, and hence the entire size of the logged item copied to the log vector is accounted to the CIL space usage. i.e. it will always go up in this case. If the item has a log vector (i.e. already in the CIL) and the size decreases, then the existing log vector will be overwritten and the space usage will go down. This is the only condition where the space usage reduces, and it can only occur when an item is already tracked in the CIL. Hence we are safe from CIL space usage underruns as a result of log items decreasing in size when they are relogged. Typically this reduction in CIL usage occurs from metadata blocks being free, such as when a btree block merge occurs or a directory enter/xattr entry is removed and the da-tree is reduced in size. This generally results in a reduction in size of around a single block in the CIL, but also tends to increase the number of log vectors because the parent and sibling nodes in the tree needs to be updated when a btree block is removed. If a multi-level merge occurs, then we see reduction in size of 2+ blocks, but again the log vector count goes up. The other vector is inode fork size changes, which only log the current size of the fork and ignore the previously logged size when the fork is relogged. Hence if we are removing items from the inode fork (dir/xattr removal in shortform, extent record removal in extent form, etc) the relogged size of the inode for can decrease. No other log items can decrease in size either because they are a fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging an intent actually creates a new intent log item and doesn't relog the old item at all.) Hence the only two vectors for CIL context size reduction are relogging inode forks and marking buffers active in the CIL as stale. Long story short: the majority of the code does the right thing and handles the reduction in log item size correctly, and only the CIL hard throttle implementation is problematic and needs fixing. This patch makes that fix, as well as adds comments in the log item code that result in items shrinking in size when they are relogged as a clear reminder that this can and does happen frequently. The throttle fix is based upon the change Donald proposed, though it goes further to ensure that once the throttle is activated, it captures all tasks until the CIL push issues a wakeup, regardless of whether the CIL space used has gone back under the throttle threshold. This ensures that we prevent tasks reducing the CIL slightly under the throttle threshold and then making more changes that push it well over the throttle limit. This is acheived by checking if the throttle wait queue is already active as a condition of throttling. Hence once we start throttling, we continue to apply the throttle until the CIL context push wakes everything on the wait queue. We can use waitqueue_active() for the waitqueue manipulations and checks as they are all done under the ctx->xc_push_lock. Hence the waitqueue has external serialisation and we can safely peek inside the wait queue without holding the internal waitqueue locks. Many thanks to Donald for his diagnostic and analysis work to isolate the cause of this hang. Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-23xfs: rename struct xfs_legacy_ictimestampChristoph Hellwig1-2/+2
Rename struct xfs_legacy_ictimestamp to struct xfs_log_legacy_timestamp as it is a type used for logging timestamps with no relationship to the in-core inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-23xfs: rename xfs_ictimestamp_tChristoph Hellwig1-2/+2
Rename xfs_ictimestamp_t to xfs_log_timestamp_t as it is a type used for logging timestamps with no relationship to the in-core inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: move the di_crtime field to struct xfs_inodeChristoph Hellwig1-2/+1
Move the crtime field from struct xfs_icdinode into stuct xfs_inode and remove the now entirely unused struct xfs_icdinode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: move the di_flags2 field to struct xfs_inodeChristoph Hellwig1-1/+1
In preparation of removing the historic icinode struct, move the flags2 field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: move the di_flags field to struct xfs_inodeChristoph Hellwig1-1/+1
In preparation of removing the historic icinode struct, move the flags field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: move the di_forkoff field to struct xfs_inodeChristoph Hellwig1-1/+1
In preparation of removing the historic icinode struct, move the forkoff field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: move the di_flushiter field to struct xfs_inodeChristoph Hellwig1-1/+1
In preparation of removing the historic icinode struct, move the flushiter field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: move the di_cowextsize field to struct xfs_inodeChristoph Hellwig1-1/+1
In preparation of removing the historic icinode struct, move the cowextsize field into the containing xfs_inode structure. Also switch to use the xfs_extlen_t instead of a uint32_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: move the di_extsize field to struct xfs_inodeChristoph Hellwig1-1/+1
In preparation of removing the historic icinode struct, move the extsize field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: move the di_nblocks field to struct xfs_inodeChristoph Hellwig1-1/+1
In preparation of removing the historic icinode struct, move the nblocks field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: move the di_size field to struct xfs_inodeChristoph Hellwig1-2/+2
In preparation of removing the historic icinode struct, move the on-disk size field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: move the di_projid field to struct xfs_inodeChristoph Hellwig1-2/+2
In preparation of removing the historic icinode struct, move the projid field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-08xfs: remove the di_dmevmask and di_dmstate fields from struct xfs_icdinodeChristoph Hellwig1-2/+29
The legacy DMAPI fields were never set by upstream Linux XFS, and have no way to be read using the kernel APIs. So instead of bloating the in-core inode for them just copy them from the on-disk inode into the log when logging the inode. The only caveat is that we need to make sure to zero the fields for newly read or deleted inodes, which is solved using a new flag in the inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2020-09-16xfs: widen ondisk inode timestamps to deal with y2038+Darrick J. Wong1-4/+8
Redesign the ondisk inode timestamps to be a simple unsigned 64-bit counter of nanoseconds since 14 Dec 1901 (i.e. the minimum time in the 32-bit unix time epoch). This enables us to handle dates up to 2486, which solves the y2038 problem. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-16xfs: redefine xfs_ictimestamp_tDarrick J. Wong1-8/+22
Redefine xfs_ictimestamp_t as a uint64_t typedef in preparation for the bigtime functionality. Preserve the legacy structure format so that we can let the compiler take care of the masking and shifting. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-16xfs: move the buffer retry logic to xfs_buf.cChristoph Hellwig1-1/+11
Move the buffer retry state machine logic to xfs_buf.c and call it once from xfs_ioend instead of duplicating it three times for the three kinds of buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-07xfs: xfs_iflock is no longer a completionDave Chinner1-8/+7
With the recent rework of the inode cluster flushing, we no longer ever wait on the the inode flush "lock". It was never a lock in the first place, just a completion to allow callers to wait for inode IO to complete. We now never wait for flush completion as all inode flushing is non-blocking. Hence we can get rid of all the iflock infrastructure and instead just set and check a state flag. Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the xfs_iflock_nowait() test-and-set operations on that flag, and replace all the xfs_ifunlock() calls to clear operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-08-05xfs: delete duplicated words + other fixesRandy Dunlap1-2/+2
Delete repeated words in fs/xfs/. {we, that, the, a, to, fork} Change "it it" to "it is" in one location. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> To: linux-fsdevel@vger.kernel.org Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-29xfs: Remove kmem_zone_zalloc() usageCarlos Maiolino1-1/+2
Use kmem_cache_zalloc() directly. With the exception of xlog_ticket_alloc() which will be dealt on the next patch for readability. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-07-07xfs: factor xfs_iflush_doneDave Chinner1-75/+82
xfs_iflush_done() does 3 distinct operations to the inodes attached to the buffer. Separate these operations out into functions so that it is easier to modify these operations independently in future. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07xfs: rework xfs_iflush_cluster() dirty inode iterationDave Chinner1-1/+7
Now that we have all the dirty inodes attached to the cluster buffer, we don't actually have to do radix tree lookups to find them. Sure, the radix tree is efficient, but walking a linked list of just the dirty inodes attached to the buffer is much better. We are also no longer dependent on having a locked inode passed into the function to determine where to start the lookup. This means we can drop it from the function call and treat all inodes the same. We also make xfs_iflush_cluster skip inodes marked with XFS_IRECLAIM. This we avoid races with inodes that reclaim is actively referencing or are being re-initialised by inode lookup. If they are actually dirty, they'll get written by a future cluster flush.... We also add a shutdown check after obtaining the flush lock so that we catch inodes that are dirty in memory and may have inconsistent state due to the shutdown in progress. We abort these inodes directly and so they remove themselves directly from the buffer list and the AIL rather than having to wait for the buffer to be failed and callbacks run to be processed correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07xfs: xfs_iflush() is no longer necessaryDave Chinner1-33/+18
Now we have a cached buffer on inode log items, we don't need to do buffer lookups when flushing inodes anymore - all we need to do is lock the buffer and we are ready to go. This largely gets rid of the need for xfs_iflush(), which is essentially just a mechanism to look up the buffer and flush the inode to it. Instead, we can just call xfs_iflush_cluster() with a few modifications to ensure it also flushes the inode we already hold locked. This allows the AIL inode item pushing to be almost entirely non-blocking in XFS - we won't block unless memory allocation for the cluster inode lookup blocks or the block device queues are full. Writeback during inode reclaim becomes a little more complex because we now have to lock the buffer ourselves, but otherwise this change is largely a functional no-op that removes a whole lot of code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07xfs: attach inodes to the cluster buffer when dirtiedDave Chinner1-2/+14
Rather than attach inodes to the cluster buffer just when we are doing IO, attach the inodes to the cluster buffer when they are dirtied. The means the buffer always carries a list of dirty inodes that reference it, and we can use that list to make more fundamental changes to inode writeback that aren't otherwise possible. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07xfs: pin inode backing buffer to the inode log itemDave Chinner1-12/+49
When we dirty an inode, we are going to have to write it disk at some point in the near future. This requires the inode cluster backing buffer to be present in memory. Unfortunately, under severe memory pressure we can reclaim the inode backing buffer while the inode is dirty in memory, resulting in stalling the AIL pushing because it has to do a read-modify-write cycle on the cluster buffer. When we have no memory available, the read of the cluster buffer blocks the AIL pushing process, and this causes all sorts of issues for memory reclaim as it requires inode writeback to make forwards progress. Allocating a cluster buffer causes more memory pressure, and results in more cluster buffers to be reclaimed, resulting in more RMW cycles to be done in the AIL context and everything then backs up on AIL progress. Only the synchronous inode cluster writeback in the the inode reclaim code provides some level of forwards progress guarantees that prevent OOM-killer rampages in this situation. Fix this by pinning the inode backing buffer to the inode log item when the inode is first dirtied (i.e. in xfs_trans_log_inode()). This may mean the first modification of an inode that has been held in cache for a long time may block on a cluster buffer read, but we can do that in transaction context and block safely until the buffer has been allocated and read. Once we have the cluster buffer, the inode log item takes a reference to it, pinning it in memory, and attaches it to the log item for future reference. This means we can always grab the cluster buffer from the inode log item when we need it. When the inode is finally cleaned and removed from the AIL, we can drop the reference the inode log item holds on the cluster buffer. Once all inodes on the cluster buffer are clean, the cluster buffer will be unpinned and it will be available for memory reclaim to reclaim again. This avoids the issues with needing to do RMW cycles in the AIL pushing context, and hence allows complete non-blocking inode flushing to be performed by the AIL pushing context. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()Dave Chinner1-2/+1
xfs_ail_delete_one() is called directly from dquot and inode IO completion, as well as from the generic xfs_trans_ail_delete() function. Inodes are about to have their own failure handling, and dquots will in future, too. Pull the clearing of the LI_FAILED flag up into the callers so we can customise the code appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07xfs: unwind log item error flaggingDave Chinner1-18/+0
When an buffer IO error occurs, we want to mark all the log items attached to the buffer as failed. Open code the error handling loop so that we can modify the flagging for the different types of objects directly and independently of each other. This also allows us to remove the ->iop_error method from the log item operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>