summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2021-02-26fs/coredump: use kmap_local_page()Ira Weiny1-2/+2
In dump_user_range() there is no reason for the mapping to be global. Use kmap_local_page() rather than kmap. Link: https://lkml.kernel.org/r/20210203223328.558945-1-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26proc: use kvzalloc for our kernel bufferJosef Bacik1-2/+2
Since sysctl: pass kernel pointers to ->proc_handler we have been pre-allocating a buffer to copy the data from the proc handlers into, and then copying that to userspace. The problem is this just blindly kzalloc()'s the buffer size passed in from the read, which in the case of our 'cat' binary was 64kib. Order-4 allocations are not awesome, and since we can potentially allocate up to our maximum order, so use kvzalloc for these buffers. [willy@infradead.org: changelog tweaks] Link: https://lkml.kernel.org/r/6345270a2c1160b89dd5e6715461f388176899d1.1612972413.git.josef@toxicpanda.com Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> CC: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26proc/wchan: use printk format instead of lookup_symbol_name()Helge Deller1-11/+8
To resolve the symbol fuction name for wchan, use the printk format specifier %ps instead of manually looking up the symbol function name via lookup_symbol_name(). Link: https://lkml.kernel.org/r/20201217165413.GA1959@ls3530.fritz.box Signed-off-by: Helge Deller <deller@gmx.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26iomap: use mapping_seek_hole_dataMatthew Wilcox (Oracle)1-114/+11
Enhance mapping_seek_hole_data() to handle partially uptodate pages and convert the iomap seek code to call it. Link: https://lkml.kernel.org/r/20201112212641.27837-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26Merge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds12-190/+261
Pull NFS Client Updates from Anna Schumaker: "New Features: - Support for eager writes, and the write=eager and write=wait mount options - Other Bugfixes and Cleanups: - Fix typos in some comments - Fix up fall-through warnings for Clang - Cleanups to the NFS readpage codepath - Remove FMR support in rpcrdma_convert_iovs() - Various other cleanups to xprtrdma - Fix xprtrdma pad optimization for servers that don't support RFC 8797 - Improvements to rpcrdma tracepoints - Fix up nfs4_bitmask_adjust() - Optimize sparse writes past the end of files" * tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits) NFS: Support the '-owrite=' option in /proc/self/mounts and mountinfo NFS: Set the stable writes flag when initialising the super block NFS: Add mount options supporting eager writes NFS: Add support for eager writes NFS: 'flags' field should be unsigned in struct nfs_server NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache NFS: Always clear an invalid mapping when attempting a buffered write NFS: Optimise sparse writes past the end of file NFS: Fix documenting comment for nfs_revalidate_file_size() NFSv4: Fixes for nfs4_bitmask_adjust() xprtrdma: Clean up rpcrdma_prepare_readch() rpcrdma: Capture bytes received in Receive completion tracepoints xprtrdma: Pad optimization, revisited rpcrdma: Fix comments about reverse-direction operation xprtrdma: Refactor invocations of offset_in_page() xprtrdma: Simplify rpcrdma_convert_kvec() and frwr_map() xprtrdma: Remove FMR support in rpcrdma_convert_iovs() NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async() NFS: Call readpage_async_filler() from nfs_readpage_async() NFS: Refactor nfs_readpage() and nfs_readpage_async() to use nfs_readdesc ...
2021-02-26btrfs: use copy_highpage() instead of 2 kmaps()Ira Weiny1-9/+1
There are many places where kmap/memove/kunmap patterns occur. This pattern exists in the core common function copy_highpage(). Use copy_highpage to avoid open coding the use of kmap and leverages the core functions use of kmap_local_page(). Development of this patch was aided by the following coccinelle script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap/copypage/kunmap pattern and replace with copy_highpage calls // // NOTE: The expressions in the copy page version of this kmap pattern are // overly complex and so these all need individual attention. // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ // Comments: // Options: // // Then a copy_page where we have 2 pages involved. // @ copy_page_rule @ expression page, page2, To, From, Size; identifier ptr, ptr2; type VP, VP2; @@ /* kmap */ ( -VP ptr = kmap(page); ... -VP2 ptr2 = kmap(page2); | -VP ptr = kmap_atomic(page); ... -VP2 ptr2 = kmap_atomic(page2); | -ptr = kmap(page); ... -ptr2 = kmap(page2); | -ptr = kmap_atomic(page); ... -ptr2 = kmap_atomic(page2); ) // 1 or more copy versions of the entire page <+... ( -copy_page(To, From); +copy_highpage(To, From); | -memmove(To, From, Size); +memmoveExtra(To, From, Size); ) ...+> /* kunmap */ ( -kunmap(page2); ... -kunmap(page); | -kunmap(page); ... -kunmap(page2); | -kmap_atomic(ptr2); ... -kmap_atomic(ptr); ) // Remove any pointers left unused @ depends on copy_page_rule @ identifier copy_page_rule.ptr; identifier copy_page_rule.ptr2; type VP, VP1; type VP2, VP21; @@ -VP ptr; ... when != ptr; ? VP1 ptr; -VP2 ptr2; ... when != ptr2; ? VP21 ptr2; // </smpl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-26btrfs: use memcpy_[to|from]_page() and kmap_local_page()Ira Weiny6-23/+11
There are many places where the pattern kmap/memcpy/kunmap occurs. This pattern was lifted to the core common functions memcpy_[to|from]_page(). Use these new functions to reduce the code, eliminate direct uses of kmap, and leverage the new core functions use of kmap_local_page(). Also, there is 1 place where a kmap/memcpy is followed by an optional memset. Here we leave the kmap open coded to avoid remapping the page but use kmap_local_page() directly. Development of this patch was aided by the coccinelle script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls // // NOTE: Offsets and other expressions may be more complex than what the script // will automatically generate. Therefore a catchall rule is provided to find // the pattern which then must be evaluated by hand. // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ // Comments: // Options: // // simple memcpy version // @ memcpy_rule1 @ expression page, T, F, B, Off; identifier ptr; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( -memcpy(ptr + Off, F, B); +memcpy_to_page(page, Off, F, B); | -memcpy(ptr, F, B); +memcpy_to_page(page, 0, F, B); | -memcpy(T, ptr + Off, B); +memcpy_from_page(T, page, Off, B); | -memcpy(T, ptr, B); +memcpy_from_page(T, page, 0, B); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memcpy_rule1 @ identifier memcpy_rule1.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // // Some callers kmap without a temp pointer // @ memcpy_rule2 @ expression page, T, Off, F, B; @@ <+... ( -memcpy(kmap(page) + Off, F, B); +memcpy_to_page(page, Off, F, B); | -memcpy(kmap(page), F, B); +memcpy_to_page(page, 0, F, B); | -memcpy(T, kmap(page) + Off, B); +memcpy_from_page(T, page, Off, B); | -memcpy(T, kmap(page), B); +memcpy_from_page(T, page, 0, B); ) ...+> -kunmap(page); // No need for the ptr variable removal // // Catch all // @ memcpy_rule3 @ expression page; expression GenTo, GenFrom, GenSize; identifier ptr; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( // // Some call sites have complex expressions within the memcpy // match a catch all to be evaluated by hand. // -memcpy(GenTo, GenFrom, GenSize); +memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize); +memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memcpy_rule3 @ identifier memcpy_rule3.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // <smpl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-26cifs: update internal version numberSteve French1-1/+1
To 2.31 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-26cifs: use discard iterator to discard unneeded network data more efficientlyDavid Howells3-3/+22
The iterator, ITER_DISCARD, that can only be used in READ mode and just discards any data copied to it, was added to allow a network filesystem to discard any unwanted data sent by a server. Convert cifs_discard_from_socket() to use this. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: introduce helper for finding referral server to improve DFS target ↵Paulo Alcantara2-17/+51
resolution Some servers seem to mistakenly report different values for capabilities and share flags, so we can't always rely on those values to decide whether the resolved target can handle any new DFS referrals. Add a new helper is_referral_server() to check if all resolved targets can handle new DFS referrals by directly looking at the GET_DFS_REFERRAL.ReferralHeaderFlags value as specified in MS-DFSC 2.2.4 RESP_GET_DFS_REFERRAL in addition to is_tcon_dfs(). Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org # 5.11 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: check all path components in resolved dfs targetPaulo Alcantara1-60/+33
Handle the case where a resolved target share is like //server/users/dir, and the user "foo" has no read permission to access the parent folder "users" but has access to the final path component "dir". is_path_remote() already implements that, so call it directly. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org # 5.11 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: fix DFS failoverPaulo Alcantara1-64/+59
In do_dfs_failover(), the mount_get_conns() function requires the full fs context in order to get new connection to server, so clone the original context and change it accordingly when retrying the DFS targets in the referral. If failover was successful, then update original context with the new UNC, prefix path and ip address. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org # 5.11 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: fix nodfs mount optionPaulo Alcantara1-4/+4
Skip DFS resolving when mounting with 'nodfs' even if CONFIG_CIFS_DFS_UPCALL is enabled. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org # 5.11 Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: fix handling of escaped ',' in the password mount argumentRonnie Sahlberg1-13/+30
Passwords can contain ',' which are also used as the separator between mount options. Mount.cifs will escape all ',' characters as the string ",,". Update parsing of the mount options to detect ",," and treat it as a single 'c' character. Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api") Cc: stable@vger.kernel.org # 5.11 Reported-by: Simon Taylor <simon@simon-taylor.me.uk> Tested-by: Simon Taylor <simon@simon-taylor.me.uk> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25Merge tag 'ext4_for_linus' of ↵Linus Torvalds6-49/+59
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Miscellaneous ext4 cleanups and bug fixes. Pretty boring this cycle..." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: add .kunitconfig fragment to enable ext4-specific tests ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it ext4: reset retry counter when ext4_alloc_file_blocks() makes progress ext4: fix potential htree index checksum corruption ext4: factor out htree rep invariant check ext4: Change list_for_each* to list_for_each_entry* ext4: don't try to processed freed blocks until mballoc is initialized ext4: use DEFINE_MUTEX() for mutex lock
2021-02-25cifs: Add new parameter "acregmax" for distinct file and directory metadata ↵Steve French5-13/+37
timeout The new optional mount parameter "acregmax" allows a different timeout for file metadata ("acdirmax" now allows controlling timeout for directory metadata). Setting "actimeo" still works as before, and changes timeout for both files and directories, but specifying "acregmax" or "acdirmax" allows overriding the default more granularly which can be a big performance benefit on some workloads. "acregmax" is already used by NFS as a mount parameter (albeit with a larger default and thus looser caching). Suggested-by: Tom Talpey <tom@talpey.com> Reviewed-By: Tom Talpey <tom@talpey.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: convert revalidate of directories to using directory metadata cache ↵Steve French1-6/+17
timeout The new optional mount parm, "acdirmax" allows caching the metadata for a directory longer than file metadata, which can be very helpful for performance. Convert cifs_inode_needs_reval to check acdirmax for revalidating directory metadata. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-By: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-25cifs: Add new mount parameter "acdirmax" to allow caching directory metadataSteve French4-2/+16
nfs and cifs on Linux currently have a mount parameter "actimeo" to control metadata (attribute) caching but cifs does not have additional mount parameters to allow distinguishing between caching directory metadata (e.g. needed to revalidate paths) and that for files. Add new mount parameter "acdirmax" to allow caching metadata for directories more loosely than file data. NFS adjusts metadata caching from acdirmin to acdirmax (and another two mount parms for files) but to reduce complexity, it is safer to just introduce the one mount parm to allow caching directories longer. The defaults for acdirmax and actimeo (for cifs.ko) are conservative, 1 second (NFS defaults acdirmax to 60 seconds). For many workloads, setting acdirmax to a higher value is safe and will improve performance. This patch leaves unchanged the default values for caching metadata for files and directories but gives the user more flexibility in adjusting them safely for their workload via the new mount parm. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-By: Tom Talpey <tom@talpey.com>
2021-02-25io-wq: remove now unused IO_WQ_BIT_ERRORJens Axboe1-10/+0
This flag is now dead, remove it. Fixes: 1cbd9c2bcf02 ("io-wq: don't create any IO workers upfront") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25io_uring: fix SQPOLL thread handling over execJens Axboe1-1/+34
Just like the changes for io-wq, ensure that we re-fork the SQPOLL thread if the owner execs. Mark the ctx sq thread as sqo_exec if it dies, and the ring as needing a wakeup which will force the task to enter the kernel. When it does, setup the new thread and proceed as usual. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25io-wq: improve manager/worker handling over execJens Axboe3-23/+45
exec will cancel any threads, including the ones that io-wq is using. This isn't a problem, in fact we'd prefer it to be that way since it means we know that any async work cancels naturally without having to handle it proactively. But it does mean that we need to setup a new manager, as the manager and workers are gone. Handle this at queue time, and cancel work if we fail. Since the manager can go away without us noticing, ensure that the manager itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to io_wq_put() to reflect that. In the future we can now simplify exec cancelation handling, for now just leave it the same. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25io_uring: ensure SQPOLL startup is triggered before error shutdownJens Axboe1-1/+2
syzbot reports the following hang: INFO: task syz-executor.0:12538 can't die for more than 143 seconds. task:syz-executor.0 state:D stack:28352 pid:12538 ppid: 8423 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4324 [inline] __schedule+0x90c/0x21a0 kernel/sched/core.c:5075 schedule+0xcf/0x270 kernel/sched/core.c:5154 schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x168/0x270 kernel/sched/completion.c:138 io_sq_thread_finish+0x96/0x580 fs/io_uring.c:7152 io_sq_offload_create fs/io_uring.c:7929 [inline] io_uring_create fs/io_uring.c:9465 [inline] io_uring_setup+0x1fb2/0x2c20 fs/io_uring.c:9550 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae which is due to exiting after the SQPOLL thread has been created, but hasn't been started yet. Ensure that we always complete the startup side when waiting for it to exit. Reported-by: syzbot+c927c937cba8ef66dd4a@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25io-wq: make buffered file write hashed work map per-ctxJens Axboe3-11/+107
Before the io-wq thread change, we maintained a hash work map and lock per-node per-ring. That wasn't ideal, as we really wanted it to be per ring. But now that we have per-task workers, the hash map ends up being just per-task. That'll work just fine for the normal case of having one task use a ring, but if you share the ring between tasks, then it's considerably worse than it was before. Make the hash map per ctx instead, which provides full per-ctx buffered write serialization on hashed writes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25xfs: use current->journal_info for detecting transaction recursionDave Chinner5-26/+60
Because the iomap code using PF_MEMALLOC_NOFS to detect transaction recursion in XFS is just wrong. Remove it from the iomap code and replace it with XFS specific internal checks using current->journal_info instead. [djwong: This change also realigns the lifetime of NOFS flag changes to match the incore transaction, instead of the inconsistent scheme we have now.] Fixes: 9070733b4efa ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-25xfs: don't nest transactions when scanning for eofblocksDarrick J. Wong1-3/+10
Brian Foster reported a lockdep warning on xfs/167: ============================================ WARNING: possible recursive locking detected 5.11.0-rc4 #35 Tainted: G W I -------------------------------------------- fsstress/17733 is trying to acquire lock: ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_free_eofblocks+0x104/0x1d0 [xfs] but task is already holding lock: ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_trans_alloc_inode+0x5f/0x160 [xfs] stack backtrace: CPU: 38 PID: 17733 Comm: fsstress Tainted: G W I 5.11.0-rc4 #35 Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 1.6.11 11/20/2018 Call Trace: dump_stack+0x8b/0xb0 __lock_acquire.cold+0x159/0x2ab lock_acquire+0x116/0x370 xfs_trans_alloc+0x1ad/0x310 [xfs] xfs_free_eofblocks+0x104/0x1d0 [xfs] xfs_blockgc_scan_inode+0x24/0x60 [xfs] xfs_inode_walk_ag+0x202/0x4b0 [xfs] xfs_inode_walk+0x66/0xc0 [xfs] xfs_trans_alloc+0x160/0x310 [xfs] xfs_trans_alloc_inode+0x5f/0x160 [xfs] xfs_alloc_file_space+0x105/0x300 [xfs] xfs_file_fallocate+0x270/0x460 [xfs] vfs_fallocate+0x14d/0x3d0 __x64_sys_fallocate+0x3e/0x70 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The cause of this is the new code that spurs a scan to garbage collect speculative preallocations if we fail to reserve enough blocks while allocating a transaction. While the warning itself is a fairly benign lockdep complaint, it does expose a potential livelock if the rwsem behavior ever changes with regards to nesting read locks when someone's waiting for a write lock. Fix this by freeing the transaction and jumping back to xfs_trans_alloc like this patch in the V4 submission[1]. [1] https://lore.kernel.org/linux-xfs/161142798066.2171939.9311024588681972086.stgit@magnolia/ Fixes: a1a7d05a0576 ("xfs: flush speculative space allocations when we run out of space") Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-25xfs: don't reuse busy extents on extent trimBrian Foster1-14/+0
Freed extents are marked busy from the point the freeing transaction commits until the associated CIL context is checkpointed to the log. This prevents reuse and overwrite of recently freed blocks before the changes are committed to disk, which can lead to corruption after a crash. The exception to this rule is that metadata allocation is allowed to reuse busy extents because metadata changes are also logged. As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS has allowed modification or complete invalidation of outstanding busy extents for metadata allocations. This implementation assumes that use of the associated extent is imminent, which is not always the case. For example, the trimmed extent might not satisfy the minimum length of the allocation request, or the allocation algorithm might be involved in a search for the optimal result based on locality. generic/019 reproduces a corruption caused by this scenario. First, a metadata block (usually a bmbt or symlink block) is freed from an inode. A subsequent bmbt split on an unrelated inode attempts a near mode allocation request that invalidates the busy block during the search, but does not ultimately allocate it. Due to the busy state invalidation, the block is no longer considered busy to subsequent allocation. A direct I/O write request immediately allocates the block and writes to it. Finally, the filesystem crashes while in a state where the initial metadata block free had not committed to the on-disk log. After recovery, the original metadata block is in its original location as expected, but has been corrupted by the aforementioned dio. This demonstrates that it is fundamentally unsafe to modify busy extent state for extents that are not guaranteed to be allocated. This applies to pretty much all of the code paths that currently trim busy extents for one reason or another. Therefore to address this problem, drop the reuse mechanism from the busy extent trim path. This code already knows how to return partial non-busy ranges of the targeted free extent and higher level code tracks the busy state of the allocation attempt. If a block allocation fails where one or more candidate extents is busy, we force the log and retry the allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-25Revert "io_uring: wait potential ->release() on resurrect"Jens Axboe1-18/+8
This reverts commit 88f171ab7798a1ed0b9e39867ee16f307466e870. I ran into a case where the ref resurrect now spins, so revert this change for now until we can further investigate why it's broken. The bug seems to indicate spinning on the lock itself, likely there's some ABBA deadlock involved: [<0>] __percpu_ref_switch_mode+0x45/0x180 [<0>] percpu_ref_resurrect+0x46/0x70 [<0>] io_refs_resurrect+0x25/0xa0 [<0>] __io_uring_register+0x135/0x10c0 [<0>] __x64_sys_io_uring_register+0xc2/0x1a0 [<0>] do_syscall_64+0x42/0x110 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds20-92/+79
Merge misc updates from Andrew Morton: "A few small subsystems and some of MM. 172 patches. Subsystems affected by this patch series: hexagon, scripts, ntfs, ocfs2, vfs, and mm (slab-generic, slab, slub, debug, pagecache, swap, memcg, pagemap, mprotect, mremap, page-reporting, vmalloc, kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction, mempolicy, oom-kill, hugetlbfs, and migration)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (172 commits) mm/migrate: remove unneeded semicolons hugetlbfs: remove unneeded return value of hugetlb_vmtruncate() hugetlbfs: fix some comment typos hugetlbfs: correct some obsolete comments about inode i_mutex hugetlbfs: make hugepage size conversion more readable hugetlbfs: remove meaningless variable avoid_reserve hugetlbfs: correct obsolete function name in hugetlbfs_read_iter() hugetlbfs: use helper macro default_hstate in init_hugetlbfs_fs hugetlbfs: remove useless BUG_ON(!inode) in hugetlbfs_setattr() hugetlbfs: remove special hugetlbfs_set_page_dirty() mm/hugetlb: change hugetlb_reserve_pages() to type bool mm, oom: fix a comment in dump_task() mm/mempolicy: use helper range_in_vma() in queue_pages_test_walk() numa balancing: migrate on fault among multiple bound nodes mm, compaction: make fast_isolate_freepages() stay within zone mm/compaction: fix misbehaviors of fast_find_migrateblock() mm/compaction: correct deferral logic for proactive compaction mm/compaction: remove duplicated VM_BUG_ON_PAGE !PageLocked mm/compaction: remove rcu_read_lock during page compaction z3fold: simplify the zhdr initialization code in init_z3fold_page() ...
2021-02-25hugetlbfs: remove unneeded return value of hugetlb_vmtruncate()Miaohe Lin1-5/+2
The function hugetlb_vmtruncate() is guaranteed to always success since commit 7aa91e104028 ("hugetlb: allow extending ftruncate on hugetlbfs"). So we should remove the unneeded return value which is always 0. Link: https://lkml.kernel.org/r/20210208084637.47789-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25hugetlbfs: fix some comment typosMiaohe Lin1-4/+4
Fix typos reserv to reserve, minimim to minimum. No functional change intended. Link: https://lkml.kernel.org/r/20210130092351.28072-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25hugetlbfs: correct some obsolete comments about inode i_mutexMiaohe Lin1-2/+2
Since commit 9902af79c01a ("parallel lookups: actual switch to rwsem"), i_mutex of inode is converted to i_rwsem. So replace i_mutex with i_rwsem to make comments up to date. Link: https://lkml.kernel.org/r/20210127093111.36672-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25hugetlbfs: make hugepage size conversion more readableMiaohe Lin1-2/+2
The calculation 1U << (h->order + PAGE_SHIFT - 10) is actually equal to (PAGE_SHIFT << (h->order)) >> 10. So we can make it more readable by replace it with huge_page_size(h) >> 10. Link: https://lkml.kernel.org/r/20210122083141.24548-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25hugetlbfs: remove meaningless variable avoid_reserveMiaohe Lin1-3/+9
The variable avoid_reserve is meaningless because we never changed its value and just passed it to alloc_huge_page(). So remove it to make code more clear that in hugetlbfs_fallocate, we never avoid reserve when alloc hugepage yet. Also add a comment offered by Mike Kravetz to explain this. Link: https://lkml.kernel.org/r/20210120071508.9078-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25hugetlbfs: correct obsolete function name in hugetlbfs_read_iter()Miaohe Lin1-1/+1
Since commit 36e789144267 ("kill do_generic_mapping_read"), the function do_generic_mapping_read() is renamed to do_generic_file_read(). And then commit 47c27bc46946 ("fs: pass iocb to do_generic_file_read") renamed it to generic_file_buffered_read(). So replace do_generic_mapping_read() with generic_file_buffered_read() to keep comment uptodate. Link: https://lkml.kernel.org/r/20210118063210.47118-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25hugetlbfs: use helper macro default_hstate in init_hugetlbfs_fsMiaohe Lin1-1/+1
Since commit e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes"), we can use macro default_hstate to get the struct hstate which we use by default. But init_hugetlbfs_fs() forgot to use it. Link: https://lkml.kernel.org/r/20210116091827.20982-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25hugetlbfs: remove useless BUG_ON(!inode) in hugetlbfs_setattr()Miaohe Lin1-2/+0
When we reach here with inode = NULL, we should have crashed as inode has already been dereferenced via hstate_inode. So this BUG_ON(!inode) does not take effect and should be removed. Link: https://lkml.kernel.org/r/20210118110700.52506-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25hugetlbfs: remove special hugetlbfs_set_page_dirty()Mike Kravetz1-12/+1
Matthew Wilcox noticed that hugetlbfs_set_page_dirty always returns 0. Instead, it should return 1 or 0 depending on the previous state of the dirty bit. In addition, the call to compound_head is redundant as it is also performed in calling routine set_page_dirty. Replace the hugetlbfs specific routine hugetlbfs_set_page_dirty with __set_page_dirty_no_writeback as it addresses both of these issues. Link: https://lkml.kernel.org/r/20201221192542.15732-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25mm/hugetlb: change hugetlb_reserve_pages() to type boolMike Kravetz1-2/+2
While reviewing a bug in hugetlb_reserve_pages, it was noticed that all callers ignore the return value. Any failure is considered an ENOMEM error by the callers. Change the function to be of type bool. The function will return true if the reservation was successful, false otherwise. Callers currently assume a zero return code indicates success. Change the callers to look for true to indicate success. No functional change, only code cleanup. Link: https://lkml.kernel.org/r/20201221192542.15732-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25hugetlb: convert page_huge_active() HPageMigratable flagMike Kravetz1-1/+1
Use the new hugetlb page specific flag HPageMigratable to replace the page_huge_active interfaces. By it's name, page_huge_active implied that a huge page was on the active list. However, that is not really what code checking the flag wanted to know. It really wanted to determine if the huge page could be migrated. This happens when the page is actually added to the page cache and/or task page table. This is the reasoning behind the name change. The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not really necessary as we KNOW the page is a hugetlb page. Therefore, they are removed. The routine page_huge_active checked for PageHeadHuge before testing the active bit. This is unnecessary in the case where we hold a reference or lock and know it is a hugetlb head page. page_huge_active is also called without holding a reference or lock (scan_movable_pages), and can race with code freeing the page. The extra check in page_huge_active shortened the race window, but did not prevent the race. Offline code calling scan_movable_pages already deals with these races, so removing the check is acceptable. Add comment to racy code. [songmuchun@bytedance.com: remove set_page_huge_active() declaration from include/linux/hugetlb.h] Link: https://lkml.kernel.org/r/CAMZfGtUda+KoAZscU0718TN61cSFwp4zy=y2oZ=+6Z2TAZZwng@mail.gmail.com Link: https://lkml.kernel.org/r/20210122195231.324857-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25hugetlb: use page.private for hugetlb specific page flagsMike Kravetz1-9/+3
Patch series "create hugetlb flags to consolidate state", v3. While discussing a series of hugetlb fixes in [1], it became evident that the hugetlb specific page state information is stored in a somewhat haphazard manner. Code dealing with state information would be easier to read, understand and maintain if this information was stored in a consistent manner. This series uses page.private of the hugetlb head page for storing a set of hugetlb specific page flags. Routines are priovided for test, set and clear of the flags. [1] https://lore.kernel.org/r/20210106084739.63318-1-songmuchun@bytedance.com This patch (of 4): As hugetlbfs evolved, state information about hugetlb pages was added. One 'convenient' way of doing this was to use available fields in tail pages. Over time, it has become difficult to know the meaning or contents of fields simply by looking at a small bit of code. Sometimes, the naming is just confusing. For example: The PagePrivate flag indicates a huge page reservation was consumed and needs to be restored if an error is encountered and the page is freed before it is instantiated. The page.private field contains the pointer to a subpool if the page is associated with one. In an effort to make the code more readable, use page.private to contain hugetlb specific page flags. These flags will have test, set and clear functions similar to those used for 'normal' page flags. More importantly, an enum of flag values will be created with names that actually reflect their purpose. In this patch, - Create infrastructure for hugetlb specific page flag functions - Move subpool pointer to page[1].private to make way for flags Create routines with meaningful names to modify subpool field - Use new HPageRestoreReserve flag instead of PagePrivate Conversion of other state information will happen in subsequent patches. Link: https://lkml.kernel.org/r/20210122195231.324857-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20210122195231.324857-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25vmalloc: remove redundant NULL checkYang Li1-5/+2
Fix below warnings reported by coccicheck: fs/proc/vmcore.c:1503:2-7: WARNING: NULL check before some freeing functions is not needed. Link: https://lkml.kernel.org/r/1611216753-44598-1-git-send-email-abaci-bugfix@linux.alibaba.com Signed-off-by: Yang Li <abaci-bugfix@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25fs: buffer: use raw page_memcg() on locked pageJohannes Weiner1-2/+2
alloc_page_buffers() currently uses get_mem_cgroup_from_page() for charging the buffers to the page owner, which does an rcu-protected page->memcg lookup and acquires a reference. But buffer allocation has the page lock held throughout, which pins the page to the memcg and thereby the memcg - neither rcu nor holding an extra reference during the allocation are necessary. Use a raw page_memcg() instead. This was the last user of get_mem_cgroup_from_page(), delete it. Link: https://lkml.kernel.org/r/20210209190126.97842-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25mm: memcontrol: convert NR_FILE_PMDMAPPED account to pagesMuchun Song1-1/+1
Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_FILE_PMDMAPPED account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-7-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pagesMuchun Song1-1/+1
Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_SHMEM_PMDMAPPED account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-6-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25mm: memcontrol: convert NR_SHMEM_THPS account to pagesMuchun Song1-1/+1
Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_SHMEM_THPS account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25mm: memcontrol: convert NR_FILE_THPS account to pagesMuchun Song1-1/+1
Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with if hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_FILE_THPS account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25mm: memcontrol: convert NR_ANON_THPS account to pagesMuchun Song1-1/+1
Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_ANON_THPS account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Feng Tang <feng.tang@intel.com> Cc: NeilBrown <neilb@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25fs/buffer.c: add checking buffer head stat before clearYang Guo1-1/+2
clear_buffer_new() is used to clear buffer new stat. When PAGE_SIZE is 64K, most buffer heads in the list are not needed to clear. clear_buffer_new() has an enpensive atomic modification operation, Let's add checking buffer head before clear it as __block_write_begin_int does which is good for performance. Link: https://lkml.kernel.org/r/1612332890-57918-1-git-send-email-zhangshaokun@hisilicon.com Signed-off-by: Yang Guo <guoyang2@huawei.com> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25mm/filemap: rename generic_file_buffered_read to filemap_readChristoph Hellwig1-1/+1
Rename generic_file_buffered_read to match the naming of filemap_fault, also update the written parameter to a more descriptive name and improve the kerneldoc comment. Link: https://lkml.kernel.org/r/20210122160140.223228-18-willy@infradead.org Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25mm/filemap: remove unused parameter and change to void type for ↵Baolin Wang1-5/+1
replace_page_cache_page() Since commit 74d609585d8b ("page cache: Add and replace pages using the XArray") was merged, the replace_page_cache_page() can not fail and always return 0, we can remove the redundant return value and void it. Moreover remove the unused gfp_mask. Link: https://lkml.kernel.org/r/609c30e5274ba15d8b90c872fd0d8ac437a9b2bb.1610071401.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>