path: root/fs
Age | Commit message | Author | Files | Lines
2025-07-14 | NFSD: Remove definitions for unused trace_nfsd_file_lru trace points | Chuck Lever | 1 | -2/+0
Events nfsd_file_lru_add_disposed and nfsd_file_lru_del_disposed were added by commit 4a0e73e635e3 ("NFSD: Leave open files out of the filecache LRU") but they were never used. Reported-by: Steven Rostedt <rostedt@goodmis.org> Closes: https://lore.kernel.org/linux-nfs/5ccae2f9-1560-4ac5-b506-b235ed4e4f4f@oracle.com/T/#t Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 | NFSD: Remove definition for trace_nfsd_file_unhash_and_queue | Chuck Lever | 1 | -1/+0
trace_nfsd_file_unhash_and_queue() was removed by commit ac3a2585f018 ("nfsd: rework refcounting in filecache"). Reported-by: Steven Rostedt <rostedt@goodmis.org> Closes: https://lore.kernel.org/linux-nfs/5ccae2f9-1560-4ac5-b506-b235ed4e4f4f@oracle.com/T/#t Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 | nfsd: Use correct error code when decoding extents | Sergey Bashirov | 4 | -27/+73
Update error codes in decoding functions of block and scsi layout drivers to match the core nfsd code. NFS4ERR_INVAL means that the server was able to decode the request, but the decoded values are invalid. Use NFS4ERR_BADXDR instead to indicate a decoding error. Also map ENOMEM to the NFS status code NFS4ERR_DELAY. Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
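The distinction drawn in this commit message can be illustrated with a small user-space sketch. It is not the nfsd decoder: `demo_extent`, `demo_be64` and `demo_decode_extent` are hypothetical names, and treating a NULL output buffer as a failed allocation is an assumption made only to keep the example short. The numeric status values are the ones assigned by the NFSv4 protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* NFSv4 status values as assigned by the protocol (RFC 8881). */
enum {
	NFS4_OK        = 0,
	NFS4ERR_INVAL  = 22,
	NFS4ERR_DELAY  = 10008,
	NFS4ERR_BADXDR = 10036,
};

/* Hypothetical decoded extent; not the kernel's structure. */
struct demo_extent {
	uint64_t offset;
	uint64_t length;
};

/* Read one big-endian 64-bit value, the way XDR encodes a hyper. */
static uint64_t demo_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Hypothetical decoder for a 16-byte wire extent (offset, length).
 * Assumption: out == NULL models a failed allocation by the caller.
 */
int demo_decode_extent(const uint8_t *buf, size_t len, struct demo_extent *out)
{
	assert(buf != NULL);

	if (out == NULL)
		return NFS4ERR_DELAY;   /* out of memory: client may retry later */
	if (len < 16)
		return NFS4ERR_BADXDR;  /* truncated: the request cannot be decoded */

	out->offset = demo_be64(buf);
	out->length = demo_be64(buf + 8);

	if (out->length == 0)
		return NFS4ERR_INVAL;   /* decoded fine, but the value is invalid */
	return NFS4_OK;
}
```

A caller would pass the raw wire bytes and check the returned status; only the last check corresponds to "decoded, but invalid".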
2025-07-14 | NFSD: Remove the cap on number of operations per NFSv4 COMPOUND | Chuck Lever | 5 | -20/+3
This limit has always been a sanity check; in nearly all cases a large COMPOUND is a sign of a malfunctioning client. The only real limit on COMPOUND size and complexity is the size of NFSD's send and receive buffers. However, there are a few cases where a large COMPOUND is sane. For example, when a client implementation wants to walk down a long file pathname in a single round trip. A small risk is that now a client can construct a COMPOUND request that can keep a single nfsd thread busy for quite some time. Suggested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 | NFSD: Make nfsd_genl_rqstp::rq_ops array best-effort | Chuck Lever | 2 | -2/+3
To enable NFSD to handle NFSv4 COMPOUNDs of unrestricted size, resize the array in struct nfsd_genl_rqstp so it saves only up to 16 operations per COMPOUND. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
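A "best-effort" array of this kind can be sketched as follows. The struct and function names are hypothetical and the layout is not that of struct nfsd_genl_rqstp; only the cap of 16 saved operations comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SAVED_OPS 16	/* cap taken from the commit message */

/* Hypothetical reporting record; not the kernel's struct nfsd_genl_rqstp. */
struct demo_rqst_info {
	unsigned int nr_ops;			/* opcodes actually saved */
	uint32_t ops[DEMO_MAX_SAVED_OPS];
};

/*
 * Save at most DEMO_MAX_SAVED_OPS opcodes of a COMPOUND of any length.
 * Operations beyond the cap are not recorded, but the request itself
 * is not rejected.
 */
void demo_save_ops(struct demo_rqst_info *info,
		   const uint32_t *opnums, size_t opcnt)
{
	size_t n = opcnt < DEMO_MAX_SAVED_OPS ? opcnt : DEMO_MAX_SAVED_OPS;

	assert(info != NULL);
	info->nr_ops = (unsigned int)n;
	for (size_t i = 0; i < n; i++)
		info->ops[i] = opnums[i];
}
```

The point of the design is that the reporting structure stays fixed in size while the COMPOUND itself may grow without limit.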
2025-07-14 | NFSD: Rename a function parameter | Chuck Lever | 1 | -14/+14
Clean up: A function parameter called "rqstp" typically refers to an object of type "struct svc_rqst", so it's confusing when such a parameter refers to a different struct type with field names that are very similar to svc_rqst. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 | NFSD: detect mismatch of file handle and delegation stateid in OPEN op | Dai Ngo | 1 | -0/+14
When the client sends an OPEN with claim type CLAIM_DELEG_CUR_FH or CLAIM_DELEGATE_CUR, the delegation stateid and the file handle must belong to the same file, otherwise return NFS4ERR_INVAL. Note that RFC8881, section 8.2.4, mandates the server to return NFS4ERR_BAD_STATEID if the selected table entry does not match the current filehandle. However, returning NFS4ERR_BAD_STATEID in the OPEN causes the client to retry the operation and therefore puts the client into a loop. To avoid this situation we return NFS4ERR_INVAL instead. Reported-by: Petro Pavlov <petro.pavlov@vastdata.com> Fixes: c44c5eeb2c02 ("[PATCH] nfsd4: add open state code for CLAIM_DELEGATE_CUR") Cc: stable@vger.kernel.org Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 | nfsd: handle get_client_locked() failure in nfsd4_setclientid_confirm() | Jeff Layton | 1 | -5/+15
Lei Lu recently reported that nfsd4_setclientid_confirm() did not check the return value from get_client_locked(). A SETCLIENTID_CONFIRM could race with a confirmed client expiring and fail to get a reference. That could later lead to a UAF. Fix this by getting a reference early in the case where there is an extant confirmed client. If that fails then treat it as if there were no confirmed client found at all. In the case where the unconfirmed client is expiring, just fail and return the result from get_client_locked(). Reported-by: lei lu <llfamsec@gmail.com> Closes: https://lore.kernel.org/linux-nfs/CAEBF3_b=UvqzNKdnfD_52L05Mqrqui9vZ2eFamgAbV0WG+FNWQ@mail.gmail.com/ Fixes: d20c11d86d8f ("nfsd: Protect session creation and client confirm using client_lock") Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 | nfsd: Change the type of ek_fsidtype from int to u8 and use kstrtou8 | Su Hui | 3 | -8/+6
The valid values for ek_fsidtype are actually 0-7, so it's better to change the type to u8. Also use kstrtou8() to replace simple_strtoul(); kstrtou8() is safer and more suitable for u8. Suggested-by: NeilBrown <neil@brown.name> Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
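The reason a strict parser is preferred can be shown with a user-space analogue. This is a sketch, not the kernel's kstrtou8(): `demo_parse_u8` and `demo_parse_fsidtype` are hypothetical, built on the standard strtoul(), and they simplify the kernel helper's behaviour (for instance, a leading '+' is rejected here). Unlike a bare strtoul-style call, the sketch rejects trailing garbage and out-of-range values instead of silently truncating.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Strict decimal parse into a uint8_t. Returns 0 on success,
 * -EINVAL for malformed input and -ERANGE for values above 255.
 * A single trailing newline is tolerated, as is common for values
 * written through a file interface.
 */
int demo_parse_u8(const char *s, uint8_t *res)
{
	char *end;
	unsigned long v;

	assert(s != NULL && res != NULL);

	if (!isdigit((unsigned char)s[0]))
		return -EINVAL;		/* empty string, sign or whitespace */

	errno = 0;
	v = strtoul(s, &end, 10);
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage such as "7x" */
	if (errno == ERANGE || v > UINT8_MAX)
		return -ERANGE;

	*res = (uint8_t)v;
	return 0;
}

/* Parse an fsid type and enforce the 0-7 range stated in the commit. */
int demo_parse_fsidtype(const char *s, uint8_t *fsidtype)
{
	uint8_t tmp;
	int err = demo_parse_u8(s, &tmp);

	if (err)
		return err;
	if (tmp > 7)
		return -EINVAL;
	*fsidtype = tmp;
	return 0;
}
```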
2025-07-14 | sunrpc: simplify xdr_init_encode_pages | Christoph Hellwig | 2 | -2/+2
The rqst argument to xdr_init_encode_pages is set to NULL by all callers, and pages is always set to buf->pages. Remove the two arguments and hardcode the assignments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 | NFSD: release read access of nfs4_file when a write delegation is returned | Dai Ngo | 2 | -1/+9
When a write delegation is returned, check if read access was added to nfs4_file when client opens file with WRONLY, and release it. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 | NFSD: Offer write delegation for OPEN with OPEN4_SHARE_ACCESS_WRITE | Dai Ngo | 1 | -28/+47
RFC8881, section 9.1.2 says: "In the case of READ, the server may perform the corresponding check on the access mode, or it may choose to allow READ for OPEN4_SHARE_ACCESS_WRITE, to accommodate clients whose WRITE implementation may unavoidably do reads (e.g., due to buffer cache constraints)." and in section 10.4.1: "Similarly, when closing a file opened for OPEN4_SHARE_ACCESS_WRITE/ OPEN4_SHARE_ACCESS_BOTH and if an OPEN_DELEGATE_WRITE delegation is in effect" This patch allows READ using a write delegation stateid granted on OPENs with OPEN4_SHARE_ACCESS_WRITE only, to accommodate clients whose WRITE implementation may unavoidably do reads (e.g., due to buffer cache constraints). For a write delegation granted for an OPEN with OPEN4_SHARE_ACCESS_WRITE, a new nfsd_file and a struct file are allocated to use for reads. The nfsd_file is freed when the file is closed by release_all_access. Suggested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14 | netfs: Fix race between cache write completion and ALL_QUEUED being set | David Howells | 1 | -0/+4
When netfslib is issuing subrequests, the subrequests start processing immediately and may complete before we reach the end of the issuing function. At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED to indicate to the collector that we aren't going to issue any more subreqs and that it can do the final notifications and cleanup. Now, this isn't a problem if the request is synchronous (NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be done in-thread and we're guaranteed an opportunity to run the collector. However, if the request is asynchronous, collection is primarily triggered by the termination of subrequests queuing it on a workqueue. Now, a race can occur here if the app thread sets ALL_QUEUED after the last subrequest terminates. This can happen most easily with the copy2cache code (as used by Ceph) where, in the collection routine of a read request, an asynchronous write request is spawned to copy data to the cache. Folios are added to the write request as they're unlocked, but there may be a delay before ALL_QUEUED is set as the write subrequests may complete before we get there. If all the write subreqs have finished by the ALL_QUEUED point, no further events happen and the collection never happens, leaving the request hanging. Fix this by queuing the collector after setting ALL_QUEUED. This is a bit heavy-handed and it may be sufficient to do it only if there are no extant subreqs. Also add a tracepoint to cross-reference both requests in a copy-to-request operation and add a trace to the netfs_rreq tracepoint to indicate the setting of ALL_QUEUED. 
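The lost-wakeup pattern described in this message can be modelled deterministically in a few lines. This is a single-threaded sketch with hypothetical names (`demo_request`, `demo_collect`, and so on), not netfslib code: the "collector" runs only when something triggers it, so the ordering of events decides whether the request ever completes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request model; not the netfs request structure. */
struct demo_request {
	int  outstanding;	/* subrequests issued but not yet terminated */
	bool all_queued;	/* issuer will not add more subrequests */
	bool completed;		/* collector performed the final cleanup */
};

/* Collector: finish only when everything is queued and has terminated. */
void demo_collect(struct demo_request *r)
{
	if (r->all_queued && r->outstanding == 0)
		r->completed = true;
}

/* A terminating subrequest triggers the collector, as a work item would. */
void demo_subreq_terminated(struct demo_request *r)
{
	assert(r->outstanding > 0);
	r->outstanding--;
	demo_collect(r);
}

/*
 * End of the issuing function. Without kick_collector, a request whose
 * subrequests all finished earlier never gets collected; with it, the
 * collector is run once more after the flag is set.
 */
void demo_finish_issuing(struct demo_request *r, bool kick_collector)
{
	r->all_queued = true;
	if (kick_collector)
		demo_collect(r);
}
```

Running the last subrequest's termination before `demo_finish_issuing(r, false)` leaves `completed` false forever, which corresponds to the hang; passing `true` models queuing the collector after setting the flag.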
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Reported-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250711151005.2956810-3-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Paulo Alcantara <pc@manguebit.org> cc: Viacheslav Dubeyko <slava@dubeyko.com> cc: Alex Markuze <amarkuze@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: netfs@lists.linux.dev cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | netfs: Fix copy-to-cache so that it performs collection with ceph+fscache | David Howells | 1 | -0/+1
The netfs copy-to-cache that is used by Ceph with local caching sets up a new request to write data just read to the cache. The request is started and then left to look after itself whilst the app continues. The request gets notified by the backing fs upon completion of the async DIO write, but then tries to wake up the app because NETFS_RREQ_OFFLOAD_COLLECTION isn't set - but the app isn't waiting there, and so the request just hangs. Fix this by setting NETFS_RREQ_OFFLOAD_COLLECTION which causes the notification from the backing filesystem to put the collection onto a work queue instead. Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Reported-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250711151005.2956810-2-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Paulo Alcantara <pc@manguebit.org> cc: Viacheslav Dubeyko <slava@dubeyko.com> cc: Alex Markuze <amarkuze@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: netfs@lists.linux.dev cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: build the writeback code without CONFIG_BLOCK | Christoph Hellwig | 2 | -55/+64
Allow fuse to use the iomap writeback code even when CONFIG_BLOCK is not enabled. Do this with an ifdef instead of a separate file to keep the iomap_folio_state local to buffered-io.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-15-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: add read_folio_range() handler for buffered writes | Christoph Hellwig | 1 | -4/+9
Add a read_folio_range() handler for buffered writes that filesystems may pass in if they wish to provide a custom handler for synchronously reading in the contents of a folio. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> [hch: renamed to read_folio_range, pass less arguments] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-14-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: improve argument passing to iomap_read_folio_sync | Christoph Hellwig | 1 | -8/+8
Pass the iomap_iter and derive the map inside iomap_read_folio_sync instead of in the caller, and use the more descriptive srcmap name for the source iomap. Stop passing the offset into folio argument as it can be derived from the folio and the file offset. Rename the variables for the offset into the file and the length to be more descriptive and match the rest of the code. Rename the function itself to iomap_read_folio_range to make the use more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-13-hch@lst.de Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: replace iomap_folio_ops with iomap_write_ops | Christoph Hellwig | 9 | -53/+76
The iomap_folio_ops are only used for buffered writes, including the zero and unshare variants. Rename them to iomap_write_ops to better describe the usage, and pass them through the call chain like the other operation specific methods instead of through the iomap. xfs_iomap_valid grows a IOMAP_HOLE check to keep the existing behavior that never attached the folio_ops to a iomap representing a hole. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-12-hch@lst.de Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: export iomap_writeback_folio | Christoph Hellwig | 1 | -2/+2
Allow fuse to use iomap_writeback_folio for folio laundering. Note that the caller needs to manually submit the pending writeback context. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-11-hch@lst.de Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: move folio_unlock out of iomap_writeback_folio | Joanne Koong | 1 | -5/+4
Move unlocking the folio out of iomap_writeback_folio into the caller. This means the end writeback machinery is now run with the folio locked when no writeback happened, or writeback completed extremely fast. Note that having the folio locked over the call to folio_end_writeback in iomap_writeback_folio means that the dropbehind handling there will never run because the trylock fails. The only way this can happen is if the writepage either never wrote back any dirty data at all, in which case the dropbehind handling isn't needed, or if all writeback finished instantly, which is rather unlikely. Even in the latter case the dropbehind handling is an optional optimization so skipping it will not cause correctness issues. This prepares for exporting iomap_writeback_folio for use in folio laundering. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-10-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: rename iomap_writepage_map to iomap_writeback_folio | Christoph Hellwig | 2 | -6/+6
->writepage is gone, and our naming wasn't always that great to start with. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-9-hch@lst.de Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: move all ioend handling to ioend.c | Christoph Hellwig | 3 | -218/+219
Now that the writeback code has the proper abstractions, all the ioend code can be self-contained in ioend.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-8-hch@lst.de Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: add public helpers for uptodate state manipulation | Joanne Koong | 1 | -5/+15
Add a new iomap_start_folio_write helper to abstract away the write_bytes_pending handling, and export it and the existing iomap_finish_folio_write for non-iomap writeback in fuse. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-7-hch@lst.de Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
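The pairing of a "start" and a "finish" helper around a pending-bytes counter can be sketched in user space with C11 atomics. The names below are hypothetical and the structure is not iomap's per-folio state; the sketch assumes that writeback on the folio ends once the counter drops back to zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-folio state; not the kernel's iomap_folio_state. */
struct demo_folio_state {
	atomic_size_t write_bytes_pending;
	bool writeback_ended;
};

void demo_folio_state_init(struct demo_folio_state *s)
{
	atomic_init(&s->write_bytes_pending, 0);
	s->writeback_ended = false;
}

/* Account for len bytes that have been submitted for writeback. */
void demo_start_folio_write(struct demo_folio_state *s, size_t len)
{
	atomic_fetch_add(&s->write_bytes_pending, len);
}

/*
 * Account for len bytes whose writeback has completed. The caller that
 * brings the counter to zero ends writeback for the whole folio.
 */
void demo_finish_folio_write(struct demo_folio_state *s, size_t len)
{
	size_t before = atomic_fetch_sub(&s->write_bytes_pending, len);

	assert(before >= len);
	if (before == len)
		s->writeback_ended = true;
}
```

Hiding the counter behind two helpers lets a caller outside the block layer path do its own I/O without knowing how the pending bytes are tracked.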
2025-07-14 | iomap: hide ioends from the generic writeback code | Christoph Hellwig | 4 | -72/+81
Replace the ioend pointer in iomap_writeback_ctx with a void *wb_ctx one to facilitate non-block, non-ioend writeback for use. Rename the submit_ioend method to writeback_submit and make it mandatory so that the generic writeback code stops seeing ioends and bios. Co-developed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-6-hch@lst.de Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: refactor the writeback interface | Christoph Hellwig | 5 | -123/+157
Replace ->map_blocks with a new ->writeback_range, which differs in the following ways:
- it must also queue up the I/O for writeback, that is, call into the slightly refactored and extended-in-scope iomap_add_to_ioend for each region
- it can handle only a part of the requested region, that is, the retry loop for partial mappings moves to the caller
- it handles cleanup on failures as well, and thus also replaces the discard_folio method only implemented by XFS.
This allows the iomap writeback code to be used for file systems that are not block based, like fuse. Co-developed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-5-hch@lst.de Acked-by: Damien Le Moal <dlemoal@kernel.org> # zonefs Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
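The second point, a callback that may handle only part of the range while the caller loops, can be sketched as follows. This is a user-space model with hypothetical names and a simplified signature, not the iomap interface: the callback returns the number of bytes it handled or a negative errno, and the caller advances until the range is consumed.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback type: bytes handled (> 0) or a negative errno. */
typedef long (*demo_writeback_range_fn)(void *ctx, unsigned long pos,
					unsigned long len);

/*
 * Caller-side retry loop: keep calling the filesystem until the whole
 * dirty range [pos, pos + len) has been queued for writeback.
 */
int demo_writeback_folio(demo_writeback_range_fn fn, void *ctx,
			 unsigned long pos, unsigned long len)
{
	while (len > 0) {
		long done = fn(ctx, pos, len);

		if (done < 0)
			return (int)done;	/* callback already cleaned up */
		if (done == 0 || (unsigned long)done > len)
			return -EIO;		/* no progress or bogus result */
		pos += (unsigned long)done;
		len -= (unsigned long)done;
	}
	return 0;
}

/* Example backend that maps at most max_chunk bytes per call. */
struct demo_backend {
	unsigned long max_chunk;
	unsigned int calls;
};

long demo_chunked_range(void *ctx, unsigned long pos, unsigned long len)
{
	struct demo_backend *b = ctx;

	(void)pos;
	b->calls++;
	return (long)(len < b->max_chunk ? len : b->max_chunk);
}
```

With a backend that handles 4096 bytes per call, a 10000-byte range takes three calls; returning -EIO on zero progress is a choice made for this sketch to keep the loop from spinning.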
2025-07-14 | iomap: cleanup the pending writeback tracking in iomap_writepage_map_blocks | Joanne Koong | 1 | -6/+6
We don't care about the count of outstanding ioends, just if there is one. Replace the count variable passed to iomap_writepage_map_blocks with a boolean to make that more clear. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> [hch: rename the variable, update the commit message] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-4-hch@lst.de Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: pass more arguments using the iomap writeback context | Christoph Hellwig | 4 | -40/+52
Add inode and wpc fields to pass the inode and writeback context that are needed in the entire writeback call chain, and let the callers initialize all fields in the writeback context before calling iomap_writepages to simplify the argument passing. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-3-hch@lst.de Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | iomap: header diet | Christoph Hellwig | 7 | -25/+0
Drop various unused #include statements. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-2-hch@lst.de Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | fix a leak in fcntl_dirnotify() | Al Viro | 1 | -4/+4
[into #fixes, unless somebody objects] Lifetime of new_dn_mark is controlled by that of its ->fsn_mark, pointed to by new_fsn_mark. Unfortunately, a failure exit had been inserted between the allocation of new_dn_mark and the call of fsnotify_init_mark(), ending up with a leak. Fixes: 1934b212615d "file: reclaim 24 bytes from f_owner" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250712171843.GB1880847@ZenIV Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14 | ext4: fix insufficient credits calculation in ext4_meta_trans_blocks() | Zhang Yi | 1 | -2/+2
The calculation of journal credits in ext4_meta_trans_blocks() should include pextents, as each extent separately may be allocated from a different group and thus need to update different bitmap and group descriptor block. Fixes: 0e32d8617012 ("ext4: correct the journal credits calculations of allocating blocks") Reported-by: Jan Kara <jack@suse.cz> Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-11-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 | ext4: replace ext4_writepage_trans_blocks() | Zhang Yi | 6 | -27/+25
After ext4 supports large folios, the semantics of reserving credits in pages is no longer applicable. In most scenarios, reserving credits in extents is sufficient. Therefore, introduce ext4_chunk_trans_extent() to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the only remaining location where we are still processing extents in pages. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 | ext4: reserved credits for one extent during the folio writeback | Zhang Yi | 1 | -17/+8
After ext4 supports large folios, reserving journal credits for one maximum-ordered folio based on the worst case scenario during the writeback process can easily exceed the maximum transaction credits. Additionally, reserving journal credits for one page is also no longer appropriate. Currently, the folio writeback process can either extend the journal credits or initiate a new transaction if the currently reserved journal credits are insufficient. Therefore, it can be modified to reserve credits for only one extent at the outset. In most cases involving continuous mapping, these credits are generally adequate, and we may only need to perform some basic credit expansion. However, in extreme cases where the block size and folio size differ significantly, or when the folios are sufficiently discontinuous, it may be necessary to restart a new transaction and resubmit the folios. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 | ext4: correct the reserved credits for extent conversion | Zhang Yi | 1 | -3/+3
Now, we reserve journal credits for converting extents in only one page to written state when the I/O operation is complete. This is insufficient when large folio is enabled. Fix this by reserving credits for converting up to one extent per block in the largest 2MB folio, this calculation should only involve extents index and leaf blocks, so it should not estimate too many credits. Fixes: 7ac67301e82f ("ext4: enable large folio for regular file") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 | ext4: enhance tracepoints during the folios writeback | Zhang Yi | 1 | -1/+4
After mpage_map_and_submit_extent() supports restarting handle if credits are insufficient during allocating blocks, it is more likely to exit the current mapping iteration and continue to process the current processing partially mapped folio again. The existing tracepoints are not sufficient to track this situation, so enhance the tracepoints to track the writeback position and the return value before and after submitting the folios. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 | ext4: restart handle if credits are insufficient during allocating blocks | Zhang Yi | 1 | -5/+36
After large folios are supported on ext4, writing back a sufficiently large and discontinuous folio may consume a significant number of journal credits, placing considerable strain on the journal. For example, in a 20GB filesystem with 1K block size and 1MB journal size, writing back a 2MB folio could require thousands of credits in the worst-case scenario (when each block is discontinuous and distributed across different block groups), potentially exceeding the journal size. This issue can also occur in ext4_write_begin() and ext4_page_mkwrite() when delalloc is not enabled. Fix this by ensuring that there are sufficient journal credits before allocating an extent in mpage_map_one_extent() and ext4_block_write_begin(). If there are not enough credits, return -EAGAIN, exit the current mapping loop, restart a new handle and a new transaction, and allocate blocks on this folio again in the next iteration. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
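The arithmetic behind the example, and the restart-on-EAGAIN loop, can be sketched in user space. Everything here is a model with hypothetical names: a "handle" is just a credit budget, the per-extent cost is a made-up parameter, and returning -ENOSPC when a single extent can never fit is a choice of this sketch rather than ext4 behaviour. A 2MB folio with 1K blocks has 2048 blocks, while a 1MB journal with 1K blocks has only 1024, so even one credit per discontinuous block cannot fit in one transaction.

```c
#include <assert.h>
#include <errno.h>

/* Number of blocks covered by a region of the given size. */
unsigned long demo_blocks(unsigned long bytes, unsigned long block_size)
{
	return bytes / block_size;
}

/* Hypothetical journal handle: only tracks the remaining credits. */
struct demo_handle {
	unsigned int credits;
};

/* Map one extent, or report -EAGAIN when the handle is short on credits. */
int demo_map_one_extent(struct demo_handle *h, unsigned int cost)
{
	if (h->credits < cost)
		return -EAGAIN;
	h->credits -= cost;
	return 0;
}

/*
 * Map nr_extents extents, restarting the handle whenever it runs out.
 * Returns the number of restarts, or -ENOSPC if one extent can never fit.
 */
int demo_map_folio(unsigned int nr_extents, unsigned int cost,
		   unsigned int credits_per_handle)
{
	struct demo_handle h = { credits_per_handle };
	unsigned int mapped = 0;
	int restarts = 0;

	if (cost > credits_per_handle)
		return -ENOSPC;

	while (mapped < nr_extents) {
		if (demo_map_one_extent(&h, cost) == -EAGAIN) {
			h.credits = credits_per_handle;	/* new handle, new transaction */
			restarts++;
			continue;			/* retry the same extent */
		}
		mapped++;
	}
	return restarts;
}
```

With 2048 extents at 4 credits each and 256 credits per handle, each handle covers 64 extents, so the folio is mapped with 31 restarts instead of one oversized reservation.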
2025-07-14 | ext4: refactor the block allocation process of ext4_page_mkwrite() | Zhang Yi | 1 | -45/+50
The block allocation process and error handling in ext4_page_mkwrite() is complex now. Refactor it by introducing a new helper function, ext4_block_page_mkwrite(). It will call ext4_block_write_begin() to allocate blocks instead of directly calling block_page_mkwrite(). Preparing to implement retry logic in a subsequent patch to address situations where the reserved journal credits are insufficient. Additionally, this modification will help prevent potential deadlocks that may occur when waiting for folio writeback while holding the transaction handle. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 | ext4: fix stale data if it bail out of the extents mapping loop | Zhang Yi | 1 | -1/+50
During the process of writing back folios, if mpage_map_and_submit_extent() exits the extent mapping loop due to an ENOSPC or ENOMEM error, it may result in stale data or filesystem inconsistency in environments where the block size is smaller than the folio size. When mapping a discontinuous folio in mpage_map_and_submit_extent(), some buffers may have already been mapped. If we exit the mapping loop prematurely, the folio data within the mapped range will not be written back, and the file's disk size will not be updated. Once the transaction that includes this range of extents is committed, this can lead to stale data or filesystem inconsistency. Fix this by submitting the partially mapped folio that is currently being processed. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 | ext4: move the calculation of wbc->nr_to_write to mpage_folio_done() | Zhang Yi | 1 | -2/+1
mpage_folio_done() is a more appropriate place than mpage_submit_folio() for updating wbc->nr_to_write after we have submitted a fully mapped folio. This prepares for allowing mpage_submit_folio() to submit a partially mapped folio that is still being processed. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250707140814.542883-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 | ext4: process folios writeback in bytes | Zhang Yi | 1 | -34/+36
Since ext4 supports large folios, processing writeback in pages is no longer appropriate; process writeback in bytes instead. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 | mm: rename PAGE_MAPPING_* to FOLIO_MAPPING_* | David Hildenbrand | 1 | -2/+2
Now that the mapping flags are only used for folios, let's rename the defines. Link: https://lkml.kernel.org/r/20250704102524.326966-27-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-14mm/hugetlb: remove prepare_hugepage_range()Peter Xu1-6/+2
Only mips and loongarch implemented this API, and all it did was check against stack overflow for either len or addr. The arch_get_unmapped_area*() functions of each architecture already do that, even though the checks may not be 100% identical. For example, on both architectures there is a trivial difference in how the stack top is defined: the old code uses STACK_TOP, which may be slightly smaller than TASK_SIZE on either of them, but the hope is that this is not a problem. The whole API is therefore pretty much obsolete by now, so remove it completely. Link: https://lkml.kernel.org/r/20250627160707.2124580-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-14smb: invalidate and close cached directory when creating child entriesBharath SM1-2/+4
When a parent lease key is passed to the server during a create operation while holding a directory lease, the server may not send a lease break to the client. In such cases, it becomes the client’s responsibility to ensure cache consistency. This led to a problem where directory listings (e.g., `ls` or `readdir`) could return stale results after a new file is created.

eg:
  ls /mnt/share/
  touch /mnt/share/file1
  ls /mnt/share/

In this scenario, the final `ls` may not show `file1` due to the stale directory cache.

For now, fix this by marking the cached directory as invalid if using the parent lease key during create, and explicitly closing the cached directory after successful file creation.

Fixes: 037e1bae588eacf ("smb: client: use ParentLeaseKey in cifs_do_create")
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
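The invalidate-on-create idea in the commit above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the `demo_*` names and structures are hypothetical and do not correspond to the cifs cached-directory code, and the "server" is just an in-memory array.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_ENTRIES 8

/* Hypothetical model of the server-side directory contents. */
struct demo_server_dir {
	const char *names[DEMO_MAX_ENTRIES];
	int count;
};

/* Hypothetical model of the client's cached directory listing. */
struct demo_dir_cache {
	int valid;  /* non-zero while the cached listing may be served */
	int count;  /* number of entries seen at the last refresh */
};

/* Return the number of entries, refreshing from the server if the cache is invalid. */
static int demo_readdir(struct demo_dir_cache *cache,
			const struct demo_server_dir *srv)
{
	if (!cache->valid) {
		cache->count = srv->count;
		cache->valid = 1;
	}
	return cache->count;
}

/*
 * Create an entry. The server is assumed not to send a lease break,
 * so the client invalidates its own cached listing.
 */
static int demo_create(struct demo_dir_cache *cache,
		       struct demo_server_dir *srv, const char *name)
{
	if (srv->count == DEMO_MAX_ENTRIES)
		return -1;
	cache->valid = 0;
	srv->names[srv->count++] = name;
	return 0;
}
```

Calling `demo_readdir()`, then `demo_create()`, then `demo_readdir()` again mirrors the `ls`, `touch`, `ls` sequence from the commit message: without the `cache->valid = 0` line the second listing would still report the old entry count.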
2025-07-14smb: client: fix use-after-free in crypt_message when using async cryptoWang Zhaolong1-1/+6
The CVE-2024-50047 fix removed asynchronous crypto handling from crypt_message(), assuming all crypto operations are synchronous. However, when hardware crypto accelerators are used, this can cause use-after-free crashes:

  crypt_message()
    // Allocate the creq buffer containing the req
    creq = smb2_get_aead_req(..., &req);

    // Async encryption returns -EINPROGRESS immediately
    rc = enc ? crypto_aead_encrypt(req) : crypto_aead_decrypt(req);

    // Free creq while async operation is still in progress
    kvfree_sensitive(creq, ...);

Hardware crypto modules often implement async AEAD operations for performance. When crypto_aead_encrypt/decrypt() returns -EINPROGRESS, the operation completes asynchronously. Without crypto_wait_req(), the function immediately frees the request buffer, leading to crashes when the driver later accesses the freed memory.

This results in a use-after-free condition when the hardware crypto driver later accesses the freed request structure, leading to kernel crashes with NULL pointer dereferences.

The issue occurs because crypto_alloc_aead() with mask=0 doesn't guarantee synchronous operation. Even without CRYPTO_ALG_ASYNC in the mask, async implementations can be selected.

Fix by restoring the async crypto handling:
- DECLARE_CRYPTO_WAIT(wait) for completion tracking
- aead_request_set_callback() for async completion notification
- crypto_wait_req() to wait for operation completion

This ensures the request buffer isn't freed until the crypto operation completes, whether synchronous or asynchronous, while preserving the CVE-2024-50047 fix.

Fixes: b0abcd65ec54 ("smb: client: fix UAF in async decryption")
Link: https://lore.kernel.org/all/8b784a13-87b0-4131-9ff9-7a8993538749@huaweicloud.com/
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
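The wait-before-free pattern described in the commit above can be sketched in plain userspace C. This is a minimal, single-threaded model, not the kernel code: every `demo_*` name is a hypothetical stand-in, the "accelerator" only runs when polled, and the buffer transform is a placeholder XOR rather than real AEAD. In the kernel, the corresponding pieces named by the commit message are DECLARE_CRYPTO_WAIT(), aead_request_set_callback() and crypto_wait_req().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical completion tracker (plays the role of the crypto wait object). */
struct demo_wait {
	int done;  /* set by the completion callback */
	int err;   /* final status reported by the "driver" */
};

/* Hypothetical request; the "driver" keeps a pointer to it until completion. */
struct demo_req {
	unsigned char *buf;
	size_t len;
	void (*complete)(void *data, int err);
	void *data;
};

/* One-slot queue modelling an asynchronous accelerator. */
static struct demo_req *demo_pending;

/* Completion callback: records the result and marks the wait as done. */
static void demo_req_done(void *data, int err)
{
	struct demo_wait *wait = data;

	wait->err = err;
	wait->done = 1;
}

/* Asynchronous submit: queues the request and returns -EINPROGRESS at once. */
static int demo_encrypt(struct demo_req *req)
{
	demo_pending = req;
	return -EINPROGRESS;
}

/* "Driver" side: dereferences the request later, so it must still be allocated. */
static void demo_engine_run(void)
{
	struct demo_req *req = demo_pending;

	if (!req)
		return;
	demo_pending = NULL;
	for (size_t i = 0; i < req->len; i++)
		req->buf[i] ^= 0x5a;  /* placeholder transform, not real crypto */
	req->complete(req->data, 0);
}

/* Wait helper: a synchronous result passes through, an async one is waited for. */
static int demo_wait_req(int rc, struct demo_wait *wait)
{
	if (rc != -EINPROGRESS && rc != -EBUSY)
		return rc;
	while (!wait->done)
		demo_engine_run();
	return wait->err;
}

/* The request is freed only after the operation has completed. */
int demo_crypt_message(unsigned char *buf, size_t len)
{
	struct demo_wait wait = { 0, 0 };
	struct demo_req *req = malloc(sizeof(*req));
	int rc;

	if (!req)
		return -ENOMEM;
	req->buf = buf;
	req->len = len;
	req->complete = demo_req_done;
	req->data = &wait;

	rc = demo_wait_req(demo_encrypt(req), &wait);
	free(req);  /* safe: the "driver" no longer references req */
	return rc;
}
```

If `free(req)` were called directly after `demo_encrypt()` returned -EINPROGRESS, `demo_engine_run()` would later dereference freed memory, which is the shape of the bug the commit describes.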
2025-07-14smb: client: fix use-after-free in cifs_oplock_breakWang Zhaolong1-1/+9
A race condition can occur in cifs_oplock_break() leading to a use-after-free of the cinode structure when unmounting:

  cifs_oplock_break()
    _cifsFileInfo_put(cfile)
      cifsFileInfo_put_final()
        cifs_sb_deactive()
          [last ref, start releasing sb]
            kill_sb()
              kill_anon_super()
                generic_shutdown_super()
                  evict_inodes()
                    dispose_list()
                      evict()
                        destroy_inode()
                          call_rcu(&inode->i_rcu, i_callback)
    spin_lock(&cinode->open_file_lock)  <- OK
                            [later] i_callback()
                              cifs_free_inode()
                                kmem_cache_free(cinode)
    spin_unlock(&cinode->open_file_lock)  <- UAF
    cifs_done_oplock_break(cinode)  <- UAF

The issue occurs when umount has already released its reference to the superblock. When _cifsFileInfo_put() calls cifs_sb_deactive(), this releases the last reference, triggering the immediate cleanup of all inodes under RCU. However, cifs_oplock_break() continues to access the cinode after this point, resulting in use-after-free.

Fix this by holding an extra reference to the superblock during the entire oplock break operation. This ensures that the superblock and its inodes remain valid until the oplock break completes.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=220309
Fixes: b98749cac4a6 ("CIFS: keep FileInfo handle live during oplock break")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
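The fix described above, holding an extra reference for the duration of the operation, can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: the `demo_*` types and functions are hypothetical, the reference count is a plain integer with no locking or RCU, and teardown is simplified to freeing a single inode.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical inode owned by the superblock model below. */
struct demo_inode {
	int oplock_break_done;
};

/* Hypothetical superblock: tearing it down frees its inode. */
struct demo_sb {
	int active;               /* number of active references */
	struct demo_inode *inode; /* freed when the last reference is dropped */
};

static int demo_teardowns;  /* counts how many times teardown ran */

static void demo_sb_get(struct demo_sb *sb)
{
	sb->active++;
}

static void demo_sb_put(struct demo_sb *sb)
{
	if (--sb->active == 0) {
		free(sb->inode);
		sb->inode = NULL;
		demo_teardowns++;
	}
}

/*
 * Model of the oplock break path. Returns 1 if the inode was still
 * valid when it was touched after the file reference was dropped.
 */
int demo_oplock_break(struct demo_sb *sb)
{
	int ok;

	demo_sb_get(sb);  /* extra reference held for the whole operation */

	demo_sb_put(sb);  /* models the file put that could drop the last reference */

	ok = (sb->inode != NULL);
	if (ok)
		sb->inode->oplock_break_done = 1;  /* safe: sb is still pinned */

	demo_sb_put(sb);  /* release the extra reference; teardown may run here */
	return ok;
}
```

Without the initial `demo_sb_get()`, the first `demo_sb_put()` would free the inode and the later access would be the use-after-free shown in the call trace.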
2025-07-14bcachefs: io_read: remove from async obj list in rbio_done()Kent Overstreet1-0/+5
Previously, only split rbios allocated in io_read.c would be removed from the async obj list. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-13ext4: remove unused EXT_STATS macro from ext4_extents.hBaolin Liu1-7/+0
The EXT_STATS macro in fs/ext4/ext4_extents.h has been defined but never used in the codebase since its introduction. This patch removes it. Analysis: 1. No references found in fs/ext4/ or other kernel code. 2. No impact on compilation or functionality. 3. Git history shows it was never utilized. Signed-off-by: Baolin Liu <liubaolin@kylinos.cn> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250527053805.1550912-1-liubaolin12138@163.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13Merge branch 'mm-hotfixes-stable' into mm-stable to pick up changes whichAndrew Morton1-7/+7
are required for a merge of the series "mm: folio_pte_batch() improvements".
2025-07-12Merge tag 'mm-hotfixes-stable-2025-07-11-16-16' of ↵Linus Torvalds1-7/+7
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "19 hotfixes. A whopping 16 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels.
  14 are for MM. Three gdb-script fixes and a kallsyms build fix"
* tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Revert "sched/numa: add statistics of numa balance task"
  mm: fix the inaccurate memory statistics issue for users
  mm/damon: fix divide by zero in damon_get_intervals_score()
  samples/damon: fix damon sample mtier for start failure
  samples/damon: fix damon sample wsse for start failure
  samples/damon: fix damon sample prcl for start failure
  kasan: remove kasan_find_vm_area() to prevent possible deadlock
  scripts: gdb: vfs: support external dentry names
  mm/migrate: fix do_pages_stat in compat mode
  mm/damon/core: handle damon_call_control as normal under kdmond deactivation
  mm/rmap: fix potential out-of-bounds page table access during batched unmap
  mm/hugetlb: don't crash when allocating a folio if there are no resv
  scripts/gdb: de-reference per-CPU MCE interrupts
  scripts/gdb: fix interrupts.py after maple tree conversion
  maple_tree: fix mt_destroy_walk() on root leaf node
  mm/vmalloc: leave lazy MMU mode on PTE mapping error
  scripts/gdb: fix interrupts display after MCP on x86
  lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users()
  kallsyms: fix build without execinfo
2025-07-12Merge tag 'erofs-for-6.16-rc6-fixes' of ↵Linus Torvalds7-35/+41
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
 "Fix for a cache aliasing issue by adding missing flush_dcache_folio(), which causes execution failures on some arm32 setups.
  Fix for large compressed fragments, which could be generated by -Eall-fragments option (but should be rare) and was rejected by mistake due to an on-disk hardening commit.
  The remaining ones are small fixes. Summary:
   - Address cache aliasing for mappable page cache folios
   - Allow readdir() to be interrupted
   - Fix large fragment handling which was errored out by mistake
   - Add missing tracepoints
   - Use memcpy_to_folio() to replace copy_to_iter() for inline data"
* tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix large fragment handling
  erofs: allow readdir() to be interrupted
  erofs: address D-cache aliasing
  erofs: use memcpy_to_folio() to replace copy_to_iter()
  erofs: fix to add missing tracepoint in erofs_read_folio()
  erofs: fix to add missing tracepoint in erofs_readahead()
2025-07-12Merge tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefsLinus Torvalds14-108/+138
Pull bcachefs fixes from Kent Overstreet.
* tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefs:
  bcachefs: Don't set BCH_FS_error on transaction restart
  bcachefs: Fix additional misalignment in journal space calculations
  bcachefs: Don't schedule non persistent passes persistently
  bcachefs: Fix bch2_btree_transactions_read() synchronization
  bcachefs: btree read retry fixes
  bcachefs: btree node scan no longer uses btree cache
  bcachefs: Tweak btree cache helpers for use by btree node scan
  bcachefs: Fix btree for nonexistent tree depth
  bcachefs: Fix bch2_io_failures_to_text()
  bcachefs: bch2_fpunch_snapshot()