kernel/linux.git/fs/nfs/blocklayout, branch v6.12.80

pnfs/blocklayout: Fix memory leak in bl_parse_scsi()

2026-01-23T10:18:36+00:00

[ Upstream commit 5a74af51c3a6f4cd22c128b0c1c019f68fa90011 ] In bl_parse_scsi(), if the block device length is zero, the function returns immediately without releasing the file reference obtained via bl_open_path(), leading to a memory leak. Fix this by jumping to the out_blkdev_put label to ensure the file reference is properly released. Fixes: d76c769c8db4c ("pnfs/blocklayout: Don't add zero-length pnfs_block_dev") Signed-off-by: Zilin Guan Signed-off-by: Trond Myklebust Signed-off-by: Sasha Levin

pNFS: Fix uninited ptr deref in block/scsi layout

2025-08-20T16:30:50+00:00

[ Upstream commit 9768797c219326699778fba9cd3b607b2f1e7950 ] The error occurs on the third attempt to encode extents. When function ext_tree_prepare_commit() reallocates a larger buffer to retry encoding extents, the "layoutupdate_pages" page array is initialized only after the retry loop. But ext_tree_free_commitdata() is called on every iteration and tries to put pages in the array, thus dereferencing uninitialized pointers. An additional problem is that there is no limit on the maximum possible buffer_size. When there are too many extents, the client may create a layoutcommit that is larger than the maximum possible RPC size accepted by the server. During testing, we observed two typical scenarios. First, one memory page for extents is enough when we work with small files, append data to the end of the file, or preallocate extents before writing. But when we fill a new large file without preallocating, the number of extents can be huge, and counting the number of written extents in ext_tree_encode_commit() does not help much. Since this number increases even more between unlocking and locking of ext_tree, the reallocated buffer may not be large enough again and again. Co-developed-by: Konstantin Evtushenko Signed-off-by: Konstantin Evtushenko Signed-off-by: Sergey Bashirov Reviewed-by: Christoph Hellwig Link: https://lore.kernel.org/r/20250630183537.196479-2-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust Signed-off-by: Sasha Levin

pNFS: Fix disk addr range check in block/scsi layout

2025-08-20T16:30:50+00:00

[ Upstream commit 7db6e66663681abda54f81d5916db3a3b8b1a13d ] At the end of the isect translation, disc_addr represents the physical disk offset. Thus, end calculated from disk_addr is also a physical disk offset. Therefore, range checking should be done using map->disk_offset, not map->start. Signed-off-by: Sergey Bashirov Reviewed-by: Christoph Hellwig Link: https://lore.kernel.org/r/20250702133226.212537-1-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust Signed-off-by: Sasha Levin

pNFS: Fix stripe mapping in block/scsi layout

2025-08-20T16:30:49+00:00

[ Upstream commit 81438498a285759f31e843ac4800f82a5ce6521f ] Because of integer division, we need to carefully calculate the disk offset. Consider the example below for a stripe of 6 volumes, a chunk size of 4096, and an offset of 70000. chunk = div_u64(offset, dev->chunk_size) = 70000 / 4096 = 17 offset = chunk * dev->chunk_size = 17 * 4096 = 69632 disk_offset_wrong = div_u64(offset, dev->nr_children) = 69632 / 6 = 11605 disk_chunk = div_u64(chunk, dev->nr_children) = 17 / 6 = 2 disk_offset = disk_chunk * dev->chunk_size = 2 * 4096 = 8192 Signed-off-by: Sergey Bashirov Reviewed-by: Christoph Hellwig Link: https://lore.kernel.org/r/20250701122341.199112-1-sergeybashirov@gmail.com Signed-off-by: Trond Myklebust Signed-off-by: Sasha Levin

nfs/blocklayout: Limit repeat device registration on failure

2024-12-05T13:03:09+00:00

[ Upstream commit 614733f9441ed53bb442d4734112ec1e24bd6da7 ] Every pNFS SCSI IO wants to do LAYOUTGET, then within the layout find the device which can drive GETDEVINFO, then finally may need to prep the device with a reservation. This slow work makes a mess of IO latencies if one of the later steps is going to fail for awhile. If we're unable to register a SCSI device, ensure we mark the device as unavailable so that it will timeout and be re-added via GETDEVINFO. This avoids repeated doomed attempts to register a device in the IO path. Add some clarifying comments as well. Fixes: d869da91cccb ("nfs/blocklayout: Fix premature PR key unregistration") Signed-off-by: Benjamin Coddington Reviewed-by: Christoph Hellwig Reviewed-by: Chuck Lever Signed-off-by: Trond Myklebust Signed-off-by: Sasha Levin

nfs/blocklayout: Don't attempt unregister for invalid block device

2024-12-05T13:03:09+00:00

[ Upstream commit 3a4ce14d9a6b868e0787e4582420b721c04ee41e ] Since commit d869da91cccb ("nfs/blocklayout: Fix premature PR key unregistration") an unmount of a pNFS SCSI layout-enabled NFS may dereference a NULL block_device in: bl_unregister_scsi+0x16/0xe0 [blocklayoutdriver] bl_free_device+0x70/0x80 [blocklayoutdriver] bl_free_deviceid_node+0x12/0x30 [blocklayoutdriver] nfs4_put_deviceid_node+0x60/0xc0 [nfsv4] nfs4_deviceid_purge_client+0x132/0x190 [nfsv4] unset_pnfs_layoutdriver+0x59/0x60 [nfsv4] nfs4_destroy_server+0x36/0x70 [nfsv4] nfs_free_server+0x23/0xe0 [nfs] deactivate_locked_super+0x30/0xb0 cleanup_mnt+0xba/0x150 task_work_run+0x59/0x90 syscall_exit_to_user_mode+0x217/0x220 do_syscall_64+0x8e/0x160 This happens because even though we were able to create the nfs4_deviceid_node, the lookup for the device was unable to attach the block device to the pnfs_block_dev. If we never found a block device to register, we can avoid this case with the PNFS_BDEV_REGISTERED flag. Move the deref behind the test for the flag. Fixes: d869da91cccb ("nfs/blocklayout: Fix premature PR key unregistration") Signed-off-by: Benjamin Coddington Reviewed-by: Christoph Hellwig Reviewed-by: Chuck Lever Signed-off-by: Trond Myklebust Signed-off-by: Sasha Levin

nfs/blocklayout: add support for NVMe

2024-07-12T15:35:50+00:00

Look for the udev generated persistent device name for NVMe devices in addition to the SCSI ones and the Redhat-specific device mapper name. This is the client side implementation of RFC 9561 "Using the Parallel NFS (pNFS) SCSI Layout to Access Non-Volatile Memory Express (NVMe) Storage Devices". Note that the udev rules for nvme are a bit of a mess and udev will only create a link for the uuid if the NVMe namespace has one, and not the NGUID. As the current RFCs don't support UUID based identifications this means the layout can't be used on such namespaces out of the box. A small tweak to the udev rules can work around it, and as the real fix I will submit a draft to the IETF NFSv4 working group to support UUID-based identifiers for SCSI and NVMe. Signed-off-by: Christoph Hellwig Reviewed-by: Sagi Grimberg Reviewed-by: Benjamin Coddington Signed-off-by: Anna Schumaker

nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg

2024-07-08T17:47:27+00:00

An administrator cannot take action on these messages, but the reported errors might be helpful for troubleshooting. Transition them to trace points so these events appear in the trace log and can be easily lined up with other traced NFS client operations. Examples: append_writer-6147 [000] 80.247393: bl_pr_key_reg: dev=8,0 (sda) key=0x6675bfcf59112e98 append_writer-6147 [000] 80.247842: bl_pr_key_unreg: dev=8,0 (sda) key=0x6675bfcf59112e98 umount.nfs4-6172 [002] 84.950409: bl_pr_key_unreg_err: dev=8,0 (sda) key=0x6675bfcf59112e98 status=RESERVATION_CONFLICT Reviewed-by: Benjamin Coddington Reviewed-by: Christoph Hellwig Signed-off-by: Chuck Lever Signed-off-by: Anna Schumaker

nfs/blocklayout: Report only when /no/ device is found

2024-07-08T17:47:27+00:00

Since commit f931d8374cad ("nfs/blocklayout: refactor block device opening"), an error is reported when no multi-path device is found. But this isn't a fatal error if the subsequent device open is successful. On systems without multi-path devices, this message always appears whether there is a problem or not. Instead, generate less system journal noise by reporting an error only when both open attempts fail. The new error message is more actionable since it indicates that there is a real configuration issue to be addressed. Reviewed-by: Christoph Hellwig Reviewed-by: Benjamin Coddington Signed-off-by: Chuck Lever Signed-off-by: Anna Schumaker

nfs/blocklayout: Fix premature PR key unregistration

2024-07-08T17:47:27+00:00

During generic/069 runs with pNFS SCSI layouts, the NFS client emits the following in the system journal: kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2) kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3) kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2) kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3) kernel: sd 6:0:0:1: reservation conflict kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00 kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2 kernel: sd 6:0:0:1: reservation conflict kernel: sd 6:0:0:1: reservation conflict kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00 kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00 kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0 kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0 systemd[1]: fstests-generic-069.scope: Deactivated successfully. systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time. systemd[1]: media-test.mount: Deactivated successfully. systemd[1]: media-scratch.mount: Deactivated successfully. kernel: sd 6:0:0:1: reservation conflict kernel: failed to unregister PR key. This appears to be due to a race. bl_alloc_lseg() calls this: 561 static struct nfs4_deviceid_node * 562 bl_find_get_deviceid(struct nfs_server *server, 563 const struct nfs4_deviceid *id, const struct cred *cred, 564 gfp_t gfp_mask) 565 { 566 struct nfs4_deviceid_node *node; 567 unsigned long start, end; 568 569 retry: 570 node = nfs4_find_get_deviceid(server, id, cred, gfp_mask); 571 if (!node) 572 return ERR_PTR(-ENODEV); nfs4_find_get_deviceid() does a lookup without the spin lock first. If it can't find a matching deviceid, it creates a new device_info (which calls bl_alloc_deviceid_node, and that registers the device's PR key). Then it takes the nfs4_deviceid_lock and looks up the deviceid again. If it finds it this time, bl_find_get_deviceid() frees the spare (new) device_info, which unregisters the PR key for the same device. Any subsequent I/O from this client on that device gets EBADE. The umount later unregisters the device's PR key again. To prevent this problem, register the PR key after the deviceid_node lookup. Signed-off-by: Christoph Hellwig Signed-off-by: Chuck Lever Reviewed-by: Christoph Hellwig Reviewed-by: Benjamin Coddington Signed-off-by: Anna Schumaker