summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2021-04-29io_uring: fix unchecked error in switch_start()Pavel Begunkov1-1/+3
io_rsrc_node_switch_start() can fail, don't forget to check returned error code. Reported-by: syzbot+a4715dd4b7c866136f79@syzkaller.appspotmail.com Fixes: eae071c9b4cef ("io_uring: prepare fixed rw for dynanic buffers") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c4c06e2f3f0c8e43bd8d0a266c79055bcc6b6e60.1619693112.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29io_uring: allow empty slots for reg buffersPavel Begunkov1-7/+29
Allow empty reg buffer slots any request using which should fail. This allows users to not register all buffers in advance, but do it lazily and/or on demand via updates. That is achieved by setting iov_base and iov_len to zero for registration and/or buffer updates. Empty buffer can't have a non-zero tag. Implementation details: to not add extra overhead to io_import_fixed(), create a dummy buffer crafted to fail any request using it, and set it to all empty buffer slots. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7e95e4d700082baaf010c648c72ac764c9cc8826.1619611868.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29io_uring: add more build check for uapiPavel Begunkov1-0/+7
Add a couple of BUILD_BUG_ON() checking some rsrc uapi structs and SQE flags. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ff960df4d5026b9fb5bfd80994b9d3667d3926da.1619536280.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29io_uring: dont overlap internal and user req flagsPavel Begunkov1-6/+3
CQE flags take one byte that we store in req->flags together with other REQ_F_* internal flags. CQE flags are copied directly into req and then verified that requires some handling on failures, e.g. to make sure that that copy doesn't set some of the internal flags. Move all internal flags to take bits after the first byte, so we don't need extra handling and make it safer overall. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b8b5b02d1ab9d786fcc7db4a3fe86db6b70b8987.1619536280.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29io_uring: fix drain with rsrc CQEsPavel Begunkov1-0/+1
Resource emitted CQEs are not bound to requests, so fix up counters used for DRAIN/defer logic. Fixes: b60c8dce33895 ("io_uring: preparation for rsrc tagging") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2b32f5f0a40d5928c3466d028f936e167f0654be.1619536280.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29Merge tag 'net-next-5.13' of ↵Linus Torvalds4-51/+24
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - bpf: - allow bpf programs calling kernel functions (initially to reuse TCP congestion control implementations) - enable task local storage for tracing programs - remove the need to store per-task state in hash maps, and allow tracing programs access to task local storage previously added for BPF_LSM - add bpf_for_each_map_elem() helper, allowing programs to walk all map elements in a more robust and easier to verify fashion - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT redirection - lpm: add support for batched ops in LPM trie - add BTF_KIND_FLOAT support - mostly to allow use of BTF on s390 which has floats in its headers files - improve BPF syscall documentation and extend the use of kdoc parsing scripts we already employ for bpf-helpers - libbpf, bpftool: support static linking of BPF ELF files - improve support for encapsulation of L2 packets - xdp: restructure redirect actions to avoid a runtime lookup, improving performance by 4-8% in microbenchmarks - xsk: build skb by page (aka generic zerocopy xmit) - improve performance of software AF_XDP path by 33% for devices which don't need headers in the linear skb part (e.g. virtio) - nexthop: resilient next-hop groups - improve path stability on next-hops group changes (incl. offload for mlxsw) - ipv6: segment routing: add support for IPv4 decapsulation - icmp: add support for RFC 8335 extended PROBE messages - inet: use bigger hash table for IP ID generation - tcp: deal better with delayed TX completions - make sure we don't give up on fast TCP retransmissions only because driver is slow in reporting that it completed transmitting the original - tcp: reorder tcp_congestion_ops for better cache locality - mptcp: - add sockopt support for common TCP options - add support for common TCP msg flags - include multiple address ids in RM_ADDR - add reset option support for resetting one subflow - udp: GRO L4 improvements - improve 'forward' / 'frag_list' co-existence with UDP tunnel GRO, allowing the first to take place correctly even for encapsulated UDP traffic - micro-optimize dev_gro_receive() and flow dissection, avoid retpoline overhead on VLAN and TEB GRO - use less memory for sysctls, add a new sysctl type, to allow using u8 instead of "int" and "long" and shrink networking sysctls - veth: allow GRO without XDP - this allows aggregating UDP packets before handing them off to routing, bridge, OvS, etc. - allow specifing ifindex when device is moved to another namespace - netfilter: - nft_socket: add support for cgroupsv2 - nftables: add catch-all set element - special element used to define a default action in case normal lookup missed - use net_generic infra in many modules to avoid allocating per-ns memory unnecessarily - xps: improve the xps handling to avoid potential out-of-bound accesses and use-after-free when XPS change race with other re-configuration under traffic - add a config knob to turn off per-cpu netdev refcnt to catch underflows in testing Device APIs: - add WWAN subsystem to organize the WWAN interfaces better and hopefully start driving towards more unified and vendor- independent APIs - ethtool: - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt support) - allow network drivers to dump arbitrary SFP EEPROM data, current offset+length API was a poor fit for modern SFP which define EEPROM in terms of pages (incl. mlx5 support) - act_police, flow_offload: add support for packet-per-second policing (incl. offload for nfp) - psample: add additional metadata attributes like transit delay for packets sampled from switch HW (and corresponding egress and policy-based sampling in the mlxsw driver) - dsa: improve support for sandwiched LAGs with bridge and DSA - netfilter: - flowtable: use direct xmit in topologies with IP forwarding, bridging, vlans etc. - nftables: counter hardware offload support - Bluetooth: - improvements for firmware download w/ Intel devices - add support for reading AOSP vendor capabilities - add support for virtio transport driver - mac80211: - allow concurrent monitor iface and ethernet rx decap - set priority and queue mapping for injected frames - phy: add support for Clause-45 PHY Loopback - pci/iov: add sysfs MSI-X vector assignment interface to distribute MSI-X resources to VFs (incl. mlx5 support) New hardware/drivers: - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit interfaces. - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and BCM63xx switches - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches - ath11k: support for QCN9074 a 802.11ax device - Bluetooth: Broadcom BCM4330 and BMC4334 - phy: Marvell 88X2222 transceiver support - mdio: add BCM6368 MDIO mux bus controller - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips - mana: driver for Microsoft Azure Network Adapter (MANA) - Actions Semi Owl Ethernet MAC - can: driver for ETAS ES58X CAN/USB interfaces Pure driver changes: - add XDP support to: enetc, igc, stmmac - add AF_XDP support to: stmmac - virtio: - page_to_skb() use build_skb when there's sufficient tailroom (21% improvement for 1000B UDP frames) - support XDP even without dedicated Tx queues - share the Tx queues with the stack when necessary - mlx5: - flow rules: add support for mirroring with conntrack, matching on ICMP, GTP, flex filters and more - support packet sampling with flow offloads - persist uplink representor netdev across eswitch mode changes - allow coexistence of CQE compression and HW time-stamping - add ethtool extended link error state reporting - ice, iavf: support flow filters, UDP Segmentation Offload - dpaa2-switch: - move the driver out of staging - add spanning tree (STP) support - add rx copybreak support - add tc flower hardware offload on ingress traffic - ionic: - implement Rx page reuse - support HW PTP time-stamping - octeon: support TC hardware offloads - flower matching on ingress and egress ratelimitting. - stmmac: - add RX frame steering based on VLAN priority in tc flower - support frame preemption (FPE) - intel: add cross time-stamping freq difference adjustment - ocelot: - support forwarding of MRP frames in HW - support multiple bridges - support PTP Sync one-step timestamping - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like learning, flooding etc. - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350, SC7280 SoCs) - mt7601u: enable TDLS support - mt76: - add support for 802.3 rx frames (mt7915/mt7615) - mt7915 flash pre-calibration support - mt7921/mt7663 runtime power management fixes" * tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits) net: selftest: fix build issue if INET is disabled net: netrom: nr_in: Remove redundant assignment to ns net: tun: Remove redundant assignment to ret net: phy: marvell: add downshift support for M88E1240 net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255 net/sched: act_ct: Remove redundant ct get and check icmp: standardize naming of RFC 8335 PROBE constants bpf, selftests: Update array map tests for per-cpu batched ops bpf: Add batched ops support for percpu array bpf: Implement formatted output helpers with bstr_printf seq_file: Add a seq_bprintf function sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues net:nfc:digital: Fix a double free in digital_tg_recv_dep_req net: fix a concurrency bug in l2tp_tunnel_register() net/smc: Remove redundant assignment to rc mpls: Remove redundant assignment to err llc2: Remove redundant assignment to rc net/tls: Remove redundant initialization of record rds: Remove redundant assignment to nr_sig dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0 ...
2021-04-29Merge tag 'fsnotify_for_v5.13-rc1' of ↵Linus Torvalds12-139/+395
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: - support for limited fanotify functionality for unpriviledged users - faster merging of fanotify events - a few smaller fsnotify improvements * tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: shmem: allow reporting fanotify events with file handles on tmpfs fs: introduce a wrapper uuid_to_fsid() fanotify_user: use upper_32_bits() to verify mask fanotify: support limited functionality for unprivileged users fanotify: configurable limits via sysfs fanotify: limit number of event merge attempts fsnotify: use hash table for faster events merge fanotify: mix event info and pid into merge key hash fanotify: reduce event objectid to 29-bit hash fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
2021-04-29Merge tag 'for_v5.13-rc1' of ↵Linus Torvalds6-59/+139
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota, ext2, reiserfs updates from Jan Kara: - support for path (instead of device) based quotactl syscall (quotactl_path(2)) - ext2 conversion to kmap_local() - other minor cleanups & fixes * tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fs/reiserfs/journal.c: delete useless variables fs/ext2: Replace kmap() with kmap_local_page() ext2: Match up ext2_put_page() with ext2_dotdot() and ext2_find_entry() fs/ext2/: fix misspellings using codespell tool quota: report warning limits for realtime space quotas quota: wire up quotactl_path quota: Add mountpath based quota support
2021-04-29Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds91-1302/+1458
Pull xfs updates from Darrick Wong: "The notable user-visible addition this cycle is ability to remove space from the last AG in a filesystem. This is the first of many changes needed for full-fledged support for shrinking a filesystem. Still needed are (a) the ability to reorganize files and metadata away from the end of the fs; (b) the ability to remove entire allocation groups; (c) shrink support for realtime volumes; and (d) thorough testing of (a-c). There are a number of performance improvements in this code drop: Dave streamlined various parts of the buffer logging code and reduced the cost of various debugging checks, and added the ability to pre-create the xattr structures while creating files. Brian eliminated transaction reservations that were being held across writeback (thus reducing livelock potential. Other random pieces: Pavel fixed the repetitve warnings about deprecated mount options, I fixed online fsck to behave itself when a readonly remount comes in during scrub, and refactored various other parts of that code, Christoph contributed a lot of refactoring this cycle. The xfs_icdinode structure has been absorbed into the (incore) xfs_inode structure, and the format and flags handling around xfs_inode_fork structures has been simplified. Chandan provided a number of fixes for extent count overflow related problems that have been shaken out by debugging knobs added during 5.12. Summary: - Various minor fixes in online scrub. - Prevent metadata files from being automatically inactivated. - Validate btree heights by the computed per-btree limits. - Don't warn about remounting with deprecated mount options. - Initialize attr forks at create time if we suspect we're going to need to store them. - Reduce memory reallocation workouts in the logging code. - Fix some theoretical math calculation errors in logged buffers that span multiple discontig memory ranges but contiguous ondisk regions. - Speedups in dirty buffer bitmap handling. - Make type verifier functions more inline-happy to reduce overhead. - Reduce debug overhead in directory checking code. - Many many typo fixes. - Begin to handle the permanent loss of the very end of a filesystem. - Fold struct xfs_icdinode into xfs_inode. - Deprecate the long defunct BMV_IF_NO_DMAPI_READ from the bmapx ioctl. - Remove a broken directory block format check from online scrub. - Fix a bug where we could produce an unnecessarily tall data fork btree when creating an attr fork. - Fix scrub and readonly remounts racing. - Fix a writeback ioend log deadlock problem by dropping the behavior where we could preallocate a setfilesize transaction. - Fix some bugs in the new extent count checking code. - Fix some bugs in the attr fork preallocation code. - Refactor if_flags out of the incore inode fork data structure" * tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (77 commits) xfs: remove xfs_quiesce_attr declaration xfs: remove XFS_IFEXTENTS xfs: remove XFS_IFINLINE xfs: remove XFS_IFBROOT xfs: only look at the fork format in xfs_idestroy_fork xfs: simplify xfs_attr_remove_args xfs: rename and simplify xfs_bmap_one_block xfs: move the XFS_IFEXTENTS check into xfs_iread_extents xfs: drop unnecessary setfilesize helper xfs: drop unused ioend private merge and setfilesize code xfs: open code ioend needs workqueue helper xfs: drop submit side trans alloc for append ioends xfs: fix return of uninitialized value in variable error xfs: get rid of the ip parameter to xchk_setup_* xfs: fix scrub and remount-ro protection when running scrub xfs: move the check for post-EOF mappings into xfs_can_free_eofblocks xfs: move the xfs_can_free_eofblocks call under the IOLOCK xfs: precalculate default inode attribute offset xfs: default attr fork size does not handle device inodes xfs: inode fork allocation depends on XFS_IFEXTENT flag ...
2021-04-29Merge tag 'gfs2-for-5.13' of ↵Linus Torvalds23-243/+312
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Fix some compiler and kernel-doc warnings - Various minor cleanups and optimizations - Add a new sysfs gfs2 status file with some filesystem wide information * tag 'gfs2-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix fall-through warnings for Clang gfs2: Fix a number of kernel-doc warnings gfs2: Make gfs2_setattr_simple static gfs2: Add new sysfs file for gfs2 status gfs2: Silence possible null pointer dereference warning gfs2: Turn gfs2_meta_indirect_buffer into gfs2_meta_buffer gfs2: Replace gfs2_lblk_to_dblk with gfs2_get_extent gfs2: Turn gfs2_extent_map into gfs2_{get,alloc}_extent gfs2: Add new gfs2_iomap_get helper gfs2: Remove unused variable sb_format gfs2: Fix dir.c function parameter descriptions gfs2: Eliminate gh parameter from go_xmote_bh func gfs2: don't create empty buffers for NO_CREATE
2021-04-29Merge tag 'exfat-for-5.13-rc1' of ↵Linus Torvalds8-35/+206
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat updates from Namjae Jeon: - Improve write performance with dirsync mount option - Improve lookup performance - Add support for FITRIM ioctl - Fix a bug with discard option * tag 'exfat-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: speed up iterate/lookup by fixing start point of traversing cluster chain exfat: improve write performance when dirsync enabled exfat: add support ioctl and FITRIM function exfat: introduce bitmap_lock for cluster bitmap access exfat: fix erroneous discard when clear cluster bit
2021-04-29xfs: fix xfs_reflink_unshare usage of filemap_write_and_wait_rangeDarrick J. Wong1-1/+2
The final parameter of filemap_write_and_wait_range is the end of the range to flush, not the length of the range to flush. Fixes: 46afb0628b86 ("xfs: only flush the unshared range in xfs_reflink_unshare") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-29xfs: set aside allocation btree blocks from block reservationBrian Foster1-1/+14
The blocks used for allocation btrees (bnobt and countbt) are technically considered free space. This is because as free space is used, allocbt blocks are removed and naturally become available for traditional allocation. However, this means that a significant portion of free space may consist of in-use btree blocks if free space is severely fragmented. On large filesystems with large perag reservations, this can lead to a rare but nasty condition where a significant amount of physical free space is available, but the majority of actual usable blocks consist of in-use allocbt blocks. We have a record of a (~12TB, 32 AG) filesystem with multiple AGs in a state with ~2.5GB or so free blocks tracked across ~300 total allocbt blocks, but effectively at 100% full because the the free space is entirely consumed by refcountbt perag reservation. Such a large perag reservation is by design on large filesystems. The problem is that because the free space is so fragmented, this AG contributes the 300 or so allocbt blocks to the global counters as free space. If this pattern repeats across enough AGs, the filesystem lands in a state where global block reservation can outrun physical block availability. For example, a streaming buffered write on the affected filesystem continues to allow delayed allocation beyond the point where writeback starts to fail due to physical block allocation failures. The expected behavior is for the delalloc block reservation to fail gracefully with -ENOSPC before physical block allocation failure is a possibility. To address this problem, set aside in-use allocbt blocks at reservation time and thus ensure they cannot be reserved until truly available for physical allocation. This allows alloc btree metadata to continue to reside in free space, but dynamically adjusts reservation availability based on internal state. Note that the logic requires that the allocbt counter is fully populated at reservation time before it is fully effective. We currently rely on the mount time AGF scan in the perag reservation initialization code for this dependency on filesystems where it's most important (i.e. with active perag reservations). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-29xfs: introduce in-core global counter of allocbt blocksBrian Foster3-0/+22
Introduce an in-core counter to track the sum of all allocbt blocks used by the filesystem. This value is currently tracked per-ag via the ->agf_btreeblks field in the AGF, which also happens to include rmapbt blocks. A global, in-core count of allocbt blocks is required to identify the subset of global ->m_fdblocks that consists of unavailable blocks currently used for allocation btrees. To support this calculation at block reservation time, construct a similar global counter for allocbt blocks, populate it on first read of each AGF and update it as allocbt blocks are used and released. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-29xfs: unconditionally read all AGFs on mounts with perag reservationBrian Foster1-11/+23
perag reservation is enabled at mount time on a per AG basis. The upcoming change to set aside allocbt blocks from block reservation requires a populated allocbt counter as soon as possible after mount to be fully effective against large perag reservations. Therefore as a preparation step, initialize the pagf on all mounts where at least one reservation is active. Note that this already occurs to some degree on most default format filesystems as reservation requirement calculations already depend on the AGF or AGI, depending on the reservation type. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-29xfs: count free space btree blocks when scrubbing pre-lazysbcount fsesDarrick J. Wong1-1/+38
Since agf_btreeblks didn't exist before the lazysbcount feature, the fs summary count scrubber needs to walk the free space btrees to determine the amount of space being used by those btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-29xfs: update superblock counters correctly for !lazysbcountDave Chinner2-3/+16
Keep the mount superblock counters up to date for !lazysbcount filesystems so that when we log the superblock they do not need updating in any way because they are already correct. It's found by what Zorro reported: 1. mkfs.xfs -f -l lazy-count=0 -m crc=0 $dev 2. mount $dev $mnt 3. fsstress -d $mnt -p 100 -n 1000 (maybe need more or less io load) 4. umount $mnt 5. xfs_repair -n $dev and I've seen no problem with this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reported-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-29xfs: don't check agf_btreeblks on pre-lazysbcount filesystemsDarrick J. Wong2-2/+8
The AGF free space btree block counter wasn't added until the lazysbcount feature was added to XFS midway through the life of the V4 format, so ignore the field when checking. Online AGF repair requires rmapbt, so it doesn't need the feature check. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-04-29xfs: remove obsolete AGF counter debuggingDarrick J. Wong6-31/+0
In commit f8f2835a9cf3 we changed the behavior of XFS to use EFIs to remove blocks from an overfilled AGFL because there were complaints about transaction overruns that stemmed from trying to free multiple blocks in a single transaction. Unfortunately, that commit missed a subtlety in the debug-mode transaction accounting when a realtime volume is attached. If a realtime file undergoes a data fork mapping change such that realtime extents are allocated (or freed) in the same transaction that a data device block is also allocated (or freed), we can trip a debugging assertion. This can happen (for example) if a realtime extent is allocated and it is necessary to reshape the bmbt to hold the new mapping. When we go to allocate a bmbt block from an AG, the first thing the data device block allocator does is ensure that the freelist is the proper length. If the freelist is too long, it will trim the freelist to the proper length. In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to record the decrement in the AG free list count. Prior to f8f28 we would put the free block back in the free space btrees in the same transaction, which calls xfs_trans_agblocks_delta() to record the increment in the AG free block count. Since AGFL blocks are included in the global free block count (fdblocks), there is no corresponding fdblocks update, so the AGFL free satisfies the following condition in xfs_trans_apply_sb_deltas: /* * Check that superblock mods match the mods made to AGF counters. */ ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) == (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta + tp->t_ag_btree_delta)); The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is the number blocks that were allocated. After commit f8f28 we defer the block freeing to the next chained transaction, which means that the calls to xfs_trans_agflist_delta and xfs_trans_agblocks_delta occur in separate transactions. The (first) transaction that shortens the free list trips on the comparison, which has now become: (X + 0) == ((X) + -1 + 0) because we haven't freed the AGFL block yet; we've only logged an intention to free it. When the second transaction (the deferred free) commits, it will evaluate the expression as: (0 + 0) == (1 + 0 + 0) and trip over that in turn. At this point, the astute reader may note that the two commits tagged by this patch have been in the kernel for a long time but haven't generated any bug reports. How is it that the author became aware of this bug? This originally surfaced as an intermittent failure when I was testing realtime rmap, but a different bug report by Zorro Lang reveals the same assertion occuring on !lazysbcount filesystems. The common factor to both reports (and why this problem wasn't previously reported) becomes apparent if we consider when xfs_trans_apply_sb_deltas is called by __xfs_trans_commit(): if (tp->t_flags & XFS_TRANS_SB_DIRTY) xfs_trans_apply_sb_deltas(tp); With a modern lazysbcount filesystem, transactions update only the percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence xfs_trans_apply_sb_deltas is rarely called. However, updates to the count of free realtime extents are not part of lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or removing data fork mappings to realtime files; similarly, XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems. Dave mentioned in response to an earlier version of this patch: "IIUC, what you are saying is that this debug code is simply not exercised in normal testing and hasn't been for the past decade? And it still won't be exercised on anything other than realtime device testing? "...it was debugging code from 1994 that was largely turned into dead code when lazysbcounters were introduced in 2007. Hence I'm not sure it holds any value anymore." This debugging code isn't especially helpful - you can modify the flcount on one AG and the freeblks of another AG, and it won't trigger. Add the fact that nobody noticed for a decade, and let's just get rid of it (and start testing realtime :P). This bug was found by running generic/051 on either a V4 filesystem lacking lazysbcount; or a V5 filesystem with a realtime volume. Cc: bfoster@redhat.com, zlang@redhat.com Fixes: f8f2835a9cf3 ("xfs: defer agfl block frees when dfops is available") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-29orangefs: leave files in the page cache for a few micro seconds at leastMike Marshall1-1/+1
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2021-04-29Merge tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-blockLinus Torvalds3-1429/+1510
Pull io_uring updates from Jens Axboe: - Support for multi-shot mode for POLL requests - More efficient reference counting. This is shamelessly stolen from the mm side. Even though referencing is mostly single/dual user, the 128 count was retained to keep the code the same. Maybe this should/could be made generic at some point. - Removal of the need to have a manager thread for each ring. The manager threads only job was checking and creating new io-threads as needed, instead we handle this from the queue path. - Allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE. Since 5.12, this thread is "just" a regular application thread, so no need to restrict use of it anymore. - Cleanup of how internal async poll data lifetime is managed. - Fix for syzbot reported crash on SQPOLL cancelation. - Make buffer registration more like file registrations, which includes flexibility in avoiding full set unregistration and re-registration. - Fix for io-wq affinity setting. - Be a bit more defensive in task->pf_io_worker setup. - Various SQPOLL fixes. - Cleanup of SQPOLL creds handling. - Improvements to in-flight request tracking. - File registration cleanups. - Tons of cleanups and little fixes * tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block: (156 commits) io_uring: maintain drain logic for multishot poll requests io_uring: Check current->io_uring in io_uring_cancel_sqpoll io_uring: fix NULL reg-buffer io_uring: simplify SQPOLL cancellations io_uring: fix work_exit sqpoll cancellations io_uring: Fix uninitialized variable up.resv io_uring: fix invalid error check after malloc io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker kernel: always initialize task->pf_io_worker to NULL io_uring: update sq_thread_idle after ctx deleted io_uring: add full-fledged dynamic buffers support io_uring: implement fixed buffers registration similar to fixed files io_uring: prepare fixed rw for dynanic buffers io_uring: keep table of pointers to ubufs io_uring: add generic rsrc update with tags io_uring: add IORING_REGISTER_RSRC io_uring: enumerate dynamic resources io_uring: add generic path for rsrc update io_uring: preparation for rsrc tagging io_uring: decouple CQE filling from requests ...
2021-04-29Merge tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-blockLinus Torvalds1-9/+0
Pull block driver updates from Jens Axboe: - MD changes via Song: - raid5 POWER fix - raid1 failure fix - UAF fix for md cluster - mddev_find_or_alloc() clean up - Fix NULL pointer deref with external bitmap - Performance improvement for raid10 discard requests - Fix missing information of /proc/mdstat - rsxx const qualifier removal (Arnd) - Expose allocated brd pages (Calvin) - rnbd via Gioh Kim: - Change maintainer - Change domain address of maintainers' email - Add polling IO mode and document update - Fix memory leak and some bug detected by static code analysis tools - Code refactoring - Series of floppy cleanups/fixes (Denis) - s390 dasd fixes (Julian) - kerneldoc fixes (Lee) - null_blk double free (Lv) - null_blk virtual boundary addition (Max) - Remove xsysace driver (Michal) - umem driver removal (Davidlohr) - ataflop fixes (Dan) - Revalidate disk removal (Christoph) - Bounce buffer cleanups (Christoph) - Mark lightnvm as deprecated (Christoph) - mtip32xx init cleanups (Shixin) - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang) * tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block: (143 commits) async_xor: increase src_offs when dropping destination page drivers/block/null_blk/main: Fix a double free in null_init. md/raid1: properly indicate failure when ending a failed write request md-cluster: fix use-after-free issue when removing rdev nvme: introduce generic per-namespace chardev nvme: cleanup nvme_configure_apst nvme: do not try to reconfigure APST when the controller is not live nvme: add 'kato' sysfs attribute nvme: sanitize KATO setting nvmet: avoid queuing keep-alive timer if it is disabled brd: expose number of allocated pages in debugfs ataflop: fix off by one in ataflop_probe() ataflop: potential out of bounds in do_format() drbd: Fix fall-through warnings for Clang block/rnbd: Use strscpy instead of strlcpy block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name block/rnbd-clt: Remove max_segment_size block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev Documentation/ABI/rnbd-clt: Add description for nr_poll_queues ...
2021-04-29Merge tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-blockLinus Torvalds1-4/+6
Pull block updates from Jens Axboe: "Pretty quiet round this time, which is nice. In detail: - Series revamping bounce buffer support (Christoph) - Dead code removal (Christoph, Bart) - Partition iteration revamp, now using xarray (Christoph) - Passthrough request scheduler improvements (Lin) - Series of BFQ improvements (Paolo) - Fix ioprio task iteration (Peter) - Various little tweaks and fixes (Tejun, Saravanan, Bhaskar, Max, Nikolay)" * tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block: (41 commits) blk-iocost: don't ignore vrate_min on QD contention blk-mq: Fix spurious debugfs directory creation during initialization bfq/mq-deadline: remove redundant check for passthrough request blk-mq: bypass IO scheduler's limit_depth for passthrough request block: Remove an obsolete comment from sg_io() block: move bio_list_copy_data to pktcdvd block: remove zero_fill_bio_iter block: add queue_to_disk() to get gendisk from request_queue block: remove an incorrect check from blk_rq_append_bio block: initialize ret in bdev_disk_changed block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration block: remove disk_part_iter block: simplify diskstats_show block: simplify show_partition block: simplify printk_all_partitions block: simplify partition_overlaps block: simplify partition removal block: take bd_mutex around delete_partitions in del_gendisk block: refactor blk_drop_partitions block: move more syncing and invalidation to delete_partition ...
2021-04-28Merge tag 'sched-core-2021-04-28' of ↵Linus Torvalds1-0/+91
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and debugfs interfaces to a unified debugfs interface. - Signals: Allow caching one sigqueue object per task, to improve performance & latencies. - Improve newidle_balance() irq-off latencies on systems with a large number of CPU cgroups. - Improve energy-aware scheduling - Improve the PELT metrics for certain workloads - Reintroduce select_idle_smt() to improve load-balancing locality - but without the previous regressions - Add 'scheduler latency debugging': warn after long periods of pending need_resched. This is an opt-in feature that requires the enabling of the LATENCY_WARN scheduler feature, or the use of the resched_latency_warn_ms=xx boot parameter. - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix remaining balance_push() vs. hotplug holes/races - PSI fixes, plus allow /proc/pressure/ files to be written by CAP_SYS_RESOURCE tasks as well - Fix/improve various load-balancing corner cases vs. capacity margins - Fix sched topology on systems with NUMA diameter of 3 or above - Fix PF_KTHREAD vs to_kthread() race - Minor rseq optimizations - Misc cleanups, optimizations, fixes and smaller updates * tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits) cpumask/hotplug: Fix cpu_dying() state tracking kthread: Fix PF_KTHREAD vs to_kthread() race sched/debug: Fix cgroup_path[] serialization sched,psi: Handle potential task count underflow bugs more gracefully sched: Warn on long periods of pending need_resched sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to simplify the code & fix an unused function warning sched/debug: Rename the sched_debug parameter to sched_verbose sched,fair: Alternative sched_slice() sched: Move /proc/sched_debug to debugfs sched,debug: Convert sysctl sched_domains to debugfs debugfs: Implement debugfs_create_str() sched,preempt: Move preempt_dynamic to debug.c sched: Move SCHED_DEBUG sysctl to debugfs sched: Don't make LATENCYTOP select SCHED_DEBUG sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG sched: Use cpu_dying() to fix balance_push vs hotplug-rollback cpumask: Introduce DYING mask cpumask: Make cpu_{online,possible,present,active}() inline rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs() ...
2021-04-28Merge tag 'perf-core-2021-04-28' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event updates from Ingo Molnar: - Improve Intel uncore PMU support: - Parse uncore 'discovery tables' - a new hardware capability enumeration method introduced on the latest Intel platforms. This table is in a well-defined PCI namespace location and is read via MMIO. It is organized in an rbtree. These uncore tables will allow the discovery of standard counter blocks, but fancier counters still need to be enumerated explicitly. - Add Alder Lake support - Improve IIO stacks to PMON mapping support on Skylake servers - Add Intel Alder Lake PMU support - which requires the introduction of 'hybrid' CPUs and PMUs. Alder Lake is a mix of Golden Cove ('big') and Gracemont ('small' - Atom derived) cores. The CPU-side feature set is entirely symmetrical - but on the PMU side there's core type dependent PMU functionality. - Reduce data loss with CPU level hardware tracing on Intel PT / AUX profiling, by fixing the AUX allocation watermark logic. - Improve ring buffer allocation on NUMA systems - Put 'struct perf_event' into their separate kmem_cache pool - Add support for synchronous signals for select perf events. The immediate motivation is to support low-overhead sampling-based race detection for user-space code. The feature consists of the following main changes: - Add thread-only event inheritance via perf_event_attr::inherit_thread, which limits inheritance of events to CLONE_THREAD. - Add the ability for events to not leak through exec(), via perf_event_attr::remove_on_exec. - Allow the generation of SIGTRAP via perf_event_attr::sigtrap, extend siginfo with an u64 ::si_perf, and add the breakpoint information to ::si_addr and ::si_perf if the event is PERF_TYPE_BREAKPOINT. The siginfo support is adequate for breakpoints right now - but the new field can be used to introduce support for other types of metadata passed over siginfo as well. - Misc fixes, cleanups and smaller updates. * tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) signal, perf: Add missing TRAP_PERF case in siginfo_layout() signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures perf/x86: Allow for 8<num_fixed_counters<16 perf/x86/rapl: Add support for Intel Alder Lake perf/x86/cstate: Add Alder Lake CPU support perf/x86/msr: Add Alder Lake CPU support perf/x86/intel/uncore: Add Alder Lake support perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE perf/x86/intel: Add Alder Lake Hybrid support perf/x86: Support filter_match callback perf/x86/intel: Add attr_update for Hybrid PMUs perf/x86: Add structures for the attributes of Hybrid PMUs perf/x86: Register hybrid PMUs perf/x86: Factor out x86_pmu_show_pmu_cap perf/x86: Remove temporary pmu assignment in event_init perf/x86/intel: Factor out intel_pmu_check_extra_regs perf/x86/intel: Factor out intel_pmu_check_event_constraints perf/x86/intel: Factor out intel_pmu_check_num_counters perf/x86: Hybrid PMU support for extra_regs perf/x86: Hybrid PMU support for event constraints ...
2021-04-28btrfs: fix deadlock when cloning inline extents and using qgroupsFilipe Manana5-7/+7
There are a few exceptional cases where cloning an inline extent needs to copy the inline extent data into a page of the destination inode. When this happens, we end up starting a transaction while having a dirty page for the destination inode and while having the range locked in the destination's inode iotree too. Because when reserving metadata space for a transaction we may need to flush existing delalloc in case there is not enough free space, we have a mechanism in place to prevent a deadlock, which was introduced in commit 3d45f221ce627d ("btrfs: fix deadlock when cloning inline extent and low on free metadata space"). However when using qgroups, a transaction also reserves metadata qgroup space, which can also result in flushing delalloc in case there is not enough available space at the moment. When this happens we deadlock, since flushing delalloc requires locking the file range in the inode's iotree and the range was already locked at the very beginning of the clone operation, before attempting to start the transaction. When this issue happens, stack traces like the following are reported: [72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000 [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142) [72747.556271] Call Trace: [72747.556273] __schedule+0x296/0x760 [72747.556277] schedule+0x3c/0xa0 [72747.556279] io_schedule+0x12/0x40 [72747.556284] __lock_page+0x13c/0x280 [72747.556287] ? generic_file_readonly_mmap+0x70/0x70 [72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs] [72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160 [72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs] [72747.556362] ? update_group_capacity+0x25/0x210 [72747.556366] ? cpumask_next_and+0x1a/0x20 [72747.556391] extent_writepages+0x44/0xa0 [btrfs] [72747.556394] do_writepages+0x41/0xd0 [72747.556398] __writeback_single_inode+0x39/0x2a0 [72747.556403] writeback_sb_inodes+0x1ea/0x440 [72747.556407] __writeback_inodes_wb+0x5f/0xc0 [72747.556410] wb_writeback+0x235/0x2b0 [72747.556414] ? get_nr_inodes+0x35/0x50 [72747.556417] wb_workfn+0x354/0x490 [72747.556420] ? newidle_balance+0x2c5/0x3e0 [72747.556424] process_one_work+0x1aa/0x340 [72747.556426] worker_thread+0x30/0x390 [72747.556429] ? create_worker+0x1a0/0x1a0 [72747.556432] kthread+0x116/0x130 [72747.556435] ? kthread_park+0x80/0x80 [72747.556438] ret_from_fork+0x1f/0x30 [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [72747.566961] Call Trace: [72747.566964] __schedule+0x296/0x760 [72747.566968] ? finish_wait+0x80/0x80 [72747.566970] schedule+0x3c/0xa0 [72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs] [72747.566999] ? finish_wait+0x80/0x80 [72747.567024] lock_extent_bits+0x37/0x90 [btrfs] [72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs] [72747.567051] ? find_get_pages_range_tag+0x2cd/0x380 [72747.567076] __extent_writepage+0x203/0x320 [btrfs] [72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs] [72747.567106] ? update_load_avg+0x7e/0x5f0 [72747.567109] ? enqueue_entity+0xf4/0x6f0 [72747.567134] extent_writepages+0x44/0xa0 [btrfs] [72747.567137] ? enqueue_task_fair+0x93/0x6f0 [72747.567140] do_writepages+0x41/0xd0 [72747.567144] __filemap_fdatawrite_range+0xc7/0x100 [72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs] [72747.567200] process_one_work+0x1aa/0x340 [72747.567202] worker_thread+0x30/0x390 [72747.567205] ? create_worker+0x1a0/0x1a0 [72747.567208] kthread+0x116/0x130 [72747.567211] ? kthread_park+0x80/0x80 [72747.567214] ret_from_fork+0x1f/0x30 [72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000 [72747.569689] Call Trace: [72747.569691] __schedule+0x296/0x760 [72747.569694] schedule+0x3c/0xa0 [72747.569721] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569725] ? finish_wait+0x80/0x80 [72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs] [72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs] [72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs] [72747.569810] ? path_lookupat.isra.48+0x97/0x140 [72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs] [72747.569836] ? __kmalloc+0x16a/0x2c0 [72747.569839] do_iter_readv_writev+0x160/0x1c0 [72747.569843] do_iter_write+0x80/0x1b0 [72747.569847] vfs_writev+0x84/0x140 [72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs] [72747.569873] do_writev+0x65/0x100 [72747.569876] do_syscall_64+0x33/0x40 [72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000 [72747.569903] Call Trace: [72747.569906] __schedule+0x296/0x760 [72747.569909] schedule+0x3c/0xa0 [72747.569936] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569940] ? finish_wait+0x80/0x80 [72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs] [72747.569989] start_transaction+0x279/0x580 [btrfs] [72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs] [72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs] [72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs] [72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs] [72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs] [72747.570126] do_clone_file_range+0xed/0x200 [72747.570131] vfs_clone_file_range+0x37/0x110 [72747.570134] ioctl_file_clone+0x7d/0xb0 [72747.570137] do_vfs_ioctl+0x138/0x630 [72747.570140] __x64_sys_ioctl+0x62/0xc0 [72747.570143] do_syscall_64+0x33/0x40 [72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9 So fix this by skipping the flush of delalloc for an inode that is flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under such a special case of cloning an inline extent, when flushing delalloc during qgroup metadata reservation. The special cases for cloning inline extents were added in kernel 5.7 by by commit 05a5a7621ce66c ("Btrfs: implement full reflink support for inline extents"), while having qgroup metadata space reservation flushing delalloc when low on space was added in kernel 5.9 by commit c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable kernel backports. Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/ Fixes: c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28btrfs: fix race leading to unpersisted data and metadata on fsyncFilipe Manana2-11/+27
When doing a fast fsync on a file, there is a race which can result in the fsync returning success to user space without logging the inode and without durably persisting new data. The following example shows one possible scenario for this: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ touch /mnt/bar $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz # Now we have: # file bar == inode 257 # file baz == inode 258 $ mv /mnt/baz /mnt/foo # Now we have: # file bar == inode 257 # file foo == inode 258 $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo # fsync bar before foo, it is important to trigger the race. $ xfs_io -c "fsync" /mnt/bar $ xfs_io -c "fsync" /mnt/foo # After this: # inode 257, file bar, is empty # inode 258, file foo, has 1M filled with 0xcd <power failure> # Replay the log: $ mount /dev/sdc /mnt # After this point file foo should have 1M filled with 0xcd and not 0xab The following steps explain how the race happens: 1) Before the first fsync of inode 258, when it has the "baz" name, its ->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1. The inode also has the full sync flag set; 2) After the first fsync, we set inode 258 ->logged_trans to 6, which is the generation of the current transaction, and set ->last_log_commit to 0, which is the current value of ->last_sub_trans (done at btrfs_log_inode()). The full sync flag is cleared from the inode during the fsync. The log sub transaction that was committed had an ID of 0 and when we synced the log, at btrfs_sync_log(), we incremented root->log_transid from 0 to 1; 3) During the rename: We update inode 258, through btrfs_update_inode(), and that causes its ->last_sub_trans to be set to 1 (the current log transaction ID), and ->last_log_commit remains with a value of 0. After updating inode 258, because we have previously logged the inode in the previous fsync, we log again the inode through the call to btrfs_log_new_name(). This results in updating the inode's ->last_log_commit from 0 to 1 (the current value of its ->last_sub_trans). The ->last_sub_trans of inode 257 is updated to 1, which is the ID of the next log transaction; 4) Then a buffered write against inode 258 is made. This leaves the value of ->last_sub_trans as 1 (the ID of the current log transaction, stored at root->log_transid); 5) Then an fsync against inode 257 (or any other inode other than 258), happens. This results in committing the log transaction with ID 1, which results in updating root->last_log_commit to 1 and bumping root->log_transid from 1 to 2; 6) Then an fsync against inode 258 starts. We flush delalloc and wait only for writeback to complete, since the full sync flag is not set in the inode's runtime flags - we do not wait for ordered extents to complete. Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the ordered extent completes. The call returns true: static inline bool btrfs_inode_in_log(...) { bool ret = false; spin_lock(&inode->lock); if (inode->logged_trans == generation && inode->last_sub_trans <= inode->last_log_commit && inode->last_sub_trans <= inode->root->last_log_commit) ret = true; spin_unlock(&inode->lock); return ret; } generation has a value of 6 (fs_info->generation), ->logged_trans also has a value of 6 (set when we logged the inode during the first fsync and when logging it during the rename), ->last_sub_trans has a value of 1, set during the rename (step 3), ->last_log_commit also has a value of 1 (set in step 3) and root->last_log_commit has a value of 1, which was set in step 5 when fsyncing inode 257. As a consequence we don't log the inode, any new extents and do not sync the log, resulting in a data loss if a power failure happens after the fsync and before the current transaction commits. Also, because we do not log the inode, after a power failure the mtime and ctime of the inode do not match those we had before. When the ordered extent completes before we call btrfs_inode_in_log(), then the call returns false and we log the inode and sync the log, since at the end of ordered extent completion we update the inode and set ->last_sub_trans to 2 (the value of root->log_transid) and ->last_log_commit to 1. This problem is found after removing the check for the emptiness of the inode's list of modified extents in the recent commit 209ecbb8585bf6 ("btrfs: remove stale comment and logic from btrfs_inode_in_log()"), added in the 5.13 merge window. However checking the emptiness of the list is not really the way to solve this problem, and was never intended to, because while that solves the problem for COW writes, the problem persists for NOCOW writes because in that case the list is always empty. In the case of NOCOW writes, even though we wait for the writeback to complete before returning from btrfs_sync_file(), we end up not logging the inode, which has a new mtime/ctime, and because we don't sync the log, we never issue disk barriers (send REQ_PREFLUSH to the device) since that only happens when we sync the log (when we write super blocks at btrfs_sync_log()). So effectively, for a NOCOW case, when we return from btrfs_sync_file() to user space, we are not guaranteeing that the data is durably persisted on disk. Also, while the example above uses a rename exchange to show how the problem happens, it is not the only way to trigger it. An alternative could be adding a new hard link to inode 258, since that also results in calling btrfs_log_new_name() and updating the inode in the log. An example reproducer using the addition of a hard link instead of a rename operation: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ touch /mnt/bar $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/foo $ ln /mnt/foo /mnt/foo_link $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo $ xfs_io -c "fsync" /mnt/bar $ xfs_io -c "fsync" /mnt/foo <power failure> # Replay the log: $ mount /dev/sdc /mnt # After this point file foo often has 1M filled with 0xab and not 0xcd The reasons leading to the final fsync of file foo, inode 258, not persisting the new data are the same as for the previous example with a rename operation. So fix by never skipping logging and log syncing when there are still any ordered extents in flight. To avoid making the conditional if statement that checks if logging an inode is needed harder to read, place all the logic into an helper function with separate if statements to make it more manageable and easier to read. A test case for fstests will follow soon. For NOCOW writes, the problem existed before commit b5e6c3e170b770 ("btrfs: always wait on ordered extents at fsync time"), introduced in kernel 4.19, then it went away with that commit since we started to always wait for ordered extent completion before logging. The problem came back again once the fast fsync path was changed again to avoid waiting for ordered extent completion, in commit 487781796d3022 ("btrfs: make fast fsyncs wait only for writeback"), added in kernel 5.10. However, for COW writes, the race only happens after the recent commit 209ecbb8585bf6 ("btrfs: remove stale comment and logic from btrfs_inode_in_log()"), introduced in the 5.13 merge window. For NOCOW writes, the bug existed before that commit. So tag 5.10+ as the release for stable backports. CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28btrfs: do not consider send context as valid when trying to flush qgroupsFilipe Manana1-5/+9
At qgroup.c:try_flush_qgroup() we are asserting that current->journal_info is either NULL or has the value BTRFS_SEND_TRANS_STUB. However allowing for BTRFS_SEND_TRANS_STUB makes no sense because: 1) It is misleading, because send operations are read-only and do not ever need to reserve qgroup space; 2) We already assert that current->journal_info != BTRFS_SEND_TRANS_STUB at transaction.c:start_transaction(); 3) On a kernel without CONFIG_BTRFS_ASSERT=y set, it would result in a crash if try_flush_qgroup() is ever called in a send context, because at transaction.c:start_transaction we cast current->journal_info into a struct btrfs_trans_handle pointer and then dereference it. So just do allow a send context at try_flush_qgroup() and update the comment about it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28btrfs: zoned: fix silent data loss after failure splitting ordered extentFilipe Manana1-1/+1
On a zoned filesystem, sometimes we need to split an ordered extent into 3 different ordered extents. The original ordered extent is shortened, at the front and at the rear, and we create two other new ordered extents to represent the trimmed parts of the original ordered extent. After adjusting the original ordered extent, we create an ordered extent to represent the pre-range, and that may fail with ENOMEM for example. After that we always try to create the ordered extent for the post-range, and if that happens to succeed we end up returning success to the caller as we overwrite the 'ret' variable which contained the previous error. This means we end up with a file range for which there is no ordered extent, which results in the range never getting a new file extent item pointing to the new data location. And since the split operation did not return an error, writeback does not fail and the inode's mapping is not flagged with an error, resulting in a subsequent fsync not reporting an error either. It's possibly very unlikely to have the creation of the post-range ordered extent succeed after the creation of the pre-range ordered extent failed, but it's not impossible. So fix this by making sure we only create the post-range ordered extent if there was no error creating the ordered extent for the pre-range. Fixes: d22002fd37bd97 ("btrfs: zoned: split ordered extent when bio is sent") CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28Orangef: implement orangefs_readahead.Mike Marshall2-103/+53
Also remove open-coded readahead logic from orangefs_readpage. Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2021-04-28smb3.1.1: enable negotiating stronger encryption by defaultSteve French1-1/+1
Now that stronger encryption (gcm256) has been more broadly tested, and confirmed to work with multiple servers (Windows and Azure for example), enable it by default. Although gcm256 is the second choice we offer (after gcm128 which should be faster), this change allows mounts to server which are configured to require the strongest encryption to work (without changing a module load parameter). Signed-off-by: Steve French <stfrench@microsoft.com> Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2021-04-28Merge tag 'printk-for-5.13' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Stop synchronizing kernel log buffer readers by logbuf_lock. As a result, the access to the buffer is fully lockless now. Note that printk() itself still uses locks because it tries to flush the messages to the console immediately. Also the per-CPU temporary buffers are still there because they prevent infinite recursion and serialize backtraces from NMI. All this is going to change in the future. - kmsg_dump API rework and cleanup as a side effect of the logbuf_lock removal. - Make bstr_printf() aware that %pf and %pF formats could deference the given pointer. - Show also page flags by %pGp format. - Clarify the documentation for plain pointer printing. - Do not show no_hash_pointers warning multiple times. - Update Senozhatsky email address. - Some clean up. * tag 'printk-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (24 commits) lib/vsprintf.c: remove leftover 'f' and 'F' cases from bstr_printf() printk: clarify the documentation for plain pointer printing kernel/printk.c: Fixed mundane typos printk: rename vprintk_func to vprintk vsprintf: dump full information of page flags in pGp mm, slub: don't combine pr_err with INFO mm, slub: use pGp to print page flags MAINTAINERS: update Senozhatsky email address lib/vsprintf: do not show no_hash_pointers message multiple times printk: console: remove unnecessary safe buffer usage printk: kmsg_dump: remove _nolock() variants printk: remove logbuf_lock printk: introduce a kmsg_dump iterator printk: kmsg_dumper: remove @active field printk: add syslog_lock printk: use atomic64_t for devkmsg_user.seq printk: use seqcount_latch for clear_seq printk: introduce CONSOLE_LOG_MAX printk: consolidate kmsg_dump_get_buffer/syslog_print_all code printk: refactor kmsg_dump_get_buffer() ...
2021-04-28seq_file: Add a seq_bprintf functionFlorent Revest1-0/+18
Similarly to seq_buf_bprintf in lib/seq_buf.c, this function writes a printf formatted string with arguments provided in a "binary representation" built by functions such as vbin_printf. Signed-off-by: Florent Revest <revest@chromium.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210427174313.860948-2-revest@chromium.org
2021-04-28ceph: don't allow access to MDS-private inodesJeff Layton4-0/+42
The MDS reserves a set of inodes for its own usage, and these should never be accessible to clients. Add a new helper to vet a proposed inode number against that range, and complain loudly and refuse to create or look it up if it's in it. Also, ensure that the MDS doesn't try to delegate inodes that are in that range or lower. Print a warning if it does, and don't save the range in the xarray. URL: https://tracker.ceph.com/issues/49922 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: fix up some bare fetches of i_sizeJeff Layton4-16/+16
We need to use i_size_read(), which properly handles the torn read case on 32-bit arches. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: convert some PAGE_SIZE invocations to thp_size()Jeff Layton1-26/+25
Start preparing to allow the use of THPs in the pagecache with ceph by making it use thp_size() in lieu of PAGE_SIZE in the appropriate places. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: support getting ceph.dir.rsnaps vxattrYanhu Cao5-2/+18
Add support for grabbing the rsnaps value out of the inode info in traces, and exposing that via ceph.dir.rsnaps xattr. Signed-off-by: Yanhu Cao <gmayyyha@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: drop pinned_page parameter from ceph_get_capsJeff Layton4-26/+13
All of the existing callers that don't set this to NULL just drop the page reference at some arbitrary point later in processing. There's no point in keeping a page reference that we don't use, so just drop the reference immediately after checking the Uptodate flag. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: fix inode leak on getattr error in __fh_to_dentryJeff Layton1-1/+3
Fixes: 878dabb64117 ("ceph: don't return -ESTALE if there's still an open file") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: only check pool permissions for regular filesJeff Layton1-0/+4
There is no need to do a ceph_pool_perm_check() on anything that isn't a regular file, as the MDS is what handles talking to the OSD in those cases. Just return 0 if it's not a regular file. Reported-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: send opened files/pinned caps/opened inodes metrics to MDS daemonXiubo Li2-2/+80
For the old ceph version, if it received this metric info, it will just ignore them. URL: https://tracker.ceph.com/issues/46866 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: avoid counting the same request twice or moreXiubo Li1-10/+10
If the request will retry, skip updating the latency metric. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: rename the metric helpersXiubo Li6-35/+35
Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: fix kerneldoc copypasta over ceph_start_io_directJeff Layton1-1/+1
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: use attach/detach_page_private for tracking snap contextJeff Layton1-11/+7
There is some ambiguity around the use of PagePrivate. It's generally expected in core code that if PagePrivate is set then you have a reference to it. It's not clear that ceph always does (and I believe it may not). Change ceph to use attach/detach_page_private so that we keep a reference to the page until the snap context is detached. Link: https://lore.kernel.org/ceph-devel/2503810.1616508988@warthog.procyon.org.uk/T/#mf29e5abbb0ec8035cde0de30778690de7d956f84 Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: don't use d_add in ceph_handle_snapdirJeff Layton3-14/+25
It's possible ceph_get_snapdir could end up finding a (disconnected) inode that already exists in the cache. Change the prototype for ceph_handle_snapdir to return a dentry pointer and have it use d_splice_alias so we don't end up with an aliased dentry in the cache. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: don't clobber i_snap_caps on non-I_NEW inodeJeff Layton1-4/+5
We want the snapdir to mirror the non-snapped directory's attributes for most things, but i_snap_caps represents the caps granted on the snapshot directory by the MDS itself. A misbehaving MDS could issue different caps for the snapdir and we lose them here. Only reset i_snap_caps when the inode is I_NEW. Also, move the setting of i_op and i_fop inside the if block since they should never change anyway. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: fix fall-through warnings for ClangGustavo A. R. Silva1-0/+2
In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple of warnings by explicitly adding a break and a goto statements instead of just letting the code fall through to the next case. URL: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: convert ceph_readpages to ceph_readaheadJeff Layton1-199/+31
Convert ceph_readpages to ceph_readahead and make it use netfs_readahead. With this we can rip out a lot of the old readpage/readpages infrastructure. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-28ceph: convert ceph_write_begin to netfs_write_beginJeff Layton1-129/+57
Convert ceph_write_begin to use the netfs_write_begin helper. Most of the ops we need for it are already in place from the readpage conversion but we do add a new check_write_begin op since ceph needs to be able to vet whether there is an incompatible writeback already in flight before reading in the page. With this, we can also remove the old ceph_do_readpage helper. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>