Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- two fixes preventing deletion and manual creation of subvolume qgroup
- unify error code returned for unknown send flags
- fix assertion during subvolume creation when anonymous device could
be allocated by other thread (e.g. due to backref walk)
* tag 'for-6.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: do not ASSERT() if the newly created subvolume already got read
btrfs: forbid deleting live subvol qgroup
btrfs: forbid creating subvol qgroups
btrfs: send: return EOPNOTSUPP on unknown flags
|
|
Creating a qgroup 0/subvolid leads to various races and it isn't
helpful, because you can't specify a subvol id when creating a subvol,
so you can't be sure it will be the right one. Any requirements on the
automatic subvol can be gratified by using a higher level qgroup and the
inheritance parameters of subvol creation.
Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- zoned mode fixes:
- fix slowdown when writing large file sequentially by looking up
block groups with enough space faster
- locking fixes when activating a zone
- new mount API fixes:
- preserve mount options for a ro/rw mount of the same subvolume
- scrub fixes:
- fix use-after-free in case the chunk length is not aligned to
64K, this does not happen normally but has been reported on
images converted from ext4
- similar alignment check was missing with raid-stripe-tree
- subvolume deletion fixes:
- prevent calling ioctl on already deleted subvolume
- properly track flag tracking a deleted subvolume
- in subpage mode, fix decompression of an inline extent (zlib, lzo,
zstd)
- fix crash when starting writeback on a folio, after integration with
recent MM changes this needs to be started conditionally
- reject unknown flags in defrag ioctl
- error handling, API fixes, minor warning fixes
* tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: scrub: limit RST scrub to chunk boundary
btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
btrfs: don't unconditionally call folio_start_writeback in subpage
btrfs: use the original mount's mount options for the legacy reconfigure
btrfs: don't warn if discard range is not aligned to sector
btrfs: tree-checker: fix inline ref size in error messages
btrfs: zstd: fix and simplify the inline extent decompression
btrfs: lzo: fix and simplify the inline extent decompression
btrfs: zlib: fix and simplify the inline extent decompression
btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args
btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted
btrfs: don't abort filesystem when attempting to snapshot deleted subvolume
btrfs: zoned: fix lock ordering in btrfs_zone_activate()
btrfs: fix unbalanced unlock of mapping_tree_lock
btrfs: ref-verify: free ref cache before clearing mount opt
btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send()
btrfs: zoned: optimize hint byte for zoned allocator
btrfs: zoned: factor out prepare_allocation_zoned()
|
|
Add extra sanity check for btrfs_ioctl_defrag_range_args::flags.
This is not really to enhance fuzzing tests, but as a preparation for
future expansion on btrfs_ioctl_defrag_range_args.
In the future we're going to add new members, allowing more fine tuning
for btrfs defrag. Without the -ENONOTSUPP error, there would be no way
to detect if the kernel supports those new defrag features.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If the source file descriptor to the snapshot ioctl refers to a deleted
subvolume, we get the following abort:
BTRFS: Transaction aborted (error -2)
WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs]
Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c
CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs]
RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027
RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840
RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998
R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe
R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80
FS: 00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0
Call Trace:
<TASK>
? create_pending_snapshot+0x1040/0x1190 [btrfs]
? __warn+0x81/0x130
? create_pending_snapshot+0x1040/0x1190 [btrfs]
? report_bug+0x171/0x1a0
? handle_bug+0x3a/0x70
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? create_pending_snapshot+0x1040/0x1190 [btrfs]
? create_pending_snapshot+0x1040/0x1190 [btrfs]
create_pending_snapshots+0x92/0xc0 [btrfs]
btrfs_commit_transaction+0x66b/0xf40 [btrfs]
btrfs_mksubvol+0x301/0x4d0 [btrfs]
btrfs_mksnapshot+0x80/0xb0 [btrfs]
__btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs]
btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs]
btrfs_ioctl+0x8a6/0x2650 [btrfs]
? kmem_cache_free+0x22/0x340
? do_sys_openat2+0x97/0xe0
__x64_sys_ioctl+0x97/0xd0
do_syscall_64+0x46/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7fe20abe83af
RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af
RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58
</TASK>
---[ end trace 0000000000000000 ]---
BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry
BTRFS info (device vdc: state EA): forced readonly
BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction.
BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry
This happens because create_pending_snapshot() initializes the new root
item as a copy of the source root item. This includes the refs field,
which is 0 for a deleted subvolume. The call to btrfs_insert_root()
therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then
finds the root and returns -ENOENT if refs == 0, which causes
create_pending_snapshot() to abort.
Fix it by checking the source root's refs before attempting the
snapshot, but after locking subvol_sem to avoid racing with deletion.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pull vfs rw updates from Christian Brauner:
"This contains updates from Amir for read-write backing file helpers
for stacking filesystems such as overlayfs:
- Fanotify is currently in the process of introducing pre content
events. Roughly, a new permission event will be added indicating
that it is safe to write to the file being accessed. These events
are used by hierarchical storage managers to e.g., fill the content
of files on first access.
During that work we noticed that our current permission checking is
inconsistent in rw_verify_area() and remap_verify_area().
Especially in the splice code permission checking is done multiple
times. For example, one time for the whole range and then again for
partial ranges inside the iterator.
In addition, we mostly do permission checking before we call
file_start_write() except for a few places where we call it after.
For pre-content events we need such permission checking to be done
before file_start_write(). So this is a nice reason to clean this
all up.
After this series, all permission checking is done before
file_start_write().
As part of this cleanup we also massaged the splice code a bit. We
got rid of a few helpers because we are alredy drowning in special
read-write helpers. We also cleaned up the return types for splice
helpers.
- Introduce generic read-write helpers for backing files. This lifts
some overlayfs code to common code so it can be used by the FUSE
passthrough work coming in over the next cycles. Make Amir and
Miklos the maintainers for this new subsystem of the vfs"
* tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
fs: fix __sb_write_started() kerneldoc formatting
fs: factor out backing_file_mmap() helper
fs: factor out backing_file_splice_{read,write}() helpers
fs: factor out backing_file_{read,write}_iter() helpers
fs: prepare for stackable filesystems backing file helpers
fsnotify: optionally pass access range in file permission hooks
fsnotify: assert that file_start_write() is not held in permission hooks
fsnotify: split fsnotify_perm() into two hooks
fs: use splice_copy_file_range() inline helper
splice: return type ssize_t from all helpers
fs: use do_splice_direct() for nfsd/ksmbd server-side-copy
fs: move file_start_write() into direct_splice_actor()
fs: fork splice_file_range() from do_splice_direct()
fs: create {sb,file}_write_not_started() helpers
fs: create file_write_started() helper
fs: create __sb_write_started() helper
fs: move kiocb_start_write() into vfs_iocb_iter_write()
fs: move permission hook out of do_iter_read()
fs: move permission hook out of do_iter_write()
fs: move file_start_write() into vfs_iter_write()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix that verifies that the snapshot source is a root, same
check is also done in user space but should be done by the ioctl as
well"
* tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: do not allow non subvolume root targets for snapshot
|
|
Our btrfs subvolume snapshot <source> <destination> utility enforces
that <source> is the root of the subvolume, however this isn't enforced
in the kernel. Update the kernel to also enforce this limitation to
avoid problems with other users of this ioctl that don't have the
appropriate checks in place.
Reported-by: Martin Michaelis <code@mgjm.de>
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes and message updates:
- for simple quotas, handle the case when a snapshot is created and
the target qgroup already exists
- fix a warning when file descriptor given to send ioctl is not
writable
- fix off-by-one condition when checking chunk maps
- free pages when page array allocation fails during compression
read, other cases were handled
- fix memory leak on error handling path in ref-verify debugging
feature
- copy missing struct member 'version' in 64/32bit compat send ioctl
- tree-checker verifies inline backref ordering
- print messages to syslog on first mount and last unmount
- update error messages when reading chunk maps"
* tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: ensure send_fd is writable
btrfs: free the allocated memory if btrfs_alloc_page_array() fails
btrfs: fix 64bit compat send ioctl arguments not initializing version member
btrfs: make error messages more clear when getting a chunk map
btrfs: fix off-by-one when checking chunk map includes logical address
btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod()
btrfs: add dmesg output for first mount and last unmount of a filesystem
btrfs: do not abort transaction if there is already an existing qgroup
btrfs: tree-checker: add type and sequence check for inline backrefs
|
|
When the send protocol versioning was added in 5.16 e77fbf990316
("btrfs: send: prepare for v2 protocol"), the 32/64bit compat code was
not updated (added by 2351f431f727 ("btrfs: fix send ioctl on 32bit with
64bit kernel")), missing the version struct member. The compat code is
probably rarely used, nobody reported any bugs.
Found by tool https://github.com/jirislaby/clang-struct .
Fixes: e77fbf990316 ("btrfs: send: prepare for v2 protocol")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In vfs code, file_start_write() is usually called after the permission
hook in rw_verify_area(). btrfs_ioctl_encoded_write() in an exception
to this rule.
Move file_start_write() to after the rw_verify_area() check in encoded
write to make the permission hook "start-write-safe".
This is needed for fanotify "pre content" events.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-9-amir73il@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix potential overflow in returned value from SEARCH_TREE_V2
ioctl on 32bit architecture
- zoned mode fixes:
- drop unnecessary write pointer check for RAID0/RAID1/RAID10
profiles, now it works because of raid-stripe-tree
- wait for finishing the zone when direct IO needs a new
allocation
- simple quota fixes:
- pass correct owning root pointer when cleaning up an
aborted transaction
- fix leaking some structures when processing delayed refs
- change key type number of BTRFS_EXTENT_OWNER_REF_KEY,
reorder it before inline refs that are supposed to be
sorted, keeping the original number would complicate a lot
of things; this change needs an updated version of
btrfs-progs to work and filesystems need to be recreated
- fix error pointer dereference after failure to allocate fs
devices
- fix race between accounting qgroup extents and removing a
qgroup
* tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: make OWNER_REF_KEY type value smallest among inline refs
btrfs: fix qgroup record leaks when using simple quotas
btrfs: fix race between accounting qgroup extents and removing a qgroup
btrfs: fix error pointer dereference after failure to allocate fs devices
btrfs: make found_logical_ret parameter mandatory for function queue_scrub_stripe()
btrfs: get correct owning_root when dropping snapshot
btrfs: zoned: wait for data BG to be finished on direct IO allocation
btrfs: zoned: drop no longer valid write pointer check
btrfs: directly return 0 on no error code in btrfs_insert_raid_extent()
btrfs: use u64 for buffer sizes in the tree search ioctls
|
|
In the tree search v2 ioctl we use the type size_t, which is an unsigned
long, to track the buffer size in the local variable 'buf_size'. An
unsigned long is 32 bits wide on a 32 bits architecture. The buffer size
defined in struct btrfs_ioctl_search_args_v2 is a u64, so when we later
try to copy the local variable 'buf_size' to the argument struct, when
the search returns -EOVERFLOW, we copy only 32 bits which will be a
problem on big endian systems.
Fix this by using a u64 type for the buffer sizes, not only at
btrfs_ioctl_tree_search_v2(), but also everywhere down the call chain
so that we can use the u64 at btrfs_ioctl_tree_search_v2().
Fixes: cc68a8a5a433 ("btrfs: new ioctl TREE_SEARCH_V2")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-btrfs/ce6f4bd6-9453-4ffe-ba00-cee35495e10f@moroto.mountain/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"New features:
- raid-stripe-tree
New tree for logical file extent mapping where the physical mapping
may not match on multiple devices. This is now used in zoned mode
to implement RAID0/RAID1* profiles, but can be used in non-zoned
mode as well. The support for RAID56 is in development and will
eventually fix the problems with the current implementation. This
is a backward incompatible feature and has to be enabled at mkfs
time.
- simple quota accounting (squota)
A simplified mode of qgroup that accounts all space on the initial
extent owners (a subvolume), the snapshots are then cheap to create
and delete. The deletion of snapshots in fully accounting qgroups
is a known CPU/IO performance bottleneck.
The squota is not suitable for the general use case but works well
for containers where the original subvolume exists for the whole
time. This is a backward incompatible feature as it needs extending
some structures, but can be enabled on an existing filesystem.
- temporary filesystem fsid (temp_fsid)
The fsid identifies a filesystem and is hard coded in the
structures, which disallows mounting the same fsid found on
different devices.
For a single device filesystem this is not strictly necessary, a
new temporary fsid can be generated on mount e.g. after a device is
cloned. This will be used by Steam Deck for root partition A/B
testing, or can be used for VM root images.
Other user visible changes:
- filesystems with partially finished metadata_uuid conversion cannot
be mounted anymore and the uuid fixup has to be done by btrfs-progs
(btrfstune).
Performance improvements:
- reduce reservations for checksum deletions (with enabled free space
tree by factor of 4), on a sample workload on file with many
extents the deletion time decreased by 12%
- make extent state merges more efficient during insertions, reduce
rb-tree iterations (run time of critical functions reduced by 5%)
Core changes:
- the integrity check functionality has been removed, this was a
debugging feature and removal does not affect other integrity
checks like checksums or tree-checker
- space reservation changes:
- more efficient delayed ref reservations, this avoids building up
too much work or overusing or exhausting the global block
reserve in some situations
- move delayed refs reservation to the transaction start time,
this prevents some ENOSPC corner cases related to exhaustion of
global reserve
- improvements in reducing excessive reservations for block group
items
- adjust overcommit logic in near full situations, account for one
more chunk to eventually allocate metadata chunk, this is mostly
relevant for small filesystems (<10GiB)
- single device filesystems are scanned but not registered (except
seed devices), this allows temp_fsid to work
- qgroup iterations do not need GFP_ATOMIC allocations anymore
- cleanups, refactoring, reduced data structure size, function
parameter simplifications, error handling fixes"
* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
btrfs: open code timespec64 in struct btrfs_inode
btrfs: remove redundant log root tree index assignment during log sync
btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
btrfs: sysfs: show temp_fsid feature
btrfs: disable the device add feature for temp-fsid
btrfs: disable the seed feature for temp-fsid
btrfs: update comment for temp-fsid, fsid, and metadata_uuid
btrfs: remove pointless empty log context list check when syncing log
btrfs: update comment for struct btrfs_inode::lock
btrfs: remove pointless barrier from btrfs_sync_file()
btrfs: add and use helpers for reading and writing last_trans_committed
btrfs: add and use helpers for reading and writing fs_info->generation
btrfs: add and use helpers for reading and writing log_transid
btrfs: add and use helpers for reading and writing last_log_commit
btrfs: support cloned-device mount capability
btrfs: add helper function find_fsid_by_disk
btrfs: stop reserving excessive space for block group item insertions
btrfs: stop reserving excessive space for block group item updates
btrfs: reorder btrfs_inode to fill gaps
btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
...
|
|
Convert btrfs to use bdev_open_by_path() and pass the handle around. We
also drop the holder from struct btrfs_device as it is now not needed
anymore.
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-20-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The device addition operation will transform the cloned temp-fsid mounted
device into a multi-device filesystem. Therefore, it is marked as
unsupported.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the last_trans_committed field of struct btrfs_fs_info is
modified and read without any locking or other protection. For example
early in the fsync path, skip_inode_logging() is called which reads
fs_info->last_trans_committed, but at the same time we can have a
transaction commit completing and updating that field.
In the case of an fsync this is harmless and any data race should be
rare and at most cause an unnecessary logging of an inode.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_trans_committed field of struct btrfs_fs_info using
READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the generation field of struct btrfs_fs_info is always modified
while holding fs_info->trans_lock locked. Most readers will access this
field without taking that lock but while holding a transaction handle,
which is safe to do due to the transaction life cycle.
However there are other readers that are neither holding the lock nor
holding a transaction handle open:
1) When reading an inode from disk, at btrfs_read_locked_inode();
2) When reading the generation to expose it to sysfs, at
btrfs_generation_show();
3) Early in the fsync path, at skip_inode_logging();
4) When creating a hole at btrfs_cont_expand(), during write paths,
truncate and reflinking;
5) In the fs_info ioctl (btrfs_ioctl_fs_info());
6) While mounting the filesystem, in the open_ctree() path. In these
cases it's safe to directly read fs_info->generation as no one
can concurrently start a transaction and update fs_info->generation.
In case of the fsync path, races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective. In the other cases it's not
so clear if functional problems may arise, though in case 1 rare things
like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
flag not being set on an inode and therefore result in incorrect logging
later on in case a fsync call is made.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the generation field of struct btrfs_fs_info using READ_ONCE() and
WRITE_ONCE(), and use these helpers where needed.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The root argument for btrfs_update_inode() always matches the root of the
given inode, so remove the root argument and get it from the inode
argument.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Relocation COWs metadata blocks in two cases for the reloc root:
- copying the subvolume root item when creating the reloc root
- copying a btree node when there is a COW during relocation
In both cases, the resulting btree node hits an abnormal code path with
respect to the owner field in its btrfs_header. It first creates the
root item for the new objectid, which populates the reloc root id, and
it at this point that delayed refs are created.
Later, it fully copies the old node into the new node (including the
original owner field) which overwrites it. This results in a simple
quotas mismatch where we run the delayed ref for the reloc root which
has no simple quota effect (reloc root is not an fstree) but when we
ultimately delete the node, the owner is the real original fstree and we
do free the space.
To work around this without tampering with the behavior of relocation,
add a parameter to btrfs_add_tree_block that lets the relocation code
path specify a different owning root than the "operating" root (in this
case, owning root is the real root and the operating root is the reloc
root). These can naturally be plumbed into delayed refs that have the
same concept.
Note that this is a double count in some sense, but a relatively natural
one, as there are really two extents, and the old one will be deleted
soon. This is consistent with how data relocation extents are accounted
by simple quotas.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Consider the following sequence:
- enable quotas
- create subvol S id 256 at dir outer/
- create a qgroup 1/100
- add 0/256 (S's auto qgroup) to 1/100
- create subvol T id 257 at dir outer/inner/
With full qgroups, there is no relationship between 0/257 and either of
0/256 or 1/100. There is an inherit feature that the creator of inner/
can use to specify it ought to be in 1/100.
Simple quotas are targeted at container isolation, where such automatic
inheritance for not necessarily trusted/controlled nested subvol
creation would be quite helpful. Therefore, add a new default behavior
for simple quotas: when you create a nested subvol, automatically
inherit as parents any parents of the qgroup of the subvol the new inode
is going in.
In our example, 257/0 would also be under 1/100, allowing easy control
of a total quota over an arbitrary hierarchy of subvolumes.
I think this _might_ be a generally useful behavior, so it could be
interesting to put it behind a new inheritance flag that simple quotas
always use while traditional quotas let the user specify, but this is a
minimally intrusive change to start.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add a new quota mode called "simple quotas". It can be enabled by the
existing quota enable ioctl via a new command, and sets an incompat
bit, as the implementation of simple quotas will make backwards
incompatible changes to the disk format of the extent tree.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.
So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- reject unknown mount options
- adjust transaction abort error message level
- fix one more build warning with -Wmaybe-uninitialized
- proper error handling in several COW-related cases
* tag 'for-6.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: error out when reallocating block for defrag using a stale transaction
btrfs: error when COWing block from a root that is being deleted
btrfs: error out when COWing block using a stale transaction
btrfs: always print transaction aborted messages with an error level
btrfs: reject unknown mount options early
btrfs: fix some -Wmaybe-uninitialized warnings in ioctl.c
|
|
Jens reported the following warnings from -Wmaybe-uninitialized recent
Linus' branch.
In file included from ./include/asm-generic/rwonce.h:26,
from ./arch/arm64/include/asm/rwonce.h:71,
from ./include/linux/compiler.h:246,
from ./include/linux/export.h:5,
from ./include/linux/linkage.h:7,
from ./include/linux/kernel.h:17,
from fs/btrfs/ioctl.c:6:
In function ‘instrument_copy_from_user_before’,
inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3,
inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7,
inlined from ‘btrfs_ioctl_space_info’ at fs/btrfs/ioctl.c:2999:6,
inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4616:10:
./include/linux/kasan-checks.h:38:27: warning: ‘space_args’ may be used
uninitialized [-Wmaybe-uninitialized]
38 | #define kasan_check_write __kasan_check_write
./include/linux/instrumented.h:129:9: note: in expansion of macro
‘kasan_check_write’
129 | kasan_check_write(to, n);
| ^~~~~~~~~~~~~~~~~
./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’:
./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const
volatile void *’ to ‘__kasan_check_write’ declared here
20 | bool __kasan_check_write(const volatile void *p, unsigned int
size);
| ^~~~~~~~~~~~~~~~~~~
fs/btrfs/ioctl.c:2981:39: note: ‘space_args’ declared here
2981 | struct btrfs_ioctl_space_args space_args;
| ^~~~~~~~~~
In function ‘instrument_copy_from_user_before’,
inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3,
inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7,
inlined from ‘_btrfs_ioctl_send’ at fs/btrfs/ioctl.c:4343:9,
inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4658:10:
./include/linux/kasan-checks.h:38:27: warning: ‘args32’ may be used
uninitialized [-Wmaybe-uninitialized]
38 | #define kasan_check_write __kasan_check_write
./include/linux/instrumented.h:129:9: note: in expansion of macro
‘kasan_check_write’
129 | kasan_check_write(to, n);
| ^~~~~~~~~~~~~~~~~
./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’:
./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const
volatile void *’ to ‘__kasan_check_write’ declared here
20 | bool __kasan_check_write(const volatile void *p, unsigned int
size);
| ^~~~~~~~~~~~~~~~~~~
fs/btrfs/ioctl.c:4341:49: note: ‘args32’ declared here
4341 | struct btrfs_ioctl_send_args_32 args32;
| ^~~~~~
This was due to his config options and having KASAN turned on,
which adds some extra checks around copy_from_user(), which then
triggered the -Wmaybe-uninitialized checker for these cases.
Fix the warnings by initializing the different structs we're copying
into.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- several fixes for handling directory item (inserting, removing,
iteration, error handling)
- fix transaction commit stalls when auto relocation is running and
blocks other tasks that want to commit
- fix a build error when DEBUG is enabled
- fix lockdep warning in inode number lookup ioctl
- fix race when finishing block group creation
- remove link to obsolete wiki in several files
* tag 'for-6.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org
btrfs: assert delayed node locked when removing delayed item
btrfs: remove BUG() after failure to insert delayed dir index item
btrfs: improve error message after failure to add delayed dir index item
btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio
btrfs: check for BTRFS_FS_ERROR in pending ordered assert
btrfs: fix lockdep splat and potential deadlock after failure running delayed items
btrfs: do not block starts waiting on previous transaction commit
btrfs: release path before inode lookup during the ino lookup ioctl
btrfs: fix race between finishing block group creation and its item update
|
|
During the ino lookup ioctl we can end up calling btrfs_iget() to get an
inode reference while we are holding on a root's btree. If btrfs_iget()
needs to lookup the inode from the root's btree, because it's not
currently loaded in memory, then it will need to lock another or the
same path in the same root btree. This may result in a deadlock and
trigger the following lockdep splat:
WARNING: possible circular locking dependency detected
6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Not tainted
------------------------------------------------------
syz-executor277/5012 is trying to acquire lock:
ffff88802df41710 (btrfs-tree-01){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
but task is already holding lock:
ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-tree-00){++++}-{3:3}:
down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
__btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
btrfs_search_slot+0x13a4/0x2f80 fs/btrfs/ctree.c:2302
btrfs_init_root_free_objectid+0x148/0x320 fs/btrfs/disk-io.c:4955
btrfs_init_fs_root fs/btrfs/disk-io.c:1128 [inline]
btrfs_get_root_ref+0x5ae/0xae0 fs/btrfs/disk-io.c:1338
btrfs_get_fs_root fs/btrfs/disk-io.c:1390 [inline]
open_ctree+0x29c8/0x3030 fs/btrfs/disk-io.c:3494
btrfs_fill_super+0x1c7/0x2f0 fs/btrfs/super.c:1154
btrfs_mount_root+0x7e0/0x910 fs/btrfs/super.c:1519
legacy_get_tree+0xef/0x190 fs/fs_context.c:611
vfs_get_tree+0x8c/0x270 fs/super.c:1519
fc_mount fs/namespace.c:1112 [inline]
vfs_kern_mount+0xbc/0x150 fs/namespace.c:1142
btrfs_mount+0x39f/0xb50 fs/btrfs/super.c:1579
legacy_get_tree+0xef/0x190 fs/fs_context.c:611
vfs_get_tree+0x8c/0x270 fs/super.c:1519
do_new_mount+0x28f/0xae0 fs/namespace.c:3335
do_mount fs/namespace.c:3675 [inline]
__do_sys_mount fs/namespace.c:3884 [inline]
__se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (btrfs-tree-01){++++}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3142 [inline]
check_prevs_add kernel/locking/lockdep.c:3261 [inline]
validate_chain kernel/locking/lockdep.c:3876 [inline]
__lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
__btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline]
btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281
btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154
btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412
btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline]
btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716
btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline]
btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105
btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
rlock(btrfs-tree-00);
lock(btrfs-tree-01);
lock(btrfs-tree-00);
rlock(btrfs-tree-01);
*** DEADLOCK ***
1 lock held by syz-executor277/5012:
#0: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
stack backtrace:
CPU: 1 PID: 5012 Comm: syz-executor277 Not tainted 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195
check_prev_add kernel/locking/lockdep.c:3142 [inline]
check_prevs_add kernel/locking/lockdep.c:3261 [inline]
validate_chain kernel/locking/lockdep.c:3876 [inline]
__lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
__btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline]
btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281
btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154
btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412
btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline]
btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716
btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline]
btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105
btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f0bec94ea39
Fix this simply by releasing the path before calling btrfs_iget() as at
point we don't need the path anymore.
Reported-by: syzbot+bf66ad948981797d2f1d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000045fa140603c4a969@google.com/
Fixes: 23d0b79dfaed ("btrfs: Add unprivileged version of ino_lookup ioctl")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-27-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Various cleanups all around (Irvin, Chaitanya, Christophe)
- Better struct packing (Christophe JAILLET)
- Reduce controller error logs for optional commands (Keith)
- Support for >=64KiB block sizes (Daniel Gomez)
- Fabrics fixes and code organization (Max, Chaitanya, Daniel
Wagner)
- bcache updates via Coly:
- Fix a race at init time (Mingzhe Zou)
- Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page
additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal
with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks
(Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
Jordy, Li, Min, Yu, Zhong, Waiman)
* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
scsi/sg: don't grab scsi host module reference
ext4: Fix warning in blkdev_put()
block: don't return -EINVAL for not found names in devt_from_devname
cdrom: Fix spectre-v1 gadget
block: Improve kernel-doc headers
blk-mq: don't insert passthrough request into sw queue
bsg: make bsg_class a static const structure
ublk: make ublk_chr_class a static const structure
aoe: make aoe_class a static const structure
block/rnbd: make all 'class' structures const
block: fix the exclusive open mask in disk_scan_partitions
block: add overflow checks for Amiga partition support
block: change all __u32 annotations to __be32 in affs_hardblocks.h
block: fix signed int overflow in Amiga partition support
block: add capacity validation in bdev_add_partition()
block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
block: disallow Persistent Reservation on partitions
reiserfs: fix blkdev_put() warning from release_journal_dev()
block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
block: document the holder argument to blkdev_get_by_path
...
|
|
There is an internal error report that scrub found an error in an orphan
inode's data.
However there are very limited ways to cleanup such orphan inodes:
- btrfs_start_pre_rw_mount()
This happens at either mount, or RO->RW switch.
This is not a viable solution for root fs which may not be unmounted
or RO mounted.
Furthermore this doesn't cover every subvolume, it only covers the
currently cached subvolumes.
- btrfs_lookup_dentry()
This happens when we first lookup the subvolume dentry.
But dentry can be cached thus it's not ensured to be triggered every
time.
- create_snapshot()
This only happens for the created snapshot, not the source one.
This means if we didn't trigger orphan items cleanup, there is really no
other way to manually trigger it. Add this step to the START_SYNC ioctl.
This is a slight change in the semantics of the ioctl but as sync can be
potentially slow and is usually paired with WAIT_SYNC ioctl.
The errors are not handled because the main point of the ioctl is the
async commit, orphan cleanup is a side effect.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
A small code simplification, move the default value of transid to its
initialization and remove the else-statement.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Recently a Meta-internal workload encountered subvolume creation taking
up to 2s each, significantly slower than directory creation. As they
were hoping to be able to use subvolumes instead of directories, and
were looking to create hundreds, this was a significant issue. After
Josef investigated, it turned out to be due to the transaction commit
currently performed at the end of subvolume creation.
This change improves the workload by not doing transaction commit for every
subvolume creation, and merely requiring a transaction commit on fsync.
In the worst case, of doing a subvolume create and fsync in a loop, this
should require an equal amount of time to the current scheme; and in the
best case, the internal workload creating hundreds of subvolumes before
fsyncing is greatly improved.
While it would be nice to be able to use the log tree and use the normal
fsync path, log tree replay can't deal with new subvolume inodes
presently.
It's possible that there's some reason that the transaction commit is
necessary for correctness during subvolume creation; however,
git logs indicate that the commit dates back to the beginning of
subvolume creation, and there are no notes on why it would be necessary.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder. Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.
For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Balance as exclusive state is compatible with paused balance and device
add, which makes some things more complicated. The assertion of valid
states when starting from paused balance needs to take into account two
more states, the combinations can be hit when there are several threads
racing to start balance and device add. This won't typically happen when
the commands are started from command line.
Scenario 1: With exclusive_operation state == BTRFS_EXCLOP_NONE.
Concurrently adding multiple devices to the same mount point and
btrfs_exclop_finish executed finishes before assertion in
btrfs_exclop_balance, exclusive_operation will changed to
BTRFS_EXCLOP_NONE state which lead to assertion failed:
fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE ||
fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD,
in fs/btrfs/ioctl.c:456
Call Trace:
<TASK>
btrfs_exclop_balance+0x13c/0x310
? memdup_user+0xab/0xc0
? PTR_ERR+0x17/0x20
btrfs_ioctl_add_dev+0x2ee/0x320
btrfs_ioctl+0x9d5/0x10d0
? btrfs_ioctl_encoded_write+0xb80/0xb80
__x64_sys_ioctl+0x197/0x210
do_syscall_64+0x3c/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Scenario 2: With exclusive_operation state == BTRFS_EXCLOP_BALANCE_PAUSED.
Concurrently adding multiple devices to the same mount point and
btrfs_exclop_balance executed finish before the latter thread execute
assertion in btrfs_exclop_balance, exclusive_operation will changed to
BTRFS_EXCLOP_BALANCE_PAUSED state which lead to assertion failed:
fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE ||
fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD ||
fs_info->exclusive_operation == BTRFS_EXCLOP_NONE,
fs/btrfs/ioctl.c:458
Call Trace:
<TASK>
btrfs_exclop_balance+0x240/0x410
? memdup_user+0xab/0xc0
? PTR_ERR+0x17/0x20
btrfs_ioctl_add_dev+0x2ee/0x320
btrfs_ioctl+0x9d5/0x10d0
? btrfs_ioctl_encoded_write+0xb80/0xb80
__x64_sys_ioctl+0x197/0x210
do_syscall_64+0x3c/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
An example of the failed assertion is below, which shows that the
paused balance is also needed to be checked.
root@syzkaller:/home/xsk# ./repro
Failed to add device /dev/vda, errno 14
Failed to add device /dev/vda, errno 14
Failed to add device /dev/vda, errno 14
Failed to add device /dev/vda, errno 14
Failed to add device /dev/vda, errno 14
Failed to add device /dev/vda, errno 14
Failed to add device /dev/vda, errno 14
Failed to add device /dev/vda, errno 14
Failed to add device /dev/vda, errno 14
[ 416.611428][ T7970] BTRFS info (device loop0): fs_info exclusive_operation: 0
Failed to add device /dev/vda, errno 14
[ 416.613973][ T7971] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
[ 416.615456][ T7972] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
[ 416.617528][ T7973] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
[ 416.618359][ T7974] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
[ 416.622589][ T7975] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
[ 416.624034][ T7976] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
[ 416.626420][ T7977] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
[ 416.627643][ T7978] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
[ 416.629006][ T7979] BTRFS info (device loop0): fs_info exclusive_operation: 3
[ 416.630298][ T7980] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
Failed to add device /dev/vda, errno 14
[ 416.632787][ T7981] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
[ 416.634282][ T7982] BTRFS info (device loop0): fs_info exclusive_operation: 3
Failed to add device /dev/vda, errno 14
[ 416.636202][ T7983] BTRFS info (device loop0): fs_info exclusive_operation: 3
[ 416.637012][ T7984] BTRFS info (device loop0): fs_info exclusive_operation: 1
Failed to add device /dev/vda, errno 14
[ 416.637759][ T7984] assertion failed: fs_info->exclusive_operation ==
BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation ==
BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation ==
BTRFS_EXCLOP_NONE, in fs/btrfs/ioctl.c:458
[ 416.639845][ T7984] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 416.640485][ T7984] CPU: 0 PID: 7984 Comm: repro Not tainted 6.2.0 #7
[ 416.641172][ T7984] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 416.642090][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e
[ 416.644423][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282
[ 416.645018][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000
[ 416.645763][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7
[ 416.646554][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000
[ 416.647299][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0
[ 416.648041][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000
[ 416.648785][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000
[ 416.649616][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 416.650238][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0
[ 416.650980][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 416.651725][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 416.652502][ T7984] PKRU: 55555554
[ 416.652888][ T7984] Call Trace:
[ 416.653241][ T7984] <TASK>
[ 416.653527][ T7984] btrfs_exclop_balance+0x240/0x410
[ 416.654036][ T7984] ? memdup_user+0xab/0xc0
[ 416.654465][ T7984] ? PTR_ERR+0x17/0x20
[ 416.654874][ T7984] btrfs_ioctl_add_dev+0x2ee/0x320
[ 416.655380][ T7984] btrfs_ioctl+0x9d5/0x10d0
[ 416.655822][ T7984] ? btrfs_ioctl_encoded_write+0xb80/0xb80
[ 416.656400][ T7984] __x64_sys_ioctl+0x197/0x210
[ 416.656874][ T7984] do_syscall_64+0x3c/0xb0
[ 416.657346][ T7984] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 416.657922][ T7984] RIP: 0033:0x4546af
[ 416.660170][ T7984] RSP: 002b:00007fa2985d4150 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 416.660972][ T7984] RAX: ffffffffffffffda RBX: 00007fa2985d4640 RCX: 00000000004546af
[ 416.661714][ T7984] RDX: 0000000000000000 RSI: 000000005000940a RDI: 0000000000000003
[ 416.662449][ T7984] RBP: 00007fa2985d41d0 R08: 0000000000000000 R09: 00007ffee37a4c4f
[ 416.663195][ T7984] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa2985d4640
[ 416.663951][ T7984] R13: 0000000000000009 R14: 000000000041b320 R15: 00007fa297dd4000
[ 416.664703][ T7984] </TASK>
[ 416.665040][ T7984] Modules linked in:
[ 416.665590][ T7984] ---[ end trace 0000000000000000 ]---
[ 416.666176][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e
[ 416.668775][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282
[ 416.669425][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000
[ 416.670235][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7
[ 416.671050][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000
[ 416.671867][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0
[ 416.672685][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000
[ 416.673501][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000
[ 416.674425][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 416.675114][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0
[ 416.675933][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 416.676760][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Link: https://lore.kernel.org/linux-btrfs/20230324031611.98986-1-xiaoshoukui@gmail.com/
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the introduction of scrub interface, the only flag that we support
is BTRFS_SCRUB_READONLY. Thus there is no sanity checks, if there are
some undefined flags passed in, we just ignore them.
This is problematic if we want to introduce new scrub flags, as we have
no way to determine if such flags are supported.
Address the problem by introducing a check for the flags, and if
unsupported flags are set, return -EOPNOTSUPP to inform the user space.
This check should be backported for all supported kernels before any new
scrub flags are introduced.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- scan block devices in non-exclusive mode to avoid temporary mkfs
failures
- fix race between quota disable and quota assign ioctls
- fix deadlock when aborting transaction during relocation with scrub
- ignore fiemap path cache when there are multiple paths for a node
* tag 'for-6.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: ignore fiemap path cache when there are multiple paths for a node
btrfs: fix deadlock when aborting transaction during relocation with scrub
btrfs: scan device in non-exclusive mode
btrfs: fix race between quota disable and quota assign ioctls
|
|
The quota assign ioctl can currently run in parallel with a quota disable
ioctl call. The assign ioctl uses the quota root, while the disable ioctl
frees that root, and therefore we can have a use-after-free triggered in
the assign ioctl, leading to a trace like the following when KASAN is
enabled:
[672.723][T736] BUG: KASAN: slab-use-after-free in btrfs_search_slot+0x2962/0x2db0
[672.723][T736] Read of size 8 at addr ffff888022ec0208 by task btrfs_search_sl/27736
[672.724][T736]
[672.725][T736] CPU: 1 PID: 27736 Comm: btrfs_search_sl Not tainted 6.3.0-rc3 #37
[672.723][T736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[672.727][T736] Call Trace:
[672.728][T736] <TASK>
[672.728][T736] dump_stack_lvl+0xd9/0x150
[672.725][T736] print_report+0xc1/0x5e0
[672.720][T736] ? __virt_addr_valid+0x61/0x2e0
[672.727][T736] ? __phys_addr+0xc9/0x150
[672.725][T736] ? btrfs_search_slot+0x2962/0x2db0
[672.722][T736] kasan_report+0xc0/0xf0
[672.729][T736] ? btrfs_search_slot+0x2962/0x2db0
[672.724][T736] btrfs_search_slot+0x2962/0x2db0
[672.723][T736] ? fs_reclaim_acquire+0xba/0x160
[672.722][T736] ? split_leaf+0x13d0/0x13d0
[672.726][T736] ? rcu_is_watching+0x12/0xb0
[672.723][T736] ? kmem_cache_alloc+0x338/0x3c0
[672.722][T736] update_qgroup_status_item+0xf7/0x320
[672.724][T736] ? add_qgroup_rb+0x3d0/0x3d0
[672.739][T736] ? do_raw_spin_lock+0x12d/0x2b0
[672.730][T736] ? spin_bug+0x1d0/0x1d0
[672.737][T736] btrfs_run_qgroups+0x5de/0x840
[672.730][T736] ? btrfs_qgroup_rescan_worker+0xa70/0xa70
[672.738][T736] ? __del_qgroup_relation+0x4ba/0xe00
[672.738][T736] btrfs_ioctl+0x3d58/0x5d80
[672.735][T736] ? tomoyo_path_number_perm+0x16a/0x550
[672.737][T736] ? tomoyo_execute_permission+0x4a0/0x4a0
[672.731][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50
[672.737][T736] ? __sanitizer_cov_trace_switch+0x54/0x90
[672.734][T736] ? do_vfs_ioctl+0x132/0x1660
[672.730][T736] ? vfs_fileattr_set+0xc40/0xc40
[672.730][T736] ? _raw_spin_unlock_irq+0x2e/0x50
[672.732][T736] ? sigprocmask+0xf2/0x340
[672.737][T736] ? __fget_files+0x26a/0x480
[672.732][T736] ? bpf_lsm_file_ioctl+0x9/0x10
[672.738][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50
[672.736][T736] __x64_sys_ioctl+0x198/0x210
[672.736][T736] do_syscall_64+0x39/0xb0
[672.731][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.739][T736] RIP: 0033:0x4556ad
[672.742][T736] </TASK>
[672.743][T736]
[672.748][T736] Allocated by task 27677:
[672.743][T736] kasan_save_stack+0x22/0x40
[672.741][T736] kasan_set_track+0x25/0x30
[672.741][T736] __kasan_kmalloc+0xa4/0xb0
[672.749][T736] btrfs_alloc_root+0x48/0x90
[672.746][T736] btrfs_create_tree+0x146/0xa20
[672.744][T736] btrfs_quota_enable+0x461/0x1d20
[672.743][T736] btrfs_ioctl+0x4a1c/0x5d80
[672.747][T736] __x64_sys_ioctl+0x198/0x210
[672.749][T736] do_syscall_64+0x39/0xb0
[672.744][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.756][T736]
[672.757][T736] Freed by task 27677:
[672.759][T736] kasan_save_stack+0x22/0x40
[672.759][T736] kasan_set_track+0x25/0x30
[672.756][T736] kasan_save_free_info+0x2e/0x50
[672.751][T736] ____kasan_slab_free+0x162/0x1c0
[672.758][T736] slab_free_freelist_hook+0x89/0x1c0
[672.752][T736] __kmem_cache_free+0xaf/0x2e0
[672.752][T736] btrfs_put_root+0x1ff/0x2b0
[672.759][T736] btrfs_quota_disable+0x80a/0xbc0
[672.752][T736] btrfs_ioctl+0x3e5f/0x5d80
[672.756][T736] __x64_sys_ioctl+0x198/0x210
[672.753][T736] do_syscall_64+0x39/0xb0
[672.765][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.769][T736]
[672.768][T736] The buggy address belongs to the object at ffff888022ec0000
[672.768][T736] which belongs to the cache kmalloc-4k of size 4096
[672.769][T736] The buggy address is located 520 bytes inside of
[672.769][T736] freed 4096-byte region [ffff888022ec0000, ffff888022ec1000)
[672.760][T736]
[672.764][T736] The buggy address belongs to the physical page:
[672.761][T736] page:ffffea00008bb000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22ec0
[672.766][T736] head:ffffea00008bb000 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[672.779][T736] flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
[672.770][T736] raw: 00fff00000010200 ffff888012842140 ffffea000054ba00 dead000000000002
[672.770][T736] raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000
[672.771][T736] page dumped because: kasan: bad access detected
[672.778][T736] page_owner tracks the page as allocated
[672.777][T736] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd2040(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 88
[672.779][T736] get_page_from_freelist+0x119c/0x2d50
[672.779][T736] __alloc_pages+0x1cb/0x4a0
[672.776][T736] alloc_pages+0x1aa/0x270
[672.773][T736] allocate_slab+0x260/0x390
[672.771][T736] ___slab_alloc+0xa9a/0x13e0
[672.778][T736] __slab_alloc.constprop.0+0x56/0xb0
[672.771][T736] __kmem_cache_alloc_node+0x136/0x320
[672.789][T736] __kmalloc+0x4e/0x1a0
[672.783][T736] tomoyo_realpath_from_path+0xc3/0x600
[672.781][T736] tomoyo_path_perm+0x22f/0x420
[672.782][T736] tomoyo_path_unlink+0x92/0xd0
[672.780][T736] security_path_unlink+0xdb/0x150
[672.788][T736] do_unlinkat+0x377/0x680
[672.788][T736] __x64_sys_unlink+0xca/0x110
[672.789][T736] do_syscall_64+0x39/0xb0
[672.783][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.784][T736] page last free stack trace:
[672.787][T736] free_pcp_prepare+0x4e5/0x920
[672.787][T736] free_unref_page+0x1d/0x4e0
[672.784][T736] __unfreeze_partials+0x17c/0x1a0
[672.797][T736] qlist_free_all+0x6a/0x180
[672.796][T736] kasan_quarantine_reduce+0x189/0x1d0
[672.797][T736] __kasan_slab_alloc+0x64/0x90
[672.793][T736] kmem_cache_alloc+0x17c/0x3c0
[672.799][T736] getname_flags.part.0+0x50/0x4e0
[672.799][T736] getname_flags+0x9e/0xe0
[672.792][T736] vfs_fstatat+0x77/0xb0
[672.791][T736] __do_sys_newlstat+0x84/0x100
[672.798][T736] do_syscall_64+0x39/0xb0
[672.796][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.790][T736]
[672.791][T736] Memory state around the buggy address:
[672.799][T736] ffff888022ec0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.805][T736] ffff888022ec0180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.802][T736] >ffff888022ec0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.809][T736] ^
[672.809][T736] ffff888022ec0280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.809][T736] ffff888022ec0300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fix this by having the qgroup assign ioctl take the qgroup ioctl mutex
before calling btrfs_run_qgroups(), which is what all qgroup ioctls should
call.
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAFcO6XN3VD8ogmHwqRk4kbiwtpUSNySu2VAxN8waEPciCHJvMA@mail.gmail.com/
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"First batch of fixes. Among them there are two updates to sysfs and
ioctl which are not strictly fixes but are used for testing so there's
no reason to delay them.
- fix block group item corruption after inserting new block group
- fix extent map logging bit not cleared for split maps after
dropping range
- fix calculation of unusable block group space reporting bogus
values due to 32/64b division
- fix unnecessary increment of read error stat on write error
- improve error handling in inode update
- export per-device fsid in DEV_INFO ioctl to distinguish seeding
devices, needed for testing
- allocator size classes:
- fix potential dead lock in size class loading logic
- print sysfs stats for the allocation classes"
* tag 'for-6.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix block group item corruption after inserting new block group
btrfs: fix extent map logging bit not cleared for split maps after dropping range
btrfs: fix percent calculation for bg reclaim message
btrfs: fix unnecessary increment of read error stat on write error
btrfs: handle btrfs_del_item errors in __btrfs_update_delayed_inode
btrfs: ioctl: return device fsid from DEV_INFO ioctl
btrfs: fix potential dead lock in size class loading logic
btrfs: sysfs: add size class stats
|
|
Currently user space utilizes dev info ioctl to grab the info of a
certain devid, this includes its device uuid. But the returned info is
not enough to determine if a device is a seed.
Commit a26d60dedf9a ("btrfs: sysfs: add devinfo/fsid to retrieve actual
fsid from the device") exports the same value in sysfs so this is for
parity with ioctl. Add a new member, fsid, into
btrfs_ioctl_dev_info_args, and populate the member with fsid value.
This should not cause any compatibility problem, following the
combinations:
- Old user space, old kernel
- Old user space, new kernel
User space tool won't even check the new member.
- New user space, old kernel
The kernel won't touch the new member, and user space tool should
zero out its argument, thus the new member is all zero.
User space tool can then know the kernel doesn't support this fsid
reporting, and falls back to whatever they can.
- New user space, new kernel
Go as planned.
Would find the fsid member is no longer zero, and trust its value.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The usual mix of performance improvements and new features.
The core change is reworking how checksums are processed, with
followup cleanups and simplifications. There are two minor changes in
block layer and iomap code.
Features:
- block group allocation class heuristics:
- pack files by size (up to 128k, up to 8M, more) to avoid
fragmentation in block groups, assuming that file size and life
time is correlated, in particular this may help during balance
- with tracepoints and extensible in the future
Performance:
- send: cache directory utimes and only emit the command when
necessary
- speedup up to 10x
- smaller final stream produced (no redundant utimes commands
issued)
- compatibility not affected
- fiemap: skip backref checks for shared leaves
- speedup 3x on sample filesystem with all leaves shared (e.g. on
snapshots)
- micro optimized b-tree key lookup, speedup in metadata operations
(sample benchmark: fs_mark +10% of files/sec)
Core changes:
- change where checksumming is done in the io path:
- checksum and read repair does verification at lower layer
- cascaded cleanups and simplifications
- raid56 refactoring and cleanups
Fixes:
- sysfs: make sure that a run-time change of a feature is correctly
tracked by the feature files
- scrub: better reporting of tree block errors
Other:
- locally enable -Wmaybe-uninitialized after fixing all warnings
- misc cleanups, spelling fixes
Other code:
- block: export bio_split_rw
- iomap: remove IOMAP_F_ZONE_APPEND"
* tag 'for-6.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (109 commits)
btrfs: make kobj_type structures constant
btrfs: remove the bdev argument to btrfs_rmap_block
btrfs: don't rely on unchanging ->bi_bdev for zone append remaps
btrfs: never return true for reads in btrfs_use_zone_append
btrfs: pass a btrfs_bio to btrfs_use_append
btrfs: set bbio->file_offset in alloc_new_bio
btrfs: use file_offset to limit bios size in calc_bio_boundaries
btrfs: do unsigned integer division in the extent buffer binary search loop
btrfs: eliminate extra call when doing binary search on extent buffer
btrfs: raid56: handle endio in scrub_rbio
btrfs: raid56: handle endio in recover_rbio
btrfs: raid56: handle endio in rmw_rbio
btrfs: raid56: submit the read bios from scrub_assemble_read_bios
btrfs: raid56: fold rmw_read_wait_recover into rmw_read_bios
btrfs: raid56: fold recover_assemble_read_bios into recover_rbio
btrfs: raid56: add a bio_list_put helper
btrfs: raid56: wait for I/O completion in submit_read_bios
btrfs: raid56: simplify code flow in rmw_rbio
btrfs: raid56: simplify error handling and code flow in raid56_parity_write
btrfs: replace btrfs_wait_tree_block_writeback by wait_on_extent_buffer_writeback
...
|
|
btrfs_clean_tree_block is a misnomer, it's just
clear_extent_buffer_dirty with some extra accounting around it. Rename
this to btrfs_clear_buffer_dirty to make it more clear it belongs with
it's setter, btrfs_mark_buffer_dirty.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We check the header generation in the extent buffer against the current
running transaction id to see if it's safe to clear DIRTY on this
buffer. Generally speaking if we're clearing the buffer dirty we're
holding the transaction open, but in the case of cleaning up an aborted
transaction we don't, so we have extra checks in that path to check the
transid. To allow for a future cleanup go ahead and pass in the trans
handle so we don't have to rely on ->running_transaction being set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This round there are a lot of cleanups and moved code so the diffstat
looks huge, otherwise there are some nice performance improvements and
an update to raid56 reliability.
User visible features:
- raid56 reliability vs performance trade off:
- fix destructive RMW for raid5 data (raid6 still needs work): do
full checksum verification for all data during RMW cycle, this
should prevent rewriting potentially corrupted data without
notice
- stripes are cached in memory which should reduce the performance
impact but still can hurt some workloads
- checksums are verified after repair again
- this is the last option without introducing additional features
(write intent bitmap, journal, another tree), the extra checksum
read/verification was supposed to be avoided by the original
implementation exactly for performance reasons but that caused
all the reliability problems
- discard=async by default for devices that support it
- implement emergency flush reserve to avoid almost all unnecessary
transaction aborts due to ENOSPC in cases where there are too many
delayed refs or delayed allocation
- skip block group synchronization if there's no change in used
bytes, can reduce transaction commit count for some workloads
Performance improvements:
- fiemap and lseek:
- overall speedup due to skipping unnecessary or duplicate
searches (-40% run time)
- cache some data structures and sharedness of extents (-30% run
time)
- send:
- faster backref resolution when finding clones
- cached leaf to root mapping for faster backref walking
- improved clone/sharing detection
- overall run time improvements (-70%)
Core:
- module initialization converted to a table of function pointers run
in a sequence
- preparation for fscrypt, extend passing file names across calls,
dir item can store encryption status
- raid56 updates:
- more accurate error tracking of sectors within stripe
- simplify recovery path and remove dedicated endio worker kthread
- simplify scrub call paths
- refactoring to support the extra data checksum verification
during RMW cycle
- tree block parentness checks consolidated and done at metadata read
time
- improved error handling
- cleanups:
- move a lot of code for better synchronization between kernel and
user space sources, split big files
- enum cleanups
- GFP flag cleanups
- header file cleanups, prototypes, dependencies
- redundant parameter cleanups
- inline extent handling simplifications
- inode parameter conversion
- data structure cleanups, reductions, renames, merges"
* tag 'for-6.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (249 commits)
btrfs: print transaction aborted messages with an error level
btrfs: sync some cleanups from progs into uapi/btrfs.h
btrfs: do not BUG_ON() on ENOMEM when dropping extent items for a range
btrfs: fix extent map use-after-free when handling missing device in read_one_chunk
btrfs: remove outdated logic from overwrite_item() and add assertion
btrfs: unify overwrite_item() and do_overwrite_item()
btrfs: replace strncpy() with strscpy()
btrfs: fix uninitialized variable in find_first_clear_extent_bit
btrfs: fix uninitialized parent in insert_state
btrfs: add might_sleep() annotations
btrfs: add stack helpers for a few btrfs items
btrfs: add nr_global_roots to the super block definition
btrfs: remove BTRFS_LEAF_DATA_OFFSET
btrfs: add helpers for manipulating leaf items and data
btrfs: add eb to btrfs_node_key_ptr_offset
btrfs: pass the extent buffer for the btrfs_item_nr helpers
btrfs: move the csum helpers into ctree.h
btrfs: move eb offset helpers into extent_io.h
btrfs: move file_extent_item helpers into file-item.h
btrfs: move leaf_data_end into ctree.c
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
|
|
Using strncpy() on NUL-terminated strings are deprecated. To avoid
possible forming of non-terminated string strscpy() should be used.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|