Age | Commit message (Collapse) | Author | Files | Lines |
|
We've been seeing some livelock-ish behavior in the index update part of
the main write path, and while we've got low level btree path
tracepoints, we've been lacking high level btree iterator tracepoints.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a tracepoint for when we insert only part of an extent, due to too
many overwrites.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull smb client fixes from Steve French:
- SMB3.1.1 POSIX extensions fix for char remapping
- Fix for repeated directory listings when directory leases enabled
- deferred close handle reuse fix
* tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: improve directory cache reuse for readdir operations
smb: client: fix perf regression with deferred closes
smb: client: disable path remapping with POSIX extensions
|
|
Previously, ext2_fiemap would unconditionally apply "len = min_t(u64, len,
i_size_read(inode));", When inode->i_size was 0 (for an empty file), this
would reduce the requested len to 0. Passing len = 0 to iomap_fiemap could
then result in an -EINVAL error, even for valid queries on empty files.
Link: https://github.com/linux-test-project/ltp/issues/1246
Signed-off-by: Wei Gao <wegao@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250613152402.3432135-1-wegao@suse.com
|
|
Use ZONEFS_SUPER_SIZE constant instead of PAGE_SIZE allocating memory for
reading the super block in zonefs_read_super().
While PAGE_SIZE technically isn't incorrect as Linux doesn't support pages
smaller than 4k ZONEFS_SUPER_SIZE is semantically more correct.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
|
|
The new code neglects to remove a freshly-allocated RCL from the
callback's referring call list when no matching referring call is
found.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202505171002.cE46sdj5-lkp@intel.com/
Fixes: 4f3c8d8c9e10 ("NFSD: Implement CB_SEQUENCE referring call lists")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Currently, cached directory contents were not reused across subsequent
'ls' operations because the cache validity check relied on comparing
the ctx pointer, which changes with each readdir invocation. As a
result, the cached dir entries was not marked as valid and the cache was
not utilized for subsequent 'ls' operations.
This change uses the file pointer, which remains consistent across all
readdir calls for a given directory instance, to associate and validate
the cache. As a result, cached directory contents can now be
correctly reused, improving performance for repeated directory listings.
Performance gains with local windows SMB server:
Without the patch and default actimeo=1:
1000 directory enumeration operations on dir with 10k files took 135.0s
With this patch and actimeo=0:
1000 directory enumeration operations on dir with 10k files took just 5.1s
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Customer reported that one of their applications started failing to
open files with STATUS_INSUFFICIENT_RESOURCES due to NetApp server
hitting the maximum number of opens to same file that it would allow
for a single client connection.
It turned out the client was failing to reuse open handles with
deferred closes because matching ->f_flags directly without masking
off O_CREAT|O_EXCL|O_TRUNC bits first broke the comparision and then
client ended up with thousands of deferred closes to same file. Those
bits are already satisfied on the original open, so no need to check
them against existing open handles.
Reproducer:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#define NR_THREADS 4
#define NR_ITERATIONS 2500
#define TEST_FILE "/mnt/1/test/dir/foo"
static char buf[64];
static void *worker(void *arg)
{
int i, j;
int fd;
for (i = 0; i < NR_ITERATIONS; i++) {
fd = open(TEST_FILE, O_WRONLY|O_CREAT|O_APPEND, 0666);
for (j = 0; j < 16; j++)
write(fd, buf, sizeof(buf));
close(fd);
}
}
int main(int argc, char *argv[])
{
pthread_t t[NR_THREADS];
int fd;
int i;
fd = open(TEST_FILE, O_WRONLY|O_CREAT|O_TRUNC, 0666);
close(fd);
memset(buf, 'a', sizeof(buf));
for (i = 0; i < NR_THREADS; i++)
pthread_create(&t[i], NULL, worker, NULL);
for (i = 0; i < NR_THREADS; i++)
pthread_join(t[i], NULL);
return 0;
}
Before patch:
$ mount.cifs //srv/share /mnt/1 -o ...
$ mkdir -p /mnt/1/test/dir
$ gcc repro.c && ./a.out
...
number of opens: 1391
After patch:
$ mount.cifs //srv/share /mnt/1 -o ...
$ mkdir -p /mnt/1/test/dir
$ gcc repro.c && ./a.out
...
number of opens: 1
Cc: linux-cifs@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Cc: Jay Shin <jaeshin@redhat.com>
Cc: Pierguido Lambri <plambri@redhat.com>
Fixes: b8ea3b1ff544 ("smb: enable reuse of deferred file handles for write operations")
Acked-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This function takes super_lock in shared mode, so it should release the
same lock.
Cc: stable@vger.kernel.org # v6.16-rc1
Fixes: af7551cf13cf7f ("super: remove pointless s_root checks")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/20250611164044.GF6138@frogsfrogsfrogs
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We want to print the name in case of mkdir failure and now we will
get a cryptic (efault) as name.
Fixes: c54b386969a5 ("VFS: Change vfs_mkdir() to return the dentry.")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/20250612072245.2825938-1-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We currently use multiple CONFIG_UNIX guards. This looks messy and makes
the code harder to follow and maintain. Use a helper function
coredump_sock_connect() that handles the connect portion. This allows us
to remove the CONFIG_UNIX guard in the main do_coredump() function.
Link: https://lore.kernel.org/20250605-schlamm-touren-720ba2b60a85@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Extend the coredump socket to allow the coredump server to tell the
kernel how to process individual coredumps.
When the crashing task connects to the coredump socket the kernel will
send a struct coredump_req to the coredump server. The kernel will set
the size member of struct coredump_req allowing the coredump server how
much data can be read.
The coredump server uses MSG_PEEK to peek the size of struct
coredump_req. If the kernel uses a newer struct coredump_req the
coredump server just reads the size it knows and discard any remaining
bytes in the buffer. If the kernel uses an older struct coredump_req
the coredump server just reads the size the kernel knows.
The returned struct coredump_req will inform the coredump server what
features the kernel supports. The coredump_req->mask member is set to
the currently know features.
The coredump server may only use features whose bits were raised by the
kernel in coredump_req->mask.
In response to a coredump_req from the kernel the coredump server sends
a struct coredump_ack to the kernel. The kernel informs the coredump
server what version of struct coredump_ack it supports by setting struct
coredump_req->size_ack to the size it knows about. The coredump server
may only send as many bytes as coredump_req->size_ack indicates (a
smaller size is fine of course). The coredump server must set
coredump_ack->size accordingly.
The coredump server sets the features it wants to use in struct
coredump_ack->mask. Only bits returned in struct coredump_req->mask may
be used.
In case an invalid struct coredump_ack is sent to the kernel a non-zero
u32 integer is sent indicating the reason for the failure. If it was
successful a zero u32 integer is sent.
In the initial version the following features are supported in
coredump_{req,ack}->mask:
* COREDUMP_KERNEL
The kernel will write the coredump data to the socket.
* COREDUMP_USERSPACE
The kernel will not write coredump data but will indicate to the
parent that a coredump has been generated. This is used when userspace
generates its own coredumps.
* COREDUMP_REJECT
The kernel will skip generating a coredump for this task.
* COREDUMP_WAIT
The kernel will prevent the task from exiting until the coredump
server has shutdown the socket connection.
The flexible coredump socket can be enabled by using the "@@" prefix
instead of the single "@" prefix for the regular coredump socket:
@@/run/systemd/coredump.socket
will enable flexible coredump handling. Current kernels already enforce
that "@" must be followed by "/" and will reject anything else. So
extending this is backward and forward compatible.
Link: https://lore.kernel.org/20250603-work-coredump-socket-protocol-v2-1-05a5f0c18ecc@kernel.org
Acked-by: Lennart Poettering <lennart@poettering.net>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We normally can't create a new directory with the case-insensitive
option already set - except when we're creating a snapshot.
And if casefolding is enabled filesystem wide, we should still set it
even though not strictly required, for consistency.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, we only ever logged the filesystem UUID.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We have to be able to print superblock sections even if they fail to
validate (for debugging), so we have to calculate the number of entries
from the field size.
Reported-by: syzbot+5138f00559ffb3cb3610@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
It seems btree node scan picked up a partially overwritten btree node,
and corrected the "bset version older than sb version_min" error -
resulting in an invalid superblock with a bad version_min field.
Don't run this check at all when we're in btree node scan, and when we
do run it, do something saner if the bset version is totally crazy.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Multiple ioctl handlers individually use a lot of stack space, and clang chooses
to inline them into the bch2_fs_ioctl() function, blowing through the warning
limit:
fs/bcachefs/chardev.c:655:6: error: stack frame size (1032) exceeds limit (1024) in 'bch2_fs_ioctl' [-Werror,-Wframe-larger-than]
655 | long bch2_fs_ioctl(struct bch_fs *c, unsigned cmd, void __user *arg)
By marking the largest two of them as noinline_for_stack, no indidual code path
ends up using this much, which avoids the warning and reduces the possible
total stack usage in the ioctl handler.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
fsck_err() can return a transaction restart if passed a transaction
object - this has always been true when it has to drop locks to prompt
for user input, but we're seeing this more now that we're logging the
error being corrected in the journal.
gc_accounting_done() doesn't call fsck_err() from an actual commit loop,
and it doesn't need to be holding btree locks when it calls fsck_err(),
so the easy fix here for the unhandled transaction restart is to just
not pass it the transaction object. We'll miss out on the fancy new
logging, but that's ok.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix a small leak of the superblock 'clean' section.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
PREEMPT_RT redefines how standard spinlocks work, so local_irq_save() +
spin_lock() is no longer equivalent to spin_lock_irqsave(). Fortunately,
we don't strictly need to do it that way.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix a UAF: we were calling darray_make_room() and retaining a pointer to
the old buffer.
And fix an UBSAN warning: struct bch_sb_field_downgrade_entry uses
__counted_by, so set dst->nr_errors before assigning to the array entry.
Reported-by: syzbot+14c52d86ddbd89bea13e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Object debugging generally needs special provisions for putting said
objects on the stack, which rhashtable does not have.
Reported-by: syzbot+bcc38a9556d0324c2ec2@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If we think we're read-only but the VFS doesn't, fun will ensue.
And now that we know we have to be able to do this safely, just make
nochanges imply ro.
Reported-by: syzbot+a7d6ceaba099cc21dee4@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Link: https://lore.kernel.org/all/6822ab02.050a0220.f2294.00cb.GAE@google.com/T/
Reported-by: syzbot+2c3ef91c9523c3d1a25c@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_btree_lost_data() gets called on btree node read error, but the
error might be transient.
btree_node_scan is expensive, and there's no need to run it persistently
(marking it in the superblock as required to run) - check_topology
will run it if required, via bch2_get_scanned_nodes().
Running it non-persistently is fine, to avoid check_topology having to
rewind recovery to run it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_btree_increase_depth() was originally for disaster recovery, to get
some data back from the journal when a btree root was bad.
We don't need it for that purpose anymore; on bad btree root we'll
launch btree node scan and reconstruct all the interior nodes.
If there's a key in the journal for a depth that doesn't exists, and
it's not from check_topology/btree node scan, we should just ignore it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Invalidate pagecache after we write the new superblock and send a
uevent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
It seems excessive forced btree node rewrites can cause interior btree
updates to become wedged during recovery, before we're using the write
buffer for backpointer updates.
Add more flags so we can determine where these are coming from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We had a deadlock during recovery where interior btree updates became
wedged and all open_buckets were consumed; start adding more
introspection.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Log the specific error being corrected in the journal when we're
repairing, this helps greatly with 'bcachefs list_journal' analysis.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The next patch will add logging of the specific error being corrected in
repair paths to the journal; this means __bch2_fsck_err() can return
transaction restarts in places that previously weren't expecting them.
check_topology() is old code that doesn't use btree iterators for btree
node locking - it'll have to be rewritten in the future to work online.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
just set DCACHE_DONTCACHE for "don't cache" mounts...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
makes simple_lookup() slightly cheaper there - no need for
simple_lookup() to set the flag and we want it on everything
on those anyway.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
No users left and anything that wants it would be better off just
setting DCACHE_DONTCACHE in their ->s_d_flags.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Default ->d_op being simple_dentry_operations is equivalent to leaving
it NULL and putting DCACHE_DONTCACHE into ->s_d_flags.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Do that before new dentry is visible anywhere. It does create
a new possible state for dentries present in ->d_children/->d_sib -
DCACHE_PAR_LOOKUP present, negative, unhashed, not in in-lookup
hash chains, refcount positive. Those are going to be skipped
by all tree-walkers (both d_walk() callbacks in fs/dcache.c and
explicit loops over children/sibling lists elsewhere) and
dput() is fine with those.
NOTE: dropping the final reference to a "normal" in-lookup dentry
(in in-lookup hash) is a bug - somebody must've forgotten to
call d_lookup_done() on it and bad things will happen. With those
it's OK; if/when we get around to making __dentry_kill() complain
about such breakage, remember that predicate to check should
*not* be just d_in_lookup(victim) but rather a combination of that
with !hlist_bl_unhashed(&victim->d_u.d_in_lookup_hash). Might
be worth considering later...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Convert the last user (d_alloc_pseudo()) and be done with that.
Any out-of-tree filesystem using it should switch to d_splice_alias_ops()
or, better yet, check whether it really needs to have ->d_op vary among
its dentries.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
No need to mess with ->d_op at all. Note that ->d_delete that always
returns 1 is equivalent to having DCACHE_DONTCACHE in ->d_flags.
Later the same thing will be placed into ->s_d_flags of the filesystems
where we want that behaviour for all dentries; then the check in
simple_lookup() will at least get unlikely() slapped on it.
NOTE: there are only two filesystems where
* simple_lookup() might be called
* default ->d_op is non-NULL
* its ->d_delete() doesn't always return 1
If not for those, we could have simple_lookup() just set DCACHE_DONTCACHE
without even looking at ->d_op. Filesystems in question are btrfs
and tracefs; both have ->d_delete() returning 1 on anything fed to
simple_lookup(), so both would be fine with simple_lookup() setting
DCACHE_DONTCACHE regardless of ->d_op.
IOW, we might want to drop the check for ->d_op in simple_lookup();
it's definitely a separate story, though.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
If a lookup in tracefs is done on a file that does not exist, it leaves a
dentry hanging around until memory pressure removes it. But eventfs
dentries should hang around as when their ref count goes to zero, it
requires more work to recreate it. For the rest of the tracefs dentries,
they hang around as their dentry is used as a descriptor for the tracing
system. But if a file lookup happens for a file in tracefs that does not
exist, it should be deleted.
Add a .d_delete callback that checks if dentry->fsdata is set or not. Only
eventfs dentries set fsdata so if it has content it should not be deleted
and should hang around in the cache.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
... and store it in ->s_d_flags, to be used by __d_alloc()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
DCACHE_OP_PRUNE in ->d_flags at the time of d_set_d_op() should've
been treated the same as any other DCACHE_OP_... - we forgot to adjust
that WARN_ON() when DCACHE_OP_PRUNE had been introduced...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
rather than locking the directory and using lookup_one(), just use
lookup_one_unlocked(). This keeps locking code centralised.
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/20250608230952.20539-5-neil@brown.name
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The code in coda_readdir() is nearly identical to iterate_dir().
Differences are:
- iterate_dir() is killable
- iterate_dir() adds permission checking and accessing notifications
I believe these are not harmful for coda so it is best to use
iterate_dir() directly. This will allow locking changes without
touching the code in coda.
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/20250608230952.20539-4-neil@brown.name
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|