| Age | Commit message (Collapse) | Author | Files | Lines |
|
This gets the memory sizes from the nodes and stores the limit
as 50% of those. I think eventually we should drop the limits
once we have memcg aware shrinking, but this should be more NUMA
friendly, and I think seems like what people would prefer to
happen on NUMA aware systems.
Cc: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
This enable NUMA awareness for the shrinker on the
ttm pools.
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
The list_lru will now handle numa for us, so no need to keep
separate pool types for it. Just consolidate into the global ones.
This adds a debugfs change to avoid dumping non-existant orders due
to this change.
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
This is an initial port of the TTM pools for
write combined and uncached pages to use the list_lru.
This makes the pool's more NUMA aware and avoids
needing separate NUMA pools (later commit enables this).
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
This uses the newly introduced per-node gpu tracking stats,
to track GPU memory allocated via TTM and reclaimable memory in
the TTM page pools.
These stats will be useful later for system information and
later when mem cgroups are integrated.
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
While discussing memcg intergration with gpu memory allocations,
it was pointed out that there was no numa/system counters for
GPU memory allocations.
With more integrated memory GPU server systems turning up, and
more requirements for memory tracking it seems we should start
closing the gap.
Add two counters to track GPU per-node system memory allocations.
The first is currently allocated to GPU objects, and the second
is for memory that is stored in GPU page pools that can be reclaimed,
by the shrinker.
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
When a CREATE returns STATUS_STOPPED_ON_SYMLINK, smb2_check_message()
returns success without any length validation, leaving the symlink
parsers as the only defense against an untrusted server.
symlink_data() walks SMB 3.1.1 error contexts with the loop test "p <
end", but reads p->ErrorId at offset 4 and p->ErrorDataLength at offset
0. When the server-controlled ErrorDataLength advances p to within 1-7
bytes of end, the next iteration will read past it. When the matching
context is found, sym->SymLinkErrorTag is read at offset 4 from
p->ErrorContextData with no check that the symlink header itself fits.
smb2_parse_symlink_response() then bounds-checks the substitute name
using SMB2_SYMLINK_STRUCT_SIZE as the offset of PathBuffer from
iov_base. That value is computed as sizeof(smb2_err_rsp) +
sizeof(smb2_symlink_err_rsp), which is correct only when
ErrorContextCount == 0.
With at least one error context the symlink data sits 8 bytes deeper,
and each skipped non-matching context shifts it further by 8 +
ALIGN(ErrorDataLength, 8). The check is too short, allowing the
substitute name read to run past iov_len. The out-of-bound heap bytes
are UTF-16-decoded into the symlink target and returned to userspace via
readlink(2).
Fix this all up by making the loops test require the full context header
to fit, rejecting sym if its header runs past end, and bound the
substitute name against the actual position of sym->PathBuffer rather
than a fixed offset.
Because sub_offs and sub_len are 16bits, the pointer math will not
overflow here with the new greater-than.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Shyam Prasad N <sprasad@microsoft.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Bharath SM <bharathsm@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: stable <stable@kernel.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Assisted-by: gregkh_clanker_t1000
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The bounds check uses (u8 *)ea + nlen + 1 + vlen as the end of the EA
name and value, but ea_data sits at offset sizeof(struct
smb2_file_full_ea_info) = 8 from ea, not at offset 0. The strncmp()
later reads ea->ea_data[0..nlen-1] and the value bytes follow at
ea_data[nlen+1..nlen+vlen], so the actual end is ea->ea_data + nlen + 1
+ vlen. Isn't pointer math fun?
The earlier check (u8 *)ea > end - sizeof(*ea) only guarantees the
8-byte header is in bounds, but since the last EA is placed within 8
bytes of the end of the response, the name and value bytes are read past
the end of iov.
Fix this mess all up by using ea->ea_data as the base for the bounds
check.
An "untrusted" server can use this to leak up to 8 bytes of kernel heap
into the EA name comparison and influence which WSL xattr the data is
interpreted as.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Shyam Prasad N <sprasad@microsoft.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Bharath SM <bharathsm@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: stable <stable@kernel.org>
Assisted-by: gregkh_clanker_t1000
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When flushing out outstanding glock work during an unmount, gfs2_log_flush()
can be called when sdp->sd_jdesc has already been deallocated and sdp->sd_jdesc
is NULL. Commit 35264909e9d1 ("gfs2: Fix NULL pointer dereference in
gfs2_log_flush") added a check for that to gfs2_log_flush() itself, but it
missed the sdp->sd_jdesc dereference in gfs2_log_release(). Fix that.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/r/202604071139.HNJiCaAi-lkp@intel.com/
Fixes: 35264909e9d1 ("gfs2: Fix NULL pointer dereference in gfs2_log_flush")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In gfs2_evict_inode(), don't issue error messages once a withdraw has already
occurred.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
During an unmount, wait for potential withdraw to complete before calling
gfs2_make_fs_ro(). This will allow gfs2_make_fs_ro() to skip much of its work.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In gfs2_dinode_in(), only allow directories to have the GFS2_DIF_EXHASH
flag set. This will prevent other parts of the code from treating
regular inodes as directories based on the presence of that flag.
In sweep_bh_for_rgrps() and __gfs2_free_blocks(), check if the
GFS2_DIF_EXHASH flag is set instead of checking if i_depth is non-zero.
This matches what the directory code does. (The i_depth checks were
introduced in commit 6d3117b412951 ("GFS2: Wipe directory hash table
metadata when deallocating a directory").)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
When a withdraw occurs in gfs2_log_flush() and we are left with an unsubmitted
bio, fail that bio. Otherwise, the bh's in that bio will remain locked and
gfs2_evict_inode() -> truncate_inode_pages() -> gfs2_invalidate_folio() ->
gfs2_discard() will hang trying to discard the locked bh's.
In addition, when gfs2_log_flush() fails to submit a new transaction, unpin the
buffers in the failing transaction like gfs2_remove_from_journal() does. If
any of the bd's are on the ail2 list, leave them there and do_withdraw() ->
gfs2_withdraw_glocks() -> inode_go_inval() -> truncate_inode_pages() ->
gfs2_invalidate_folio() -> gfs2_discard() will remove them. They will be freed
in gfs2_release_folio().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Function gfs2_logd() calls the log flushing functions gfs2_ail1_start(),
gfs2_ail1_wait(), and gfs2_ail1_empty() without holding sdp->sd_log_flush_lock,
but these functions require exclusion against concurrent transactions.
To fix that, add a non-locking __gfs2_log_flush() function. Then, in
gfs2_logd(), take sdp->sd_log_flush_lock before calling the above mentioned log
flushing functions and __gfs2_log_flush().
Fixes: 5e4c7632aae1c ("gfs2: Issue revokes more intelligently")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
When a withdrawn filesystem's inodes are being evicted, the address spaces of
those inodes still need to be truncated but we can no longer start new
transactions. We still don't want gfs2_invalidate_folio() to race with
gfs2_log_flush(), so take a read lock on sdp->sd_log_flush_lock in that case.
(It may not be obvious, but gfs2_invalidate_folio() is a jdata-only address
space operation.)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
When a withdraw is carried out, call gfs2_ail_drain() under the
sdp->sd_log_flush_lock. This isn't strictly necessary but should be easier to
read, and more robust against possible future bugs.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
We only need to convert to picosecond units before writing to RING_IDLEDLY.
Fixes: 7c53ff050ba8 ("drm/xe: Apply Wa_16023105232")
Cc: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com>
Acked-by: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com>
Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Link: https://patch.msgid.link/20260401012710.4165547-1-vinay.belgaumkar@intel.com
(cherry picked from commit 13743bd628bc9d9a0e2fe53488b2891aedf7cc74)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
This adds the following commits from upstream:
53373d135579 dtc: Remove unused dts_version in dtc-lexer.l
caf7465c5d60 libfdt: fdt_check_full: Handle FDT_NOP when FDT_END is expected
5976c4a66098 libfdt: fdt_rw: Introduce fdt_downgrade_version()
5bb5bedd347d fdtdump: Return an error code on wrong tag value
68b960e299f7 fdtdump: Remove dtb version check
adba02caf554 dtc: Use a consistent type for basenamelen
8d15a63e84ff libfdt: Verify alignment of sub-blocks in dtb
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
|
|
The dynamic EPP feature uses power_supply_reg_notifier() and
power_supply_unreg_notifier() but doesn't declare a dependency on
POWER_SUPPLY, causing linker errors when POWER_SUPPLY is not enabled.
Add POWER_SUPPLY to the selects.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Fixes: e30ca6dd5345 ("cpufreq/amd-pstate: Add dynamic energy performance preference")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604040742.ySEdkuAa-lkp@intel.com/
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/20260407194949.310114-1-mario.limonciello@amd.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Clang recently added -fdiagnostics-show-inlining-chain [1] to improve
the visibility of inlining chains in diagnostics. This is particularly
useful for CONFIG_FORTIFY_SOURCE where detections can happen deep in
inlined functions.
Add this flag to KBUILD_CFLAGS under a cc-option so it is enabled if the
compiler supports it. Note that GCC does not have an equivalent flag as
it supports a similar diagnostic structure unconditionally.
Link: https://github.com/llvm/llvm-project/pull/174892 [1]
Link: https://github.com/ClangBuiltLinux/linux/issues/1571
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20260330-kbuild-show-inlining-v2-1-c0c481a4ea7b@google.com
Signed-off-by: Nicolas Schier <nsc@kernel.org>
|
|
After having removed the last usage of no_pci_devices(), this function
can be removed.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/b0ce592d-c34c-4e0b-b389-4e346b3a0c44@gmail.com
|
|
Palm Top PC 110 is a handheld personal computer with 80486SX CPU that was
released exclusively in Japan in September 1995.
While the kernel still supports 486 CPU it is highly unlikely that anyone
is using this device with the latest kernel.
Remove the driver.
[bhelgaas: since this was posted, "x86/cpu: Remove M486/M486SX/ELAN
support" has been queued for v7.1, so pc110pad is no longer relevant:
https://lore.kernel.org/all/20251214084710.3606385-2-mingo@kernel.org/]
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20240808172733.1194442-4-dmitry.torokhov@gmail.com
|
|
RCU Tasks Trace grace period implies RCU grace period, and this
guarantee is expected to remain in the future. Only BPF is the user of
this predicate, hence retire the API and clean up all in-tree users.
RCU Tasks Trace is now implemented on SRCU-fast and its grace period
mechanism always has at least one call to synchronize_rcu() as it is
required for SRCU-fast's correctness (it replaces the smp_mb() that
SRCU-fast readers skip). So, RCU-tt GP will always imply RCU GP.
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260407162234.785270-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
For tests that carry a __description tag, allow matching on both the
description string and program name for convenience. Before this commit,
the description string must be spelt out to filter the tests.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260407145606.3991770-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In all cases in which a struct acpi_driver is used for binding a driver
to an ACPI device object, a corresponding platform device is created by
the ACPI core and that device is regarded as a proper representation of
underlying hardware. Accordingly, a struct platform_driver should be
used by driver code to bind to that device. There are multiple reasons
why drivers should not bind directly to ACPI device objects [1].
In particular, registering a watchdog device under a struct acpi_device
is questionable because it causes the watchdog to be hidden in the ACPI
bus sysfs hierarchy and it goes against the general rule that a struct
acpi_device can only be a parent of another struct acpi_device.
Overall, it is better to bind drivers to platform devices than to their
ACPI companions, so convert the ni903x_wdt watchdog ACPI driver to a
platform one.
While this is not expected to alter functionality, it changes sysfs
layout and so it will be visible to user space.
Note that after this change it actually makes sense to look for
the "timeout-sec" property via device_property_read_u32() under the
device passed to watchdog_init_timeout() because it has an fwnode
handle (unlike a struct acpi_device which is an fwnode itself).
Link: https://lore.kernel.org/all/2396510.ElGaqSPkdT@rafael.j.wysocki/ [1]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Link: https://patch.msgid.link/13996583.uLZWGnKmhe@rafael.j.wysocki
|
|
In all cases in which a struct acpi_driver is used for binding a driver
to an ACPI device object, a corresponding platform device is created by
the ACPI core and that device is regarded as a proper representation of
underlying hardware. Accordingly, a struct platform_driver should be
used by driver code to bind to that device. There are multiple reasons
why drivers should not bind directly to ACPI device objects [1].
Overall, it is better to bind drivers to platform devices than to their
ACPI companions, so convert the Xen ACPI processor aggregator device
(PAD) driver to a platform one.
While this is not expected to alter functionality, it changes sysfs
layout and so it will be visible to user space.
Link: https://lore.kernel.org/all/2396510.ElGaqSPkdT@rafael.j.wysocki/ [1]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/8683270.T7Z3S40VBb@rafael.j.wysocki
|
|
Using the stricter "./tools/docs/kernel-doc -Wall -v" to verify proper
formatting of documentation comments includes warnings related to return
markup on functions that are omitted during the default verification
checks. This stricter verification reports a couple of missing return
descriptions in resctrl:
Warning: .../fs/resctrl/rdtgroup.c:1536 No description found for return value of 'rdtgroup_cbm_to_size'
Warning: .../fs/resctrl/rdtgroup.c:3131 No description found for return value of 'mon_get_kn_priv'
Warning: .../fs/resctrl/rdtgroup.c:3523 No description found for return value of 'cbm_ensure_valid'
Warning: .../fs/resctrl/monitor.c:238 No description found for return value of 'resctrl_find_cleanest_closid'
Add the missing return descriptions.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/1c50b9f7c73251c007133590986f127e1af57780.1775576382.git.reinette.chatre@intel.com
|
|
The x86 maintainers handle the resctrl filesystem and x86 architectural
resctrl code. Even so, the x86 maintainers are not part of the resctrl
section and not returned when scripts/get_maintainer.pl is run on resctrl
filesystem code. With patches flowing via x86 maintainers resctrl should
also ensure it follows the tip rules.
Add the x86 maintainer alias, x86@kernel.org, to the resctrl section to
ensure x86 maintainers are included in associated resctrl submissions.
Add a reference to the tip tree handbook to make it clear which rules
resctrl follows.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/4c14dd82e81737c6413e10fe097475b1cc0886fc.1775576382.git.reinette.chatre@intel.com
|
|
The correct kfunc name is scx_bpf_dsq_move_to_local(), not
scx_bpf_move_to_local(). Fix the two references in the
Scheduling Cycle section.
Signed-off-by: fangqiurong <fangqiurong@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
use NR_STD_WORKER_POOLS for irq_work_fns[] array definition.
NR_STD_WORKER_POOLS is also 2, but better to use MACRO.
Initialization loop for_each_bh_worker_pool() also uses same MACRO.
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
PCI drivers can use a device-tree overlay to describe the hardware
available on the PCI board. This is the case, for instance, of the
LAN966x PCI device driver.
Adding some more nodes in the device-tree overlay adds some more
consumer/supplier relationship between devices instantiated from this
overlay.
Those fw_node consumer/supplier relationships are handled by fw_devlink
and are created based on the device-tree parsing done by the
of_fwnode_add_links() function.
Those consumer/supplier links are needed in order to ensure a correct PM
runtime management and a correct removal order between devices.
For instance, without those links a supplier can be removed before its
consumers is removed leading to all kind of issue if this consumer still
want the use the already removed supplier.
The support for the usage of an overlay from a PCI driver has been added
on x86 systems in commit 1f340724419ed ("PCI: of: Create device tree PCI
host bridge node").
In the past, support for fw_devlink on x86 had been tried but this
support has been removed in commit 4a48b66b3f52 ("of: property: Disable
fw_devlink DT support for X86"). Indeed, this support was breaking some
x86 systems such as OLPC system and the regression was reported in [0].
Instead of disabling this support for all x86 system, use a finer grain
and disable this support only for the possible problematic subset of x86
systems (at least OLPC and CE4100).
Those systems use a device-tree to describe their hardware. Identify
those systems using key properties in the device-tree.
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Link: https://lore.kernel.org/lkml/3c1f2473-92ad-bfc4-258e-a5a08ad73dd0@web.de/ [0]
Link: https://patch.msgid.link/20260325143555.451852-18-herve.codina@bootlin.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
|
|
As far as I can tell, we never intentionally constrained ourselves to
these status codes, and it is misleading and surprising to lack the
bdev error logging when we get a different error code from the block
layer. This can lead to jumping to a wrong conclusion like "this
system didn't see any bio failures but aborted with EIO".
For example on nvme devices, I observe many failures coming back as
BLK_STS_MEDIUM. It is apparent that the nvme driver returns a variety of
BLK_STS_* status values in nvme_error_status().
So handle the known expected errors and make some noise on the rest
which we expect won't really happen.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
can_finish_ordered_extent() and btrfs_finish_ordered_zoned() set
BTRFS_ORDERED_IOERR via bare set_bit(). Later,
btrfs_mark_ordered_extent_error() in btrfs_finish_one_ordered() uses
test_and_set_bit(), finds it already set, and skips
mapping_set_error(). The error is never recorded on the inode's
address_space, making it invisible to fsync. For encoded writes this
causes btrfs receive to silently produce files with zero-filled holes.
Fix: replace bare set_bit(BTRFS_ORDERED_IOERR) with
btrfs_mark_ordered_extent_error() which pairs test_and_set_bit() with
mapping_set_error(), guaranteeing the error is recorded exactly once.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: Michal Grzedzicki <mge@meta.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In btrfs_finish_one_ordered(), clear_bits is unconditionally initialized
with EXTENT_DEFRAG. For NOCOW ordered extents this is always a no-op
because should_nocow() already forces the COW path when EXTENT_DEFRAG is
set, so a NOCOW ordered extent can never have EXTENT_DEFRAG on its range.
Although harmless, the unconditional btrfs_clear_extent_bit() call still
performs a cold rbtree lookup under the io tree spinlock on every NOCOW
write completion. Avoid this by only adding EXTENT_DEFRAG to clear_bits
for non-NOCOW ordered extents, and skip the call entirely when there are
no bits to clear.
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The UUID tree rescan check in open_ctree() compares
fs_info->generation with the superblock's uuid_tree_generation.
This comparison is not reliable because fs_info->generation is
bumped at transaction start time in join_transaction(), while
uuid_tree_generation is only updated at commit time via
update_super_roots().
Between the early BTRFS_FS_UPDATE_UUID_TREE_GEN flag check and the
late rescan decision, mount operations such as file orphan cleanup
from an unclean shutdown start transactions without committing
them. This advances fs_info->generation past uuid_tree_generation
and produces a false-positive mismatch.
Use the BTRFS_FS_UPDATE_UUID_TREE_GEN flag directly instead. The
flag was already set earlier in open_ctree() when the generations
were known to match, and accurately represents "UUID tree is up to
date" without being affected by subsequent transaction starts.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we get an error during the transaction commit path, we are resetting
current->journal_info to NULL twice - once in btrfs_commit_transaction()
right before calling cleanup_transaction() and then once again inside
cleanup_transaction(). Remove the instance in btrfs_commit_transaction().
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Having the filesystem in an error state, meaning we had a transaction
abort, is unexpected. Mark every check for the error state with the
unlikely annotation to convey that and to allow the compiler to generate
better code.
On x86_64, using gcc 14.2.0-19 from Debian, resulted in a slightly
reduced object size and better code.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
2008598 175912 15592 2200102 219226 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
2008450 175912 15592 2199954 219192 fs/btrfs/btrfs.ko
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Niklas Cassel:
- Add a quirk for JMicron JMB582/JMB585 AHCI controllers such that
they only use 32-bit DMA addresses.
While these controllers do report that they support 64-bit DMA
addresses, a user reports that using 64-bit DMA addresses cause
silent corruption even on modern x86 systems (Arthur)
* tag 'ata-7.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: ahci: force 32-bit DMA for JMicron JMB582/JMB585
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull Hyper-V fixes from Wei Liu:
- Two fixes for Hyper-V PCI driver (Long Li, Sahil Chandna)
- Fix an infinite loop issue in MSHV driver (Stanislav Kinsburskii)
* tag 'hyperv-fixes-signed-20260406' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
mshv: Fix infinite fault loop on permission-denied GPA intercepts
PCI: hv: Fix double ida_free in hv_pci_probe error path
PCI: hv: Set default NUMA node to 0 for devices without affinity info
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Eight hotfixes. All are cc:stable and seven are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-04-06-15-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
ocfs2: fix out-of-bounds write in ocfs2_write_end_inline
mm/damon/stat: deallocate damon_call() failure leaking damon_ctx
mm/vma: fix memory leak in __mmap_region()
mm/memory_hotplug: maintain N_NORMAL_MEMORY during hotplug
mm/damon/sysfs: dealloc repeat_call_control if damon_call() fails
mm: reinstate unconditional writeback start in balance_dirty_pages()
liveupdate: propagate file deserialization failures
mm: filemap: fix nr_pages calculation overflow in filemap_map_pages()
|
|
Fixed PCM device numbering is required for acp pdm dmic pcm device
to have a common UCM changes.
Set the 'use_dai_pcm_id' flag true in acp pdm dma driver for acp 6.3
platform. This will fix the pcm device numbering based on dai_link->id.
Fixes: 33cea6bbe488 ("ASoC: amd: add acp6.2 pdm platform driver")
Signed-off-by: Syed Saba Kareem <Syed.SabaKareem@amd.com>
Fixes: tag.
Link: https://patch.msgid.link/20260403100624.676953-1-syed.sabakareem@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
In alarmtimer_suspend(), timerqueue_getnext() is called under
base->lock, but next->expires is read after the lock is released.
This is safe because suspend freezes all relevant task contexts,
but reading the node while holding the lock makes the code easier
to reason about and not worry about a theoretical UAF.
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260407143627.19405-1-zhanxusheng@xiaomi.com
|
|
Qualcomm Hawi SoC include apps smmu that implements arm,mmu-500, which
is used to translate device-visible virtual addresses to physical
addresses. Add compatible for these items.
Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When kobject_init_and_add() fails, the call chain is:
create_space_info()
-> btrfs_sysfs_add_space_info_type()
-> kobject_init_and_add()
-> failure
-> kobject_put(&space_info->kobj)
-> space_info_release()
-> kfree(space_info)
Then control returns to create_space_info():
btrfs_sysfs_add_space_info_type() returns error
-> goto out_free
-> kfree(space_info)
This causes a double free.
Keep the direct kfree(space_info) for the earlier failure path, but
after btrfs_sysfs_add_space_info_type() has called kobject_put(), let
the kobject release callback handle the cleanup.
Fixes: a11224a016d6d ("btrfs: fix memory leaks in create_space_info() error paths")
CC: stable@vger.kernel.org # 6.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When kobject_init_and_add() fails, the call chain is:
create_space_info_sub_group()
-> btrfs_sysfs_add_space_info_type()
-> kobject_init_and_add()
-> failure
-> kobject_put(&sub_group->kobj)
-> space_info_release()
-> kfree(sub_group)
Then control returns to create_space_info_sub_group(), where:
btrfs_sysfs_add_space_info_type() returns error
-> kfree(sub_group)
Thus, sub_group is freed twice.
Keep parent->sub_group[index] = NULL for the failure path, but after
btrfs_sysfs_add_space_info_type() has called kobject_put(), let the
kobject release callback handle the cleanup.
Fixes: f92ee31e031c ("btrfs: introduce btrfs_space_info sub-group")
CC: stable@vger.kernel.org # 6.18+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a bug report that a btrfs with running dev-replace got rejected
with the following messages:
BTRFS error (device sdk1): devid 0 path /dev/sdk1 is registered but not found in chunk tree
BTRFS error (device sdk1): remove the above devices or use 'btrfs device scan --forget <dev>' to unregister them before mount
BTRFS error (device sdk1): open_ctree failed: -117
[CAUSE]
The tree and super block dumps show the fs is completely sane, except
one thing, there is no dev item for devid 0 in chunk tree.
However this is not a bug, as we do not insert dev item for devid 0 in
the first place.
Since the devid 0 is only there temporarily we do not really need to
insert a dev item for it and then later remove it again.
It is the commit 34308187395f ("btrfs: add extra device item checks at
mount") adding a overly strict check that triggers a false alert and
rejected the valid filesystem.
[FIX]
Add a special handling for devid 0, and doesn't require devid 0 to
have a device item in chunk tree.
Reported-by: Jaron Viëtor <jaron@vietors.com>
Link: https://lore.kernel.org/linux-btrfs/CAF1bhLVYLZvD=j2XyuxXDKD-NWNJAwDnpVN+UYeQW-HbzNRn1A@mail.gmail.com/
Fixes: 34308187395f ("btrfs: add extra device item checks at mount")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In close_ctree(), we call invalidate_inode_pages2() to invalidate all
pages from btree inode.
But the problem is, it never returns 0, but always -EBUSY.
The problem is that we are still holding all the essential tree root
nodes, thus pages holding those tree blocks can not be invalidated thus
invalidate_inode_pages2() always returns -EBUSY.
This is also against the error cleanup path of open_ctree(), which
properly frees all root pointers before calling invalidate_inode_pages().
So fix the order by delaying invalidate_inode_pages2() until we have
freed all root pointers.
Reviewed-by: Anand Jain <asj@kernel.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Under memory pressure, direct reclaim can kick in during compressed
readahead. This puts the associated task into D-state. Then shrink_lruvec()
disables interrupts when acquiring the LRU lock. Under heavy pressure,
we've observed reclaim can run long enough that the CPU becomes prone to
CSD lock stalls since it cannot service incoming IPIs. Although the CSD
lock stalls are the worst case scenario, we have found many more subtle
occurrences of this latency on the order of seconds, over a minute in some
cases.
Prevent direct reclaim during compressed readahead. This is achieved by
using different GFP flags at key points when the bio is marked for
readahead.
There are two functions that allocate during compressed readahead:
btrfs_alloc_compr_folio() and add_ra_bio_pages(). Both currently use
GFP_NOFS which includes __GFP_DIRECT_RECLAIM.
For the internal API call btrfs_alloc_compr_folio(), the signature changes
to accept an additional gfp_t parameter. At the readahead call site, it
gets flags similar to GFP_NOFS but stripped of __GFP_DIRECT_RECLAIM.
__GFP_NOWARN is added since these allocations are allowed to fail. Demand
reads still use full GFP_NOFS and will enter reclaim if needed. All other
existing call sites of btrfs_alloc_compr_folio() now explicitly pass
GFP_NOFS to retain their current behavior.
add_ra_bio_pages() gains a bool parameter which allows callers to specify
if they want to allow direct reclaim or not. In either case, the
__GFP_NOWARN flag was added unconditionally since the allocations are
speculative.
There has been some previous work done on calling add_ra_bio_pages() [0].
This patch is complementary: where that patch reduces call frequency, this
patch reduces the latency associated with those calls.
[0] https://lore.kernel.org/linux-btrfs/656838ec1232314a2657716e59f4f15a8eadba64.1751492111.git.boris@bur.io/
Reviewed-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In cache_save_setup(), if create_free_space_inode() succeeds but the
subsequent lookup_free_space_inode() still fails on retry, the
BUG_ON(retries) will crash the kernel. This can happen due to I/O
errors or transient failures, not just programming bugs.
Replace the BUG_ON with proper error handling that returns the original
error code through the existing cleanup path. The callers already handle
this gracefully: disk_cache_state defaults to BTRFS_DC_ERROR, so the
space cache simply won't be written for that block group.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The sectorsize is used once or at most twice in the callbacks, no need
to cache it on stack. Minor effect on zstd_compress_folios() where it
saves 8 bytes of stack.
Signed-off-by: David Sterba <dsterba@suse.com>
|