| Age | Commit message (Collapse) | Author | Files | Lines |
|
commit 0ea84089dbf62a92dc7889c79e6b18fc89260808 upstream.
When a non-NCQ command is issued while NCQ commands are being executed,
ata_scsi_qc_issue() indicates to the SCSI layer that the command issuing
should be deferred by returning SCSI_MLQUEUE_XXX_BUSY. This command
deferring is correct and as mandated by the ACS specifications since
NCQ and non-NCQ commands cannot be mixed.
However, in the case of a host adapter using multiple submission queues,
when the target device is under a constant load of NCQ commands, there
are no guarantees that requeueing the non-NCQ command will be executed
later and it may be deferred again repeatedly as other submission queues
can constantly issue NCQ commands from different CPUs ahead of the
non-NCQ command. This can lead to very long delays for the execution of
non-NCQ commands, and even complete starvation for these commands in the
worst case scenario.
Since the block layer and the SCSI layer do not distinguish between
queueable (NCQ) and non queueable (non-NCQ) commands, libata-scsi SAT
implementation must ensure forward progress for non-NCQ commands in the
presence of NCQ command traffic. This is similar to what SAS HBAs with a
hardware/firmware based SAT implementation do.
Implement such forward progress guarantee by limiting requeueing of
non-NCQ commands from ata_scsi_qc_issue(): when a non-NCQ command is
received and NCQ commands are in-flight, do not force a requeue of the
non-NCQ command by returning SCSI_MLQUEUE_XXX_BUSY and instead return 0
to indicate that the command was accepted but hold on to the qc using
the new deferred_qc field of struct ata_port.
This deferred qc will be issued using the work item deferred_qc_work
running the function ata_scsi_deferred_qc_work() once all in-flight
commands complete, which is checked with the port qc_defer() callback
return value indicating that no further delay is necessary. This check
is done using the helper function ata_scsi_schedule_deferred_qc() which
is called from ata_scsi_qc_complete(). This thus excludes this mechanism
from all internal non-NCQ commands issued by ATA EH.
When a port deferred_qc is non NULL, that is, the port has a command
waiting for the device queue to drain, the issuing of all incoming
commands (both NCQ and non-NCQ) is deferred using the regular busy
mechanism. This simplifies the code and also avoids potential denial of
service problems if a user issues too many non-NCQ commands.
Finally, whenever ata EH is scheduled, regardless of the reason, a
deferred qc is always requeued so that it can be retried once EH
completes. This is done by calling the function
ata_scsi_requeue_deferred_qc() from ata_eh_set_pending(). This avoids
the need for any special processing for the deferred qc in case of NCQ
error, link or device reset, or device timeout.
Reported-by: Xingui Yang <yangxingui@huawei.com>
Reported-by: Igor Pylypiv <ipylypiv@google.com>
Fixes: bdb01301f3ea ("scsi: Add host and host template flag 'host_tagset'")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Tested-by: Igor Pylypiv <ipylypiv@google.com>
Tested-by: Xingui Yang <yangxingui@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 47bf2e813817159f4d195be83a9b5a640ee6baec ]
Driver is using num_vhca_ports capability to distinguish between
multiport master device and multiport slave device. num_vhca_ports is a
capability the driver sets according to the MAX num_vhca_ports
capability reported by FW. On the other hand, light SFs doesn't set the
above capbility.
This leads to wrong results whenever light SFs is checking whether he is
a multiport master or slave.
Therefore, use the MAX capability to distinguish between master and
slave devices.
Fixes: e71383fb9cd1 ("net/mlx5: Light probe local SFs")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <Jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260218072904.1764634-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7537bae8b6eb635583e0e6260f61d13ddbd52087 ]
The RAPL MSR read path incorrectly validates CPU context when called
from the PMU subsystem:
if (atomic) {
if (unlikely(smp_processor_id() != cpu))
return -EIO;
rdmsrq(ra->reg.msr, ra->value);
}
This check fails for package-scoped MSRs like RAPL energy counters,
which are readable from any CPU within the package.
The perf tool avoids hitting this check by validating against
/sys/bus/event_source/devices/power/cpumask before opening events.
However, turbostat does not perform this validation and may attempt
reads from non-lead CPUs, causing the check to fail and return zero
power values.
Since package-scoped MSRs are architecturally accessible from any CPU
in the package, remove the CPU matching check.
Also rename 'atomic' to 'pmu_ctx' to clarify this indicates PMU context
where rdmsrq() can be used directly instead of rdmsrl_safe_on_cpu().
Fixes: 748d6ba43afd ("powercap: intel_rapl: Enable MSR-based RAPL PMU support")
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Tested-by: Furquim Ulisses <ulisses.furquim@intel.com>
Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://patch.msgid.link/20260209234310.1440722-2-sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f4b830a5371914239756b0599e5dc9d4c328e387 ]
It is possible to put the KTD2801 chip in an unknown/undefined state by
changing the brightness very rapidly (for example, with a brightness
slider). When this happens, the brightness is stuck on max and cannot be
changed until the chip is power cycled.
Fix this by disabling interrupts while talking to the chip. While at it,
make expresswire_power_off() use fsleep() and also unexport some
functions meant to be internal.
Fixes: 1368d06dd2c9 ("leds: Introduce ExpressWire library")
Tested-by: Karel Balej <balejk@matfyz.cz>
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
Link: https://patch.msgid.link/20251217-expresswire-fix-v2-1-4a02b10acd96@dujemihanovic.xyz
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b3c78bc53630d14a5770451ede3a30e7052f3b8b ]
pidfs and nsfs recently gained support for encode/decode of file handles
via name_to_handle_at(2)/open_by_handle_at(2).
These special kernel filesystems have custom ->open() and ->permission()
export methods, which nfsd does not respect and it was never meant to be
used for exporting those filesystems by nfsd.
Therefore, do not allow nfsd to export filesystems with custom ->open()
or ->permission() methods.
Fixes: b3caba8f7a34a ("pidfs: implement file handle support")
Fixes: 5222470b2fbb3 ("nsfs: support file handles")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/20260129100212.49727-3-amir73il@gmail.com
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f47c1b77d0a2a9c0d49ec14302e74f933398d1a3 ]
The clk_save_context() and clk_restore_context() helpers are only
implemented by the Common Clock Framework. They are not available when
using legacy clock frameworks. Dummy implementations are provided, but
only if no clock support is available at all.
Hence when CONFIG_HAVE_CLK=y, but CONFIG_COMMON_CLK is not enabled:
m68k-linux-gnu-ld: drivers/net/phy/air_en8811h.o: in function `en8811h_resume':
air_en8811h.c:(.text+0x83e): undefined reference to `clk_restore_context'
m68k-linux-gnu-ld: drivers/net/phy/air_en8811h.o: in function `en8811h_suspend':
air_en8811h.c:(.text+0x856): undefined reference to `clk_save_context'
Fix this by moving forward declarations and dummy implementions from the
HAVE_CLK to the COMMON_CLK section.
Fixes: 8b95d1ce3300c411 ("clk: Add functions to save/restore clock context en-masse")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511301553.eaEz1nEW-lkp@intel.com/
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f8a6e5eac701369afb5d69aba875dc5fec93003d ]
In commit 3bdbd0858df6 ("Input: adp5589: remove the driver") the last user
of include/linux/input/adp5589.h was removed along with the whole driver,
thus the header file can be also removed.
Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Nuno Sá <nuno.sa@analog.com>
Fixes: 3bdbd0858df6 ("Input: adp5589: remove the driver")
Link: https://patch.msgid.link/20260113151140.3843753-1-vz@mleia.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a57b1f07d2d35843a7ada30c8cf9a215c0931868 ]
The @data buffer is 5 bytes, not 4, it has been extended for the need of
devices with an extra ID bytes.
Fixes: 34a956739d29 ("mtd: spinand: Add support for 5-byte IDs")
Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 59b7bb3d48333889adb1dd2aac3ab0cf26714390 ]
Add a new quirk ATA_QUIRK_MAX_SEC, which has a separate table with device
specific values.
Convert all existing ATA_QUIRK_MAX_SEC_XXX device quirks in
__ata_dev_quirks to the new format.
Quirks ATA_QUIRK_MAX_SEC_128 and ATA_QUIRK_MAX_SEC_1024 cannot be removed
yet, since they are also used by libata.force, which functionally, is a
separate user of the quirks. The quirks will be removed once all users
have been converted to use the new format.
The quirk ATA_QUIRK_MAX_SEC_8191 can be removed since it has no equivalent
libata.force parameter.
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Stable-dep-of: 5f64ae1ef639 ("ata: libata-core: Quirk INTEL SSDSC2KG480G8 max_sectors")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7b85137caf110a09a4a18f00f730de4709f9afc8 ]
The hibernate resume sequence involves loading a resume kernel that is just
used for loading the hibernate image before shifting back to the existing
kernel.
During that hibernate resume sequence the resume kernel may have loaded
the ccp driver. If this happens the resume kernel will also have called
PSP_CMD_TEE_RING_INIT but it will never have called
PSP_CMD_TEE_RING_DESTROY.
This is problematic because the existing kernel needs to re-initialize the
ring. One could argue that the existing kernel should call destroy
as part of restore() but there is no guarantee that the resume kernel did
or didn't load the ccp driver. There is also no callback opportunity for
the resume kernel to destroy before handing back control to the existing
kernel.
Similar problems could potentially exist with the use of kdump and
crash handling. I actually reproduced this issue like this:
1) rmmod ccp
2) hibernate the system
3) resume the system
4) modprobe ccp
The resume kernel will have loaded ccp but never destroyed and then when
I try to modprobe it fails.
Because of these possible cases add a flow that checks the error code from
the PSP_CMD_TEE_RING_INIT call and tries to call PSP_CMD_TEE_RING_DESTROY
if it failed. If this succeeds then call PSP_CMD_TEE_RING_INIT again.
Fixes: f892a21f51162 ("crypto: ccp - use generic power management")
Reported-by: Lars Francke <lars.francke@gmail.com>
Closes: https://lore.kernel.org/platform-driver-x86/CAD-Ua_gfJnQSo8ucS_7ZwzuhoBRJ14zXP7s8b-zX3ZcxcyWePw@mail.gmail.com/
Tested-by: Yijun Shen <Yijun.Shen@Dell.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://patch.msgid.link/20260116041132.153674-6-superm1@kernel.org
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1e13f27e0675552161ab1778be9a23a636dde8a7 ]
nft_counter_reset() calls u64_stats_add() with a negative value to reset
the counter. This will work on 64bit archs, hence the negative value
added will wrap as a 64bit value which then can wrap the stat counter as
well.
On 32bit archs, the added negative value will wrap as a 32bit value and
_not_ wrapping the stat counter properly. In most cases, this would just
lead to a very large 32bit value being added to the stat counter.
Fix by introducing u64_stats_sub().
Fixes: 4a1d3acd6ea8 ("netfilter: nft_counter: Use u64_stats_t for statistic.")
Signed-off-by: Anders Grahn <anders.grahn@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 31fb95400451040050361e22ff480476964280f0 ]
Some endpoint platforms have only a small number of usable BARs. At the
same time, EPF drivers (e.g. vNTB) may need multiple independent inbound
regions (control/scratchpad, one or more memory windows, and optionally
MSI or other feature-related regions). Subrange mapping allows these to
share a single BAR without consuming additional BARs that may not be
available, or forcing a fragile layout by aggressively packing into a
single contiguous memory range.
Extend the PCI endpoint core to support mapping subranges within a BAR.
Add an optional 'submap' field in struct pci_epf_bar so an endpoint
function driver can request inbound mappings that fully cover the BAR.
Introduce a new EPC feature bit, subrange_mapping, and reject submap
requests from pci_epc_set_bar() unless the controller advertises both
subrange_mapping and dynamic_inbound_mapping features.
The submap array describes the complete BAR layout (no overlaps and no
gaps are allowed to avoid exposing untranslated address ranges). This
provides the generic infrastructure needed to map multiple logical
regions into a single BAR at different offsets, without assuming a
controller-specific inbound address translation mechanism.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Link: https://patch.msgid.link/20260124145012.2794108-3-den@valinux.co.jp
Stable-dep-of: 72cb5ed2a5c6 ("PCI: dwc: ep: Add per-PF BAR and inbound ATU mapping support")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 06a81c5940e46cc7bddee28f16bdd29a12a76344 ]
Introduce a new EPC feature bit (dynamic_inbound_mapping) that indicates
whether an Endpoint Controller can update the inbound address
translation for a BAR without requiring the EPF driver to clear/reset
the BAR first.
Endpoint Function drivers (e.g. vNTB) can use this information to decide
whether it really is safe to call pci_epc_set_bar() multiple times to
update inbound mappings for the BAR.
Suggested-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260124145012.2794108-2-den@valinux.co.jp
Stable-dep-of: 72cb5ed2a5c6 ("PCI: dwc: ep: Add per-PF BAR and inbound ATU mapping support")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8924336531e21b187d724b5fdf5277269c9ec22c ]
The IPC sysctls implement the ctl_table_root::permissions hook and they
override the file access mode based on the CAP_CHECKPOINT_RESTORE
capability, which is being checked regardless of whether any access is
actually denied or not, so if an LSM denies the capability, an audit
record may be logged even when access is in fact granted.
It wouldn't be viable to restructure the sysctl permission logic to only
check the capability when the access would be actually denied if it's not
granted. Thus, do the same as in net_ctl_permissions() (net/sysctl_net.c)
- switch from ns_capable() to ns_capable_noaudit(), so that the check
never emits an audit record.
Link: https://lkml.kernel.org/r/20260122141303.241133-1-omosnace@redhat.com
Fixes: 0889f44e2810 ("ipc: Check permissions for checkpoint_restart sysctls at open time")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 27b0fcae8f535fb882b1876227a935dcfdf576aa ]
The xdrgen decoders for strings and opaque data had an
optimization that skipped calling xdr_inline_decode() when the
item length was zero. This left the data pointer uninitialized,
which could lead to unpredictable behavior when callers access
it.
Remove the zero-length check and always call xdr_inline_decode().
When passed a length of zero, xdr_inline_decode() returns the
current buffer position, which is valid and matches the behavior
of hand-coded XDR decoders throughout the kernel.
Fixes: 4b132aacb076 ("tools: Add xdrgen")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e8a1e7eaa19d0b757b06a2f913e3eeb4b1c002c6 ]
__sprint_symbol() might access an invalid pointer when
kallsyms_lookup_buildid() returns a symbol found by
ftrace_mod_address_lookup().
The ftrace lookup function must set both @modname and @modbuildid the same
way as module_address_lookup().
Link: https://lkml.kernel.org/r/20251128135920.217303-7-pmladek@suse.com
Fixes: 9294523e3768 ("module: add printk formats to add module build ID to stacktraces")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit acfdbb4ab2910ff6f03becb569c23ac7b2223913 ]
Add a helper function for reading the optional "build_id" member of struct
module. It is going to be used also in ftrace_mod_address_lookup().
Use "#ifdef" instead of "#if IS_ENABLED()" to match the declaration of the
optional field in struct module.
Link: https://lkml.kernel.org/r/20251128135920.217303-4-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: e8a1e7eaa19d ("kallsyms/ftrace: set module buildid in ftrace_mod_address_lookup()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cd6735896d0343942cf3dafb48ce32eb79341990 ]
bpf_address_lookup() has been used only in kallsyms_lookup_buildid(). It
was supposed to set @modname and @modbuildid when the symbol was in a
module.
But it always just cleared @modname because BPF symbols were never in a
module. And it did not clear @modbuildid because the pointer was not
passed.
The wrapper is no longer needed. Both @modname and @modbuildid are now
always initialized to NULL in kallsyms_lookup_buildid().
Remove the wrapper and rename __bpf_address_lookup() to
bpf_address_lookup() because this variant is used everywhere.
[akpm@linux-foundation.org: fix loongarch]
Link: https://lkml.kernel.org/r/20251128135920.217303-6-pmladek@suse.com
Fixes: 9294523e3768 ("module: add printk formats to add module build ID to stacktraces")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b07829d546c83134629591f02c5348d57cea0c1e ]
The printf() format attributes are applied inconsistently for the binary
printf helpers, which causes warnings for the bpf_trace code using
them from functions that pass down format strings:
kernel/trace/bpf_trace.c: In function '____bpf_trace_printk':
kernel/trace/bpf_trace.c:377:9: error: function '____bpf_trace_printk' might be a candidate for 'gnu_printf' format attribute [-Werror=suggest-attribute=format]
377 | ret = bstr_printf(data.buf, MAX_BPRINTF_BUF, fmt, data.bin_args);
| ^~~
This can be addressed either by annotating all five callers in bpf code,
or by removing the annotations on the callees that were added by Andy
Shevchenko last year.
As Alexei Starovoitov points out, there are no callers in C code that
would benefit from the __printf attributes, the only users are in BPF
code or in the do_trace_printk() helper that already checks the arguments.
Drop all three of these annotations, reverting the earlierl commits that
added these, in order to get a clean build with -Wsuggest-attribute=format.
Fixes: 6b2c1e30ad68 ("seq_file: Mark binary printing functions with __printf() attribute")
Fixes: 7bf819aa992f ("vsnprintf: Mark binary printing functions with __printf() attribute")
Link: https://lore.kernel.org/all/CAADnVQK3eZp3yp35OUx8j1UBsQFhgsn5-4VReqAJ=68PaaKYmg@mail.gmail.com/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512061640.9hKTnB8p-lkp@intel.com/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Petr Mladek <pmladek@suse.com>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20260204132643.1302967-1-arnd@kernel.org
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ccef4b2703ff5b0de0b1bda30a0de3026d52eb19 ]
The header has a function which calls pr_err(). Don't require users of
the header to include <linux/printk.h> and include it here.
Fixes: 87cfc79dcd60 ("drm/msm/a6xx: Resolve the meaning of UBWC_MODE")
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Reviewed-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org>
Link: https://lore.kernel.org/r/20260110-iris-ubwc-v1-1-dd70494dcd7b@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cc2f39d6ac48e6e3cb2d6240bc0d6df839dd0828 ]
Currently, hwrng_fill is not cleared until the hwrng_fillfn() thread
exits. Since hwrng_unregister() reads hwrng_fill outside the rng_mutex
lock, a concurrent hwrng_unregister() may call kthread_stop() again on
the same task.
Additionally, if hwrng_unregister() is called immediately after
hwrng_register(), the stopped thread may have never been executed. Thus,
hwrng_fill remains dirty even after hwrng_unregister() returns. In this
case, subsequent calls to hwrng_register() will fail to start new
threads, and hwrng_unregister() will call kthread_stop() on the same
freed task. In both cases, a use-after-free occurs:
refcount_t: addition on 0; use-after-free.
WARNING: ... at lib/refcount.c:25 refcount_warn_saturate+0xec/0x1c0
Call Trace:
kthread_stop+0x181/0x360
hwrng_unregister+0x288/0x380
virtrng_remove+0xe3/0x200
This patch fixes the race by protecting the global hwrng_fill pointer
inside the rng_mutex lock, so that hwrng_fillfn() thread is stopped only
once, and calls to kthread_run() and kthread_stop() are serialized
with the lock held.
To avoid deadlock in hwrng_fillfn() while being stopped with the lock
held, we convert current_rng to RCU, so that get_current_rng() can read
current_rng without holding the lock. To remove the lock from put_rng(),
we also delay the actual cleanup into a work_struct.
Since get_current_rng() no longer returns ERR_PTR values, the IS_ERR()
checks are removed from its callers.
With hwrng_fill protected by the rng_mutex lock, hwrng_fillfn() can no
longer clear hwrng_fill itself. Therefore, if hwrng_fillfn() returns
directly after current_rng is dropped, kthread_stop() would be called on
a freed task_struct later. To fix this, hwrng_fillfn() calls schedule()
now to keep the task alive until being stopped. The kthread_stop() call
is also moved from hwrng_unregister() to drop_current_rng(), ensuring
kthread_stop() is called on all possible paths where current_rng becomes
NULL, so that the thread would not wait forever.
Fixes: be4000bc4644 ("hwrng: create filler thread")
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Lianjie Wang <karin0.zst@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 553b4999cbe231b5011cb8db05a3092dec168aca ]
Using a threaded interrupt without a dedicated primary handler mandates
the IRQF_ONESHOT flag to mask the interrupt source while the threaded
handler is active. Otherwise the interrupt can fire again before the
threaded handler had a chance to run.
Mark explained that this should not happen with this hardware since it
is a slow irqchip which is behind an I2C/ SPI bus but the IRQ-core will
refuse to accept such a handler.
Set IRQF_ONESHOT so the interrupt source is masked until the secondary
handler is done.
Fixes: 1c6c69525b40e ("genirq: Reject bogus threaded irq requests")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20260128095540.863589-16-bigeasy@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 943b052ded21feb84f293d40b06af3181cd0d0d7 ]
The flag IRQF_COND_ONESHOT was already force-added to request_irq() because
the ACPI SCI interrupt handler is using the IRQF_ONESHOT flag which breaks
all shared handlers.
devm_request_irq() needs the same change since some users, such as
int0002_vgpio, are using this function instead.
Add IRQF_COND_ONESHOT to the flags passed to devm_request_irq().
Fixes: c37927a203fa2 ("genirq: Set IRQF_COND_ONESHOT in request_irq()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260128095540.863589-2-bigeasy@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4be42c92220128b3128854a2d6b0f0ad0bcedbdb ]
At the moment the we allow the jmp attach only for ftrace_ops that
has FTRACE_OPS_FL_JMP set. This conflicts with following changes
where we use single ftrace_ops object for all direct call sites,
so all could be be attached via just call or jmp.
We already limit the jmp attach support with config option and bit
(LSB) set on the trampoline address. It turns out that's actually
enough to limit the jmp attach for architecture and only for chosen
addresses (with LSB bit set).
Each user of register_ftrace_direct or modify_ftrace_direct can set
the trampoline bit (LSB) to indicate it has to be attached by jmp.
The bpf trampoline generation code uses trampoline flags to generate
jmp-attach specific code and ftrace inner code uses the trampoline
bit (LSB) to handle return from jmp attachment, so there's no harm
to remove the FTRACE_OPS_FL_JMP bit.
The fexit/fmodret performance stays the same (did not drop),
current code:
fentry : 77.904 ± 0.546M/s
fexit : 62.430 ± 0.554M/s
fmodret : 66.503 ± 0.902M/s
with this change:
fentry : 80.472 ± 0.061M/s
fexit : 63.995 ± 0.127M/s
fmodret : 67.362 ± 0.175M/s
Fixes: 25e4e3565d45 ("ftrace: Introduce FTRACE_OPS_FL_JMP")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-2-jolsa@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f88a31308db6a856229150039b0f56d59696ed31 ]
Eliminate all kernel-doc warnings in seqlock.h:
- correct the macro to have "()" immediately following the macro name
- don't include the macro parameters in the short description (first line)
- make the parameter names in the comments match the actual macro
parameter names.
- use "::" for the Example
WARNING: include/linux/seqlock.h:1341 This comment starts with '/**', but isn't a kernel-doc comment.
* scoped_seqlock_read (lock, ss_state) - execute the read side critical
Documentation/locking/seqlock:242: include/linux/seqlock.h:1351: WARNING:
Definition list ends without a blank line; unexpected unindent. [docutils]
Warning: include/linux/seqlock.h:1357 function parameter '_seqlock' not described in 'scoped_seqlock_read'
Warning: include/linux/seqlock.h:1357 function parameter '_target' not described in 'scoped_seqlock_read'
Fixes: cc39f3872c08 ("seqlock: Introduce scoped_seqlock_read()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260123183749.3997533-1-rdunlap@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ae23bc81ddf7c17b663c4ed1b21e35527b0a7131 ]
This commit fixes a security issue where BPF_PROG_DETACH on tcx or
netkit devices could be executed by any user when no program fd was
provided, bypassing permission checks. The fix adds a capability
check for CAP_NET_ADMIN or CAP_SYS_ADMIN in this case.
Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
Signed-off-by: Guillaume Gonnet <ggonnet.linux@gmail.com>
Link: https://lore.kernel.org/r/20260127160200.10395-1-ggonnet.linux@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 929e30f9312514902133c45e51c79088421ab084 ]
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.
Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
calculate FIONREAD is not enough.
This patch adds a new msg_tot_len field in the psock structure to record
the data length in ingress_msg. Additionally, we implement new ioctl
interfaces for TCP and UDP to intercept FIONREAD operations.
Note that we intentionally do not include sk_receive_queue data in the
FIONREAD result. Data in sk_receive_queue has not yet been processed by
the BPF verdict program, and may be redirected to other sockets or
dropped. Including it would create semantic ambiguity since this data
may never be readable by the user.
Unix and VSOCK sockets have similar issues, but fixing them is outside
the scope of this patch as it would require more intrusive changes.
Previous work by John Fastabend made some efforts towards FIONREAD support:
commit e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
Although the current patch is based on the previous work by John Fastabend,
it is acceptable for our Fixes tag to point to the same commit.
FD1:read()
-- FD1->copied_seq++
| [read data]
|
[enqueue data] v
[sockmap] -> ingress to self -> ingress_msg queue
FD1 native stack ------> ^
-- FD1->rcv_nxt++ -> redirect to other | [enqueue data]
| |
| ingress to FD1
v ^
... | [sockmap]
FD2 native stack
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20260124113314.113584-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b40cc5adaa80e1471095a62d78233b611d7a558c ]
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.
The issue is that when reading from ingress_msg, we update tp->copied_seq
by default. However, if the data is not from its own protocol stack,
tcp->rcv_nxt is not increased. Later, if we convert this socket to a
native socket, reading from this socket may fail because copied_seq might
be significantly larger than rcv_nxt.
This fix also addresses the syzkaller-reported bug referenced in the
Closes tag.
This patch marks the skmsg objects in ingress_msg. When reading, we update
copied_seq only if the data is from its own protocol stack.
FD1:read()
-- FD1->copied_seq++
| [read data]
|
[enqueue data] v
[sockmap] -> ingress to self -> ingress_msg queue
FD1 native stack ------> ^
-- FD1->rcv_nxt++ -> redirect to other | [enqueue data]
| |
| ingress to FD1
v ^
... | [sockmap]
FD2 native stack
Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260124113314.113584-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 72f3bbebff15e87171271d643ee2672fb8e92031 ]
Consolidate the creation and start of qp into the function
hisi_qm_alloc_qps_node. This change eliminates the need for
each module to perform these steps in two separate phases
(creation and start).
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stable-dep-of: 6aff4d977e2d ("crypto: hisilicon/hpre - support the hpre algorithm fallback")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8cd9b608ee8dea78cac3f373bd5e3b3de2755d46 ]
When a single queue used by multiple tfms, the protection of shared
resources by individual module driver programs is no longer
sufficient. The hisi_qp_send needs to be ensured by the lock in qp.
Fixes: 5fdb4b345cfb ("crypto: hisilicon - add a lock for the qp send operation")
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 21452eaa06edb5f6038720e643aed0bbfffad9c3 ]
Originally, when a queue was requested, it could only be configured
with the default algorithm type of 0. Now, when multiple tfms use
the same queue, the queue must be selected based on its attributes
to meet the requirements of tfm tasks. So the algorithm type
attribute of queue need to be distinguished. Just like a queue used
for compression in ZIP cannot be used for decompression tasks.
Fixes: 3f1ec97aacf1 ("crypto: hisilicon/qm - Put device finding logic into QM")
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
for callback
[ Upstream commit 08eb67d23e5172a5d1e60f1f0acccee569fe10ba ]
When multiple tfm use a same qp, the backlog data should be managed
centrally by the qp, rather than in the qp_ctx of each req.
Additionally, since SEC_BD_TYPE1 and SEC_BD_TYPE2 cannot use the
tag of the sqe to carry the virtual address of the req, the sent
sqe is stored in the qp. This allows the callback function to get
the req address. To handle the differences between hardware types,
the callback functions are split into two separate implementations.
Fixes: f0ae287c5045 ("crypto: hisilicon/sec2 - implement full backlog mode for sec")
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 07f3c3a1cd56c2048a92dad0c11f15e4ac3888c1 ]
A previous commit got rid of any use of this member, but forgot to
remove it. Kill it.
Fixes: f4bb2f65bb81 ("io_uring/eventfd: move ctx->evfd_last_cq_tail into io_ev_fd")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6784f274722559c0cdaaa418bc8b7b1d61c314f9 ]
commit 4ef4ac360101 ("device_cgroup: avoid access to ->i_rdev in the
common case in devcgroup_inode_permission()") reordered the checks in
devcgroup_inode_permission() to check the inode mode before checking
i_rdev, for better cache behavior.
However, the likely() annotation on the i_rdev check was not updated
to reflect the new code flow. Originally, when i_rdev was checked
first, likely(!inode->i_rdev) made sense because most inodes were(?)
regular files/directories, thus i_rdev == 0.
After the reorder, by the time we reach the i_rdev check, we have
already confirmed the inode IS a block or character device. Block and
character special files are precisely defined by having a device number
(i_rdev), so !inode->i_rdev is now the rare edge case, not the common
case.
Branch profiling confirmed this is 100% mispredicted:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
0 2631904 100 devcgroup_inode_permission device_cgroup.h 24
Remove likely() to avoid giving the wrong hint to the CPU.
Fixes: 4ef4ac360101 ("device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission()")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260107-likely_device-v1-1-0c55f83a7e47@debian.org
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 76489955c6d4a065ca69dc88faf7a50a59b66f35 ]
The comapt_xxx_class symbols aren't declared in anything that
lib/comapt_audit.c is including (arm64 build) which is causing
the following sparse warnings:
lib/compat_audit.c:7:10: warning: symbol 'compat_dir_class'
was not declared. Should it be static?
lib/compat_audit.c:12:10: warning: symbol 'compat_read_class'
was not declared. Should it be static?
lib/compat_audit.c:17:10: warning: symbol 'compat_write_class'
was not declared. Should it be static?
lib/compat_audit.c:22:10: warning: symbol 'compat_chattr_class'
was not declared. Should it be static?
lib/compat_audit.c:27:10: warning: symbol 'compat_signal_class'
was not declared. Should it be static?
Trying to fix this by chaning compat_audit.c to inclde <linux/audit.h>
does not work on arm64 due to compile errors with the extra includes
that changing this header makes. The simpler thing would be just to
move the definitons of these symbols out of <linux/audit.h> into
<linux/audit_arch.h> which is included.
Fixes: 4b58841149dca ("audit: Add generic compat syscall support")
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
[PM: rewrite subject line, fixed line length in description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit e48e16f3e37fac76e2f0c14c58df2b0398a323b0 upstream.
Currently, F2FS requires the packed_ssa feature to be enabled when
utilizing non-4KB block sizes (e.g., 16KB). This restriction limits
the flexibility of filesystem formatting options.
This patch allows F2FS to support non-4KB block sizes even when the
packed_ssa feature is disabled. It adjusts the SSA calculation logic to
correctly handle summary entries in larger blocks without the packed
layout.
Cc: stable@kernel.org
Fixes: 7ee8bc3942f2 ("f2fs: revert summary entry count from 2048 to 512 in 16kb block support")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Miscellaneous MMCID fixes to address bugs and performance regressions
in the recent rewrite of the SCHED_MM_CID management code:
- Fix livelock triggered by BPF CI testing
- Fix hard lockup on weakly ordered systems
- Simplify the dropping of CIDs in the exit path by removing an
unintended transition phase
- Fix performance/scalability regression on a thread-pool benchmark
by optimizing transitional CIDs when scheduling out"
* tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/mmcid: Optimize transitional CIDs when scheduling out
sched/mmcid: Drop per CPU CID immediately when switching to per task mode
sched/mmcid: Protect transition on weakly ordered systems
sched/mmcid: Prevent live lock on task to CPU mode transition
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar::
- Bump up the Clang minimum version requirements for livepatch
builds, due to Clang assembler section handling bugs causing
silent miscompilations
- Strip livepatching symbol artifacts from non-livepatch modules
- Fix livepatch build warnings when certain Clang LTO options
are enabled
- Fix livepatch build error when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
* tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool/klp: Fix unexported static call key access for manually built livepatch modules
objtool/klp: Fix symbol correlation for orphaned local symbols
livepatch: Free klp_{object,func}_ext data after initialization
livepatch: Fix having __klp_objects relics in non-livepatch modules
livepatch/klp-build: Require Clang assembler >= 20
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"A couple of late-breaking MM fixes. One against a new-in-this-cycle
patch and the other addresses a locking issue which has been there for
over a year"
* tag 'mm-hotfixes-stable-2026-02-06-12-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/memory-failure: reject unsupported non-folio compound page
procfs: avoid fetching build ID while holding VMA lock
|
|
Pull ceph fixes from Ilya Dryomov:
"One RBD and two CephFS fixes which address potential oopses.
The RBD thing is more of a rare edge case that pops up in our CI,
while the two CephFS scenarios are regressions that were reported by
users and can be triggered trivially in normal operation. All marked
for stable"
* tag 'ceph-for-6.19-rc9' of https://github.com/ceph/ceph-client:
ceph: fix NULL pointer dereference in ceph_mds_auth_match()
ceph: fix oops due to invalid pointer for kfree() in parse_longname()
rbd: check for EOD after exclusive lock is ensured to be held
|
|
Fix PROCMAP_QUERY to fetch optional build ID only after dropping mmap_lock
or per-VMA lock, whichever was used to lock VMA under question, to avoid
deadlock reported by syzbot:
-> #1 (&mm->mmap_lock){++++}-{4:4}:
__might_fault+0xed/0x170
_copy_to_iter+0x118/0x1720
copy_page_to_iter+0x12d/0x1e0
filemap_read+0x720/0x10a0
blkdev_read_iter+0x2b5/0x4e0
vfs_read+0x7f4/0xae0
ksys_read+0x12a/0x250
do_syscall_64+0xcb/0xf80
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (&sb->s_type->i_mutex_key#8){++++}-{4:4}:
__lock_acquire+0x1509/0x26d0
lock_acquire+0x185/0x340
down_read+0x98/0x490
blkdev_read_iter+0x2a7/0x4e0
__kernel_read+0x39a/0xa90
freader_fetch+0x1d5/0xa80
__build_id_parse.isra.0+0xea/0x6a0
do_procmap_query+0xd75/0x1050
procfs_procmap_ioctl+0x7a/0xb0
__x64_sys_ioctl+0x18e/0x210
do_syscall_64+0xcb/0xf80
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
rlock(&mm->mmap_lock);
lock(&sb->s_type->i_mutex_key#8);
lock(&mm->mmap_lock);
rlock(&sb->s_type->i_mutex_key#8);
*** DEADLOCK ***
This seems to be exacerbated (as we haven't seen these syzbot reports
before that) by the recent:
777a8560fd29 ("lib/buildid: use __kernel_read() for sleepable context")
To make this safe, we need to grab file refcount while VMA is still locked, but
other than that everything is pretty straightforward. Internal build_id_parse()
API assumes VMA is passed, but it only needs the underlying file reference, so
just add another variant build_id_parse_file() that expects file passed
directly.
[akpm@linux-foundation.org: fix up kerneldoc]
Link: https://lkml.kernel.org/r/20260129215340.3742283-1-andrii@kernel.org
Fixes: ed5d583a88a9 ("fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reported-by: <syzbot+4e70c8e0a2017b432f7a@syzkaller.appspotmail.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless and Netfilter.
Previous releases - regressions:
- eth: stmmac: fix stm32 (and potentially others) resume regression
- nf_tables: fix inverted genmask check in nft_map_catchall_activate()
- usb: r8152: fix resume reset deadlock
- fix reporting RXH_XFRM_NO_CHANGE as input_xfrm for RSS contexts
Previous releases - always broken:
- sched: cls_u32: use skb_header_pointer_careful() to avoid OOB reads
with malicious u32 rules
- eth: ice: timestamping related fixes"
* tag 'net-6.19-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
ipv6: Fix ECMP sibling count mismatch when clearing RTF_ADDRCONF
netfilter: nf_tables: fix inverted genmask check in nft_map_catchall_activate()
net: cpsw: Execute ndo_set_rx_mode callback in a work queue
net: cpsw_new: Execute ndo_set_rx_mode callback in a work queue
gve: Correct ethtool rx_dropped calculation
gve: Fix stats report corruption on queue count change
selftest: net: add a test-case for encap segmentation after GRO
net: gro: fix outer network offset
net: add proper RCU protection to /proc/net/ptype
net: ethernet: adi: adin1110: Check return value of devm_gpiod_get_optional() in adin1110_check_spi()
wifi: iwlwifi: mvm: pause TCM on fast resume
wifi: iwlwifi: mld: cancel mlo_scan_start_wk
net: spacemit: k1-emac: fix jumbo frame support
net: enetc: Convert 16-bit register reads to 32-bit for ENETC v4
net: enetc: Convert 16-bit register writes to 32-bit for ENETC v4
net: enetc: Remove CBDR cacheability AXI settings for ENETC v4
net: enetc: Remove SI/BDR cacheability AXI settings for ENETC v4
tipc: use kfree_sensitive() for session key material
net: stmmac: fix stm32 (and potentially others) resume regression
net: rss: fix reporting RXH_XFRM_NO_CHANGE as input_xfrm for contexts
...
|
|
The linker script scripts/module.lds.S specifies that all input
__klp_objects sections should be consolidated into an output section of
the same name, and start/stop symbols should be created to enable
scripts/livepatch/init.c to locate this data.
This start/stop pattern is not ideal for modules because the symbols are
created even if no __klp_objects input sections are present.
Consequently, a dummy __klp_objects section also appears in the
resulting module. This unnecessarily pollutes non-livepatch modules.
Instead, since modules are relocatable files, the usual method for
locating consolidated data in a module is to read its section table.
This approach avoids the aforementioned problem.
The klp_modinfo already stores a copy of the entire section table with
the final addresses. Introduce a helper function that
scripts/livepatch/init.c can call to obtain the location of the
__klp_objects section from this data.
Fixes: dd590d4d57eb ("objtool/klp: Introduce klp diff subcommand for diffing object files")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260123102825.3521961-2-petr.pavlu@suse.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm
Pull TSM (TEE security Manager) fixes from Dan Williams:
"The largest change is reverting part of an ABI that never shipped in a
released kernel (Documentation/ABI/testing/sysfs-class-tsm). The fix /
replacement for that is too large to squeeze in at this late date.
The rest is a collection of small fixups:
- Fix multiple streams per host bridge for SEV-TIO
- Drop the TSM ABI for reporting IDE streams (to be replaced)
- Fix virtual function enumeration
- Fix reserved stream ID initialization
- Fix unused variable compiler warning"
* tag 'tsm-fixes-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm:
crypto/ccp: Allow multiple streams on the same root bridge
crypto/ccp: Use PCI bridge defaults for IDE
coco/tsm: Remove unused variable tsm_rwsem
PCI/IDE: Fix reading a wrong reg for unused sel stream initialization
PCI/IDE: Fix off by one error calculating VF RID range
Revert "PCI/TSM: Report active IDE streams"
|
|
The CephFS kernel client has regression starting from 6.18-rc1.
We have issue in ceph_mds_auth_match() if fs_name == NULL:
const char fs_name = mdsc->fsc->mount_options->mds_namespace;
...
if (auth->match.fs_name && strcmp(auth->match.fs_name, fs_name)) {
/ fsname mismatch, try next one */
return 0;
}
Patrick Donnelly suggested that: In summary, we should definitely start
decoding `fs_name` from the MDSMap and do strict authorizations checks
against it. Note that the `-o mds_namespace=foo` should only be used for
selecting the file system to mount and nothing else. It's possible
no mds_namespace is specified but the kernel will mount the only
file system that exists which may have name "foo".
This patch reworks ceph_mdsmap_decode() and namespace_equals() with
the goal of supporting the suggested concept. Now struct ceph_mdsmap
contains m_fs_name field that receives copy of extracted FS name
by ceph_extract_encoded_string(). For the case of "old" CephFS file
systems, it is used "cephfs" name.
[ idryomov: replace redundant %*pE with %s in ceph_mdsmap_decode(),
get rid of a series of strlen() calls in ceph_namespace_match(),
drop changes to namespace_equals() body to avoid treating empty
mds_namespace as equal, drop changes to ceph_mdsc_handle_fsmap()
as namespace_equals() isn't an equivalent substitution there ]
Cc: stable@vger.kernel.org
Fixes: 22c73d52a6d0 ("ceph: fix multifs mds auth caps issue")
Link: https://tracker.ceph.com/issues/73886
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Tested-by: Patrick Donnelly <pdonnell@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Shrikanth reported a hard lockup which he observed once. The stack trace
shows the following CID related participants:
watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
NIP: mm_get_cid+0xe8/0x188
LR: mm_get_cid+0x108/0x188
mm_cid_switch_to+0x3c4/0x52c
__schedule+0x47c/0x700
schedule_idle+0x3c/0x64
do_idle+0x160/0x1b0
cpu_startup_entry+0x48/0x50
start_secondary+0x284/0x288
start_secondary_prolog+0x10/0x14
watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
NIP: plpar_hcall_norets_notrace+0x18/0x2c
LR: queued_spin_lock_slowpath+0xd88/0x15d0
_raw_spin_lock+0x80/0xa0
raw_spin_rq_lock_nested+0x3c/0xf8
mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
sched_mm_cid_exit+0x108/0x22c
do_exit+0xf4/0x5d0
make_task_dead+0x0/0x178
system_call_exception+0x128/0x390
system_call_vectored_common+0x15c/0x2ec
The task on CPU11 is running the CID ownership mode change fixup function
and is stuck on a runqueue lock. The task on CPU23 is trying to get a CID
from the pool with the same runqueue lock held, but the pool is empty.
After decoding a similar issue in the opposite direction switching from per
task to per CPU mode the tool which models the possible scenarios failed to
come up with a similar loop hole.
This showed up only once, was not reproducible and according to tooling not
related to a overlooked scheduling scenario permutation. But the fact that
it was observed on a PowerPC system gave the right hint: PowerPC is a
weakly ordered architecture.
The transition mechanism does:
WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
WRITE_ONCE(mm->mm_cid.percpu, new_mode);
fixup()
WRITE_ONCE(mm->mm_cid.transit, 0);
mm_cid_schedin() does:
if (!READ_ONCE(mm->mm_cid.percpu))
...
cid |= READ_ONCE(mm->mm_cid.transit);
so weakly ordered systems can observe percpu == false and transit == 0 even
if the fixup function has not yet completed. As a consequence the task will
not drop the CID when scheduling out before the fixup is completed, which
means the CID space can be exhausted and the next task scheduling in will
loop in mm_get_cid() and the fixup thread can livelock on the held runqueue
lock as above.
This could obviously be solved by using:
smp_store_release(&mm->mm_cid.percpu, true);
and
smp_load_acquire(&mm->mm_cid.percpu);
but that brings a memory barrier back into the scheduler hotpath, which was
just designed out by the CID rewrite.
That can be completely avoided by combining the per CPU mode and the
transit storage into a single mm_cid::mode member and ordering the stores
against the fixup functions to prevent the CPU from reordering them.
That makes the update of both states atomic and a concurrent read observes
always consistent state.
The price is an additional AND operation in mm_cid_schedin() to evaluate
the per CPU or the per task path, but that's in the noise even on strongly
ordered architectures as the actual load can be significantly more
expensive and the conditional branch evaluation is there anyway.
Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/bdfea828-4585-40e8-8835-247c6a8a76b0@linux.ibm.com
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.965217106@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events fix from Ingo Molnar:
"Fix a race in the user-callchains code"
* tag 'perf-urgent-2026-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: sched: Fix perf crash with new is_user_task() helper
|
|
In order to do a user space stacktrace the current task needs to be a user
task that has executed in user space. It use to be possible to test if a
task is a user task or not by simply checking the task_struct mm field. If
it was non NULL, it was a user task and if not it was a kernel task.
But things have changed over time, and some kernel tasks now have their
own mm field.
An idea was made to instead test PF_KTHREAD and two functions were used to
wrap this check in case it became more complex to test if a task was a
user task or not[1]. But this was rejected and the C code simply checked
the PF_KTHREAD directly.
It was later found that not all kernel threads set PF_KTHREAD. The io-uring
helpers instead set PF_USER_WORKER and this needed to be added as well.
But checking the flags is still not enough. There's a very small window
when a task exits that it frees its mm field and it is set back to NULL.
If perf were to trigger at this moment, the flags test would say its a
user space task but when perf would read the mm field it would crash with
at NULL pointer dereference.
Now there are flags that can be used to test if a task is exiting, but
they are set in areas that perf may still want to profile the user space
task (to see where it exited). The only real test is to check both the
flags and the mm field.
Instead of making this modification in every location, create a new
is_user_task() helper function that does all the tests needed to know if
it is safe to read the user space memory or not.
[1] https://lore.kernel.org/all/20250425204120.639530125@goodmis.org/
Fixes: 90942f9fac05 ("perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL")
Closes: https://lore.kernel.org/all/0d877e6f-41a7-4724-875d-0b0a27b8a545@roeck-us.net/
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260129102821.46484722@gandalf.local.home
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
- important fix for ARM 32-bit based systems using cma= kernel
parameter (Oreoluwa Babatunde)
- a fix for the corner case of the DMA atomic pool based allocations
(Sai Sree Kartheek Adivi)
* tag 'dma-mapping-6.19-2026-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma/pool: distinguish between missing and exhausted atomic pools
of: reserved_mem: Allow reserved_mem framework detect "cma=" kernel param
|
|
This variant of skb_header_pointer() should be used in contexts
where @offset argument is user-controlled and could be negative.
Negative offsets are supported, as long as the zone starts
between skb->head and skb->data.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260128141539.3404400-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|