Age | Commit message (Collapse) | Author | Files | Lines |
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There is a spelling mistake in a fail error message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250520080657.30726-1-colin.i.king@gmail.com
|
|
The kerneldoc for futex_wait_setup() states it can return "0" or "<1".
This isn't true because the error case is "<0" not less than 1.
Document that <0 is returned on error. Drop the possible return values
and state possible reasons.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20250517151455.1065363-6-bigeasy@linutronix.de
|
|
The prctl.h ABI header was slightly updated during the development of
the interface. In particular the "immutable" parameter became a bit in
the option argument.
Synchronize prctl.h ABI header again and make use of the definition in
the testsuite and "perf bench futex".
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20250517151455.1065363-5-bigeasy@linutronix.de
|
|
There is no need for an explicit NULL pointer initialisation plus a
comment why it is okay. RCU_INIT_POINTER() can be used for NULL
initialisations and it is documented.
This has been build tested with gcc version 9.3.0 (Debian 9.3.0-22) on a
x86-64 defconfig.
Fixes: 094ac8cff7858 ("futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250517151455.1065363-4-bigeasy@linutronix.de
|
|
Use TAP output for easier automated testing.
Suggested-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20250517151455.1065363-3-bigeasy@linutronix.de
|
|
Use TAP output for easier automated testing.
Suggested-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20250517151455.1065363-2-bigeasy@linutronix.de
|
|
Fix those:
./kernel/futex/futex.h:208: warning: Function parameter or struct member 'drop_hb_ref' not described in 'futex_q'
./kernel/futex/waitwake.c:343: warning: expecting prototype for futex_wait_queue(). Prototype was for futex_do_wait() instead
./kernel/futex/waitwake.c:594: warning: Function parameter or struct member 'task' not described in 'futex_wait_setup'
Fixes: 93f1b6d79a73 ("futex: Move futex_queue() into futex_wait_setup()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250512185641.0450a99b@canb.auug.org.au # report
Link: https://lore.kernel.org/r/20250515171641.24073-1-bp@kernel.org # submission
|
|
futex_mm_init()
The following commit added an rcu_assign_pointer() assignment to
futex_mm_init() in <linux/futex.h>:
bd54df5ea7ca ("futex: Allow to resize the private local hash")
Which breaks the build on older compilers (gcc-9, x86-64 defconfig):
CC io_uring/futex.o
In file included from ./arch/x86/include/generated/asm/rwonce.h:1,
from ./include/linux/compiler.h:390,
from ./include/linux/array_size.h:5,
from ./include/linux/kernel.h:16,
from io_uring/futex.c:2:
./include/linux/futex.h: In function 'futex_mm_init':
./include/linux/rcupdate.h:555:36: error: dereferencing pointer to incomplete type 'struct futex_private_hash'
The problem is that this variant of rcu_assign_pointer() wants to
know the full type of 'struct futex_private_hash', which type
is local to futex.c:
kernel/futex/core.c:struct futex_private_hash {
There are a couple of mechanical solutions for this bug:
- we can uninline futex_mm_init() and move it into futex/core.c
- or we can share the structure definition with kernel/fork.c.
But both of these solutions have disadvantages: the first one adds
runtime overhead, while the second one dis-encapsulates private
futex types.
A third solution, implemented by this patch, is to just initialize
mm->futex_phash with NULL like the patch below, it's not like this
new MM's ->futex_phash can be observed externally until the task
is inserted into the task list, which guarantees full store ordering.
The relaxation of this initialization might also give a tiny speedup
on certain platforms.
Fixes: bd54df5ea7ca ("futex: Allow to resize the private local hash")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/aB8SI00EHBri23lB@gmail.com
|
|
Since commit 2070887fdeac ("futex: fix restart in wait_requeue_pi"),
futex_wait_requeue_pi() no longer uses restart_block. Update the comment
accordingly.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250428193445.4571-1-namcao@linutronix.de
|
|
There have been recent reports about running out of lockdep keys:
MAX_LOCKDEP_KEYS too low!
One possible reason is that too many dynamic keys have been registered.
A possible culprit is the lockdep_register_key() call in qdisc_alloc()
of net/sched/sch_generic.c.
Currently, there is no way to find out how many dynamic keys have been
registered. Add such a stat to the /proc/lockdep_stats to get better
clarity.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: llvm@lists.linux.dev
Link: https://lore.kernel.org/r/20250506042049.50060-4-boqun.feng@gmail.com
|
|
To catch the code trying to use a subclass value >= MAX_LOCKDEP_SUBCLASSES (8),
add a DEBUG_LOCKS_WARN_ON() statement to notify the users that such a
large value is not allowed.
[ boqun: Reword the commit log with a more objective tone ]
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: llvm@lists.linux.dev
Link: https://lore.kernel.org/r/20250506042049.50060-3-boqun.feng@gmail.com
|
|
When hlock_equal() is unused, it prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y, CONFIG_LOCKDEP=y and
CONFIG_LOCKDEP_SMALL=n:
lockdep.c:2005:20: error: unused function 'hlock_equal' [-Werror,-Wunused-function]
Fix this by moving the function to the respective existing ifdeffery
for its the only user.
See also:
6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build")
Fixes: 68e305678583 ("lockdep: Adjust check_redundant() for recursive read change")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: llvm@lists.linux.dev
Link: https://lore.kernel.org/r/20250506042049.50060-2-boqun.feng@gmail.com
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools fixes from Namhyung Kim:
"Just a couple of build fixes on arm64"
* tag 'perf-tools-fixes-for-v6.15-2025-05-04' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
perf tools: Fix in-source libperf build
perf tools: Fix arm64 build by generating unistd_64.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix read out of bounds bug in tracing_splice_read_pipe()
The size of the sub page being read can now be greater than a page.
But the buffer used in tracing_splice_read_pipe() only allocates a
page size. The data copied to the buffer is the amount in sub buffer
which can overflow the buffer.
Use min((size_t)trace_seq_used(&iter->seq), PAGE_SIZE) to limit the
amount copied to the buffer to a max of PAGE_SIZE.
- Fix the test for NULL from "!filter_hash" to "!*filter_hash"
The add_next_hash() function checked for NULL at the wrong pointer
level.
- Do not use the array in trace_adjust_address() if there are no
elements
The trace_adjust_address() finds the offset of a module that was
stored in the persistent buffer when reading the previous boot buffer
to see if the address belongs to a module that was loaded in the
previous boot. An array is created that matches currently loaded
modules with previously loaded modules. The trace_adjust_address()
uses that array to find the new offset of the address that's in the
previous buffer. But if no module was loaded, it ends up reading the
last element in an array that was never allocated.
Check if nr_entries is zero and exit out early if it is.
- Remove nested lock of trace_event_sem in print_event_fields()
The print_event_fields() function iterates over the ftrace_events
list and requires the trace_event_sem semaphore held for read. But
this function is always called with that semaphore held for read.
Remove the taking of the semaphore and replace it with
lockdep_assert_held_read(&trace_event_sem)
* tag 'trace-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Do not take trace_event_sem in print_event_fields()
tracing: Fix trace_adjust_address() when there is no modules in scratch area
ftrace: Fix NULL memory allocation check
tracing: Fix oob write in trace_seq_to_buffer()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fix from Helge Deller:
"Fix a double SIGFPE crash"
* tag 'parisc-for-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix double SIGFPE crash
|
|
Camm noticed that on parisc a SIGFPE exception will crash an application with
a second SIGFPE in the signal handler. Dave analyzed it, and it happens
because glibc uses a double-word floating-point store to atomically update
function descriptors. As a result of lazy binding, we hit a floating-point
store in fpe_func almost immediately.
When the T bit is set, an assist exception trap occurs when when the
co-processor encounters *any* floating-point instruction except for a double
store of register %fr0. The latter cancels all pending traps. Let's fix this
by clearing the Trap (T) bit in the FP status register before returning to the
signal handler in userspace.
The issue can be reproduced with this test program:
root@parisc:~# cat fpe.c
static void fpe_func(int sig, siginfo_t *i, void *v) {
sigset_t set;
sigemptyset(&set);
sigaddset(&set, SIGFPE);
sigprocmask(SIG_UNBLOCK, &set, NULL);
printf("GOT signal %d with si_code %ld\n", sig, i->si_code);
}
int main() {
struct sigaction action = {
.sa_sigaction = fpe_func,
.sa_flags = SA_RESTART|SA_SIGINFO };
sigaction(SIGFPE, &action, 0);
feenableexcept(FE_OVERFLOW);
return printf("%lf\n",1.7976931348623158E308*1.7976931348623158E308);
}
root@parisc:~# gcc fpe.c -lm
root@parisc:~# ./a.out
Floating point exception
root@parisc:~# strace -f ./a.out
execve("./a.out", ["./a.out"], 0xf9ac7034 /* 20 vars */) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0
...
rt_sigaction(SIGFPE, {sa_handler=0x1110a, sa_mask=[], sa_flags=SA_RESTART|SA_SIGINFO}, NULL, 8) = 0
--- SIGFPE {si_signo=SIGFPE, si_code=FPE_FLTOVF, si_addr=0x1078f} ---
--- SIGFPE {si_signo=SIGFPE, si_code=FPE_FLTOVF, si_addr=0xf8f21237} ---
+++ killed by SIGFPE +++
Floating point exception
Signed-off-by: Helge Deller <deller@gmx.de>
Suggested-by: John David Anglin <dave.anglin@bell.net>
Reported-by: Camm Maguire <camm@maguirefamily.org>
Cc: stable@vger.kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fixes from Borislav Petkov:
- Test the correct structure member when handling correctable errors
and avoid spurious interrupts, in altera_edac
* tag 'edac_urgent_for_v6.15_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/altera: Set DDR and SDMMC interrupt mask before registration
EDAC/altera: Test the correct error reg offset
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"Fix SEV-SNP memory acceptance from the EFI stub for guests
running at VMPL >0"
* tag 'x86-urgent-2025-05-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/sev: Support memory acceptance in the EFI stub under SVSM
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc perf fixes from Ingo Molnar:
- Require group events for branch counter groups and
PEBS counter snapshotting groups to be x86 events.
- Fix the handling of counter-snapshotting of non-precise
events, where counter values may move backwards a bit,
temporarily, confusing the code.
- Restrict perf/KVM PEBS to guest-owned events.
* tag 'perf-urgent-2025-05-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: KVM: Mask PEBS_ENABLE loaded for guest with vCPU's value.
perf/x86/intel/ds: Fix counter backwards of non-precise events counters-snapshotting
perf/x86/intel: Check the X86 leader for pebs_counter_event_group
perf/x86/intel: Only check the group flag for X86 leader
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
- Prevent NULL pointer dereference in msi_domain_debug_show()
- Fix crash in the qcom-mpm irqchip driver when configuring
interrupts for non-wake GPIOs
* tag 'irq-urgent-2025-05-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/qcom-mpm: Prevent crash when trying to handle non-wake GPIOs
genirq/msi: Prevent NULL pointer dereference in msi_domain_debug_show()
|
|
Commit:
d54d610243a4 ("x86/boot/sev: Avoid shared GHCB page for early memory acceptance")
provided a fix for SEV-SNP memory acceptance from the EFI stub when
running at VMPL #0. However, that fix was insufficient for SVSM SEV-SNP
guests running at VMPL >0, as those rely on a SVSM calling area, which
is a shared buffer whose address is programmed into a SEV-SNP MSR, and
the SEV init code that sets up this calling area executes much later
during the boot.
Given that booting via the EFI stub at VMPL >0 implies that the firmware
has configured this calling area already, reuse it for performing memory
acceptance in the EFI stub.
Fixes: fcd042e86422 ("x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0")
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250428174322.2780170-2-ardb+git@google.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Add missing sentinels to the arm64 Spectre-BHB MIDR arrays, otherwise
is_midr_in_range_list() reads beyond the end of these arrays"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: errata: Add missing sentinels to Spectre-BHB MIDR arrays
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fix from Wolfram Sang:
- imx-lpi2c: fix clock error handling sequence in probe
* tag 'i2c-for-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: imx-lpi2c: Fix clock count when probe defers
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A bunch of small fixes. Mostly driver specific.
- An OOB access fix in core UMP rawmidi conversion code
- Fix for ASoC DAPM hw_params widget sequence
- Make retry of usb_set_interface() errors for flaky devices
- Fix redundant USB MIDI name strings
- Quirks for various HP and ASUS models with HD-audio, and
Jabra Evolve 65 USB-audio
- Cirrus Kunit test fixes
- Various fixes for ASoC Intel, stm32, renesas, imx-card, and
simple-card"
* tag 'sound-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (30 commits)
ASoC: amd: ps: fix for irq handler return status
ASoC: simple-card-utils: Fix pointer check in graph_util_parse_link_direction
ASoC: intel/sdw_utils: Add volume limit to cs35l56 speakers
ASoC: intel/sdw_utils: Add volume limit to cs42l43 speakers
ASoC: stm32: sai: add a check on minimal kernel frequency
ASoC: stm32: sai: skip useless iterations on kernel rate loop
ALSA: hda/realtek - Add more HP laptops which need mute led fixup
ALSA: hda/realtek: Fix built-mic regression on other ASUS models
ASoC: Intel: catpt: avoid type mismatch in dev_dbg() format
ALSA: usb-audio: Fix duplicated name in MIDI substream names
ALSA: ump: Fix buffer overflow at UMP SysEx message conversion
ALSA: usb-audio: Add second USB ID for Jabra Evolve 65 headset
ALSA: hda/realtek: Add quirk for HP Spectre x360 15-df1xxx
ALSA: hda: Apply volume control on speaker+lineout for HP EliteStudio AIO
ASoC: Intel: bytcr_rt5640: Add DMI quirk for Acer Aspire SW3-013
ASoC: amd: acp: Fix devm_snd_soc_register_card(acp-pdm-mach) failure
ASoC: amd: acp: Fix NULL pointer deref in acp_i2s_set_tdm_slot
ASoC: amd: acp: Fix NULL pointer deref on acp resume path
ASoC: renesas: rz-ssi: Use NOIRQ_SYSTEM_SLEEP_PM_OPS()
ASoC: soc-acpi-intel-ptl-match: add empty item to ptl_cs42l43_l3[]
...
|
|
Implement a simple NUMA aware spinlock for testing and howto purposes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Test the basic functionality for the NUMA and MPOL flags:
- FUTEX2_NUMA should take the NUMA node which is after the uaddr
and use it.
- Only update the node if FUTEX_NO_NODE was set by the user
- FUTEX2_MPOL should use the memory based on the policy. I attempted to
set the node with mbind() and then use this with MPOL but this fails
and futex falls back to the default node for the current CPU.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-22-bigeasy@linutronix.de
|
|
Test the basic functionality of the private hash:
- Upon start, with no threads there is no private hash.
- The first thread initializes the private hash.
- More than four threads will increase the size of the private hash if
the system has more than 16 CPUs online.
- Once the user sets the size of private hash, auto scaling is disabled.
- The user is only allowed to use numbers to the power of two.
- The user may request the global or make the hash immutable.
- Once the global hash has been set or the hash has been made immutable,
further changes are not allowed.
- Futex operations should work the whole time. It must be possible to
hold a lock, such a PI initialised mutex, during the resize operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-21-bigeasy@linutronix.de
|
|
Make it build without relying on recent headers.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Add the -b/ --buckets argument to specify the number of hash buckets for
the private futex hash. This is directly passed to
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, buckets, immutable)
and must return without an error if specified. The `immutable' is 0 by
default and can be set to 1 via the -I/ --immutable argument.
The size of the private hash is verified with PR_FUTEX_HASH_GET_SLOTS.
If PR_FUTEX_HASH_GET_SLOTS failed then it is assumed that an older
kernel was used without the support and that the global hash is used.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-20-bigeasy@linutronix.de
|
|
Synchronize prctl.h with current uapi version after adding
PR_FUTEX_HASH.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-19-bigeasy@linutronix.de
|
|
Extend the futex2 interface to be aware of mempolicy.
When FUTEX2_MPOL is specified and there is a MPOL_PREFERRED or
home_node specified covering the futex address, use that hash-map.
Notably, in this case the futex will go to the global node hashtable,
even if it is a PRIVATE futex.
When FUTEX2_NUMA|FUTEX2_MPOL is specified and the user specified node
value is FUTEX_NO_NODE, the MPOL lookup (as described above) will be
tried first before reverting to setting node to the local node.
[bigeasy: add CONFIG_FUTEX_MPOL, add MPOL to FUTEX2_VALID_MASK, write
the node only to user if FUTEX_NO_NODE was supplied]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-18-bigeasy@linutronix.de
|
|
Extend the futex2 interface to be numa aware.
When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know, the second one will be the node to place this futex on.
struct futex_numa_32 {
u32 val;
u32 node;
};
When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it. If userspace corrupts the node value
between WAIT and WAKE, the futex will not be found and no wakeup will
happen.
When FUTEX2_NUMA is not set, the node is simply an extension of the
hash, such that traditional futexes are still interleaved over the
nodes.
This is done to avoid having to have a separate !numa hash-table.
[bigeasy: ensure to have at least hashsize of 4 in futex_init(), add
pr_info() for size and allocation information. Cast the naddr math to
void*]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-17-bigeasy@linutronix.de
|
|
My initial testing showed that:
perf bench futex hash
reported less operations/sec with private hash. After using the same
amount of buckets in the private hash as used by the global hash then
the operations/sec were about the same.
This changed once the private hash became resizable. This feature added
an RCU section and reference counting via atomic inc+dec operation into
the hot path.
The reference counting can be avoided if the private hash is made
immutable.
Extend PR_FUTEX_HASH_SET_SLOTS by a fourth argument which denotes if the
private should be made immutable. Once set (to true) the a further
resize is not allowed (same if set to global hash).
Add PR_FUTEX_HASH_GET_IMMUTABLE which returns true if the hash can not
be changed.
Update "perf bench" suite.
For comparison, results of "perf bench futex hash -s":
- Xeon CPU E5-2650, 2 NUMA nodes, total 32 CPUs:
- Before the introducing task local hash
shared Averaged 1.487.148 operations/sec (+- 0,53%), total secs = 10
private Averaged 2.192.405 operations/sec (+- 0,07%), total secs = 10
- With the series
shared Averaged 1.326.342 operations/sec (+- 0,41%), total secs = 10
-b128 Averaged 141.394 operations/sec (+- 1,15%), total secs = 10
-Ib128 Averaged 851.490 operations/sec (+- 0,67%), total secs = 10
-b8192 Averaged 131.321 operations/sec (+- 2,13%), total secs = 10
-Ib8192 Averaged 1.923.077 operations/sec (+- 0,61%), total secs = 10
128 is the default allocation of hash buckets.
8192 was the previous amount of allocated hash buckets.
- Xeon(R) CPU E7-8890 v3, 4 NUMA nodes, total 144 CPUs:
- Before the introducing task local hash
shared Averaged 1.810.936 operations/sec (+- 0,26%), total secs = 20
private Averaged 2.505.801 operations/sec (+- 0,05%), total secs = 20
- With the series
shared Averaged 1.589.002 operations/sec (+- 0,25%), total secs = 20
-b1024 Averaged 42.410 operations/sec (+- 0,20%), total secs = 20
-Ib1024 Averaged 740.638 operations/sec (+- 1,51%), total secs = 20
-b65536 Averaged 48.811 operations/sec (+- 1,35%), total secs = 20
-Ib65536 Averaged 1.963.165 operations/sec (+- 0,18%), total secs = 20
1024 is the default allocation of hash buckets.
65536 was the previous amount of allocated hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20250416162921.513656-16-bigeasy@linutronix.de
|
|
The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/
replacement. The futex_hash_allocate()/ PR_FUTEX_HASH_SET_SLOTS
operation can now be invoked at runtime and resize an already existing
internal private futex_hash_bucket to another size.
The reallocation is based on an idea by Thomas Gleixner: The initial
allocation of struct futex_private_hash sets the reference count
to one. Every user acquires a reference on the local hash before using
it and drops it after it enqueued itself on the hash bucket. There is no
reference held while the task is scheduled out while waiting for the
wake up.
The resize process allocates a new struct futex_private_hash and drops
the initial reference. Synchronized with mm_struct::futex_hash_lock it
is checked if the reference counter for the currently used
mm_struct::futex_phash is marked as DEAD. If so, then all users enqueued
on the current private hash are requeued on the new private hash and the
new private hash is set to mm_struct::futex_phash. Otherwise the newly
allocated private hash is saved as mm_struct::futex_phash_new and the
rehashing and reassigning is delayed to the futex_hash() caller once the
reference counter is marked DEAD.
The replacement is not performed at rcuref_put() time because certain
callers, such as futex_wait_queue(), drop their reference after changing
the task state. This change will be destroyed once the futex_hash_lock
is acquired.
The user can change the number slots with PR_FUTEX_HASH_SET_SLOTS
multiple times. An increase and decrease is allowed and request blocks
until the assignment is done.
The private hash allocated at thread creation is changed from 16 to
16 <= 4 * number_of_threads <= global_hash_size
where number_of_threads can not exceed the number of online CPUs. Should
the user PR_FUTEX_HASH_SET_SLOTS then the auto scaling is disabled.
[peterz: reorganize the code to avoid state tracking and simplify new
object handling, block the user until changes are in effect, allow
increase and decrease of the hash].
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-15-bigeasy@linutronix.de
|
|
Allocate a private futex hash with 16 slots if a task forks its first
thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-14-bigeasy@linutronix.de
|
|
The futex hash is system wide and shared by all tasks. Each slot
is hashed based on futex address and the VMA of the thread. Due to
randomized VMAs (and memory allocations) the same logical lock (pointer)
can end up in a different hash bucket on each invocation of the
application. This in turn means that different applications may share a
hash bucket on the first invocation but not on the second and it is not
always clear which applications will be involved. This can result in
high latency's to acquire the futex_hash_bucket::lock especially if the
lock owner is limited to a CPU and can not be effectively PI boosted.
Introduce basic infrastructure for process local hash which is shared by
all threads of process. This hash will only be used for a
PROCESS_PRIVATE FUTEX operation.
The hashmap can be allocated via:
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num);
A `num' of 0 means that the global hash is used instead of a private
hash.
Other values for `num' specify the number of slots for the hash and the
number must be power of two, starting with two.
The prctl() returns zero on success. This function can only be used
before a thread is created.
The current status for the private hash can be queried via:
num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which return the current number of slots. The value 0 means that the
global hash is used. Values greater than 0 indicate the number of slots
that are used. A negative number indicates an error.
For optimisation, for the private hash jhash2() uses only two arguments
the address and the offset. This omits the VMA which is always the same.
[peterz: Use 0 for global hash. A bit shuffling and renaming. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-13-bigeasy@linutronix.de
|
|
Factor out the futex_hash_bucket initialisation into a helpr function.
The helper function will be used in a follow up patch implementing
process private hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-12-bigeasy@linutronix.de
|
|
futex_lock_pi() and __fixup_pi_state_owner() acquire the
futex_q::lock_ptr without holding a reference assuming the previously
obtained hash bucket and the assigned lock_ptr are still valid. This
isn't the case once the private hash can be resized and becomes invalid
after the reference drop.
Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in
futex_q::lock_ptr. The lock pointer is read in a RCU section to ensure
that it does not go away if the hash bucket has been replaced and the
old pointer has been observed. After locking the pointer needs to be
compared to check if it changed. If so then the hash bucket has been
replaced and the user has been moved to the new one and lock_ptr has
been updated. The lock operation needs to be redone in this case.
The locked hash bucket is not returned.
A special case is an early return in futex_lock_pi() (due to signal or
timeout) and a successful futex_wait_requeue_pi(). In both cases a valid
futex_q::lock_ptr is expected (and its matching hash bucket) but since
the waiter has been removed from the hash this can no longer be
guaranteed. Therefore before the waiter is removed and a reference is
acquired which is later dropped by the waiter to avoid a resize.
Add futex_q_lockptr_lock() and use it.
Acquire an additional reference in requeue_pi_wake_futex() and
futex_unlock_pi() while the futex_q is removed, denote this extra
reference in futex_q::drop_hb_ref and let the waiter drop the reference
in this case.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-11-bigeasy@linutronix.de
|
|
To support runtime resizing of the process private hash, it's required
to not use the obtained hash bucket once the reference count has been
dropped. The reference will be dropped after the unlock of the hash
bucket.
The amount of waiters is decremented after the unlock operation. There
is no requirement that this needs to happen after the unlock. The
increment happens before acquiring the lock to signal early that there
will be a waiter. The waiter can avoid blocking on the lock if it is
known that there will be no waiter.
There is no difference in terms of ordering if the decrement happens
before or after the unlock.
Decrease the waiter count before the unlock operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-10-bigeasy@linutronix.de
|
|
futex_wait_multiple_setup() changes task_struct::__state to
!TASK_RUNNING and then enqueues on multiple futexes. Every
futex_q_lock() acquires a reference on the global hash which is
dropped later.
If a rehash is in progress then the loop will block on
mm_struct::futex_hash_bucket for the rehash to complete and this will
lose the previously set task_struct::__state.
Acquire a reference on the local hash to avoiding blocking on
mm_struct::futex_hash_bucket.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-9-bigeasy@linutronix.de
|
|
This gets us:
fph = futex_private_hash(key) /* gets fph and inc users */
futex_private_hash_get(fph) /* inc users */
futex_private_hash_put(fph) /* dec users */
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-8-bigeasy@linutronix.de
|
|
This gets us:
hb = futex_hash(key) /* gets hb and inc users */
futex_hash_get(hb) /* inc users */
futex_hash_put(hb) /* dec users */
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-7-bigeasy@linutronix.de
|
|
Create explicit scopes for hb variables; almost pure re-indent.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-6-bigeasy@linutronix.de
|
|
Getting the hash bucket and queuing it are two distinct actions. In
light of wanting to add a put hash bucket function later, untangle
them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-5-bigeasy@linutronix.de
|
|
futex_wait_setup() has a weird calling convention in order to return
hb to use as an argument to futex_queue().
Mostly such that requeue can have an extra test in between.
Reorder code a little to get rid of this and keep the hb usage inside
futex_wait_setup().
[bigeasy: fixes]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-4-bigeasy@linutronix.de
|
|
To enable node specific hash-tables using huge pages if possible.
[bigeasy: use __vmalloc_node_range_noprof(), add nommu bits, inline
vmalloc_huge]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250416162921.513656-3-bigeasy@linutronix.de
|
|
rcuref_read() returns the number of references that are currently held.
If 0 is returned then it is not safe to assume that the object ca be
scheduled for deconstruction because it is marked DEAD. This happens if
the return value of rcuref_put() is ignored and assumptions are made.
If 0 is returned then the counter transitioned from 0 to RCUREF_NOREF.
If rcuref_put() did not return to the caller then the counter did not
yet transition from RCUREF_NOREF to RCUREF_DEAD. This means that there
is still a chance that the counter will transition from RCUREF_NOREF to
0 meaning it is still valid and must not be deconstructed. In this brief
window rcuref_read() will return 0.
Provide rcuref_is_dead() to determine if the counter is marked as
RCUREF_DEAD.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-2-bigeasy@linutronix.de
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A fairly small pile of fixes, plus one new compatible string addition
to the Synopsis driver for a new platform.
The most notable thing is the fix for divide by zeros in spi-mem if an
operation has no dummy bytes"
* tag 'spi-fix-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: tegra114: Don't fail set_cs_timing when delays are zero
spi: spi-qpic-snand: fix NAND_READ_LOCATION_2 register handling
spi: spi-mem: Add fix to avoid divide error
spi: dt-bindings: snps,dw-apb-ssi: Add compatible for SOPHGO SG2042 SoC
spi: dt-bindings: snps,dw-apb-ssi: Merge duplicate compatible entry
spi: spi-qpic-snand: propagate errors from qcom_spi_block_erase()
spi: stm32-ospi: Fix an error handling path in stm32_ospi_probe()
|