summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-02-15Merge tag 'linux_kselftest-kunit-fixes-6.8-rc5' of ↵Linus Torvalds3-0/+19
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit fix from Shuah Khan: "One important fix to unregister kunit_bus when KUnit module is unloaded. Not doing so causes an error when KUnit module tries to re-register the bus when it gets reloaded" * tag 'linux_kselftest-kunit-fixes-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: device: Unregister the kunit_bus on shutdown
2024-02-15netfilter: nf_tables: fix bidirectional offload regressionFelix Fietkau1-0/+1
Commit 8f84780b84d6 ("netfilter: flowtable: allow unidirectional rules") made unidirectional flow offload possible, while completely ignoring (and breaking) bidirectional flow offload for nftables. Add the missing flag that was left out as an exercise for the reader :) Cc: Vlad Buslov <vladbu@nvidia.com> Fixes: 8f84780b84d6 ("netfilter: flowtable: allow unidirectional rules") Reported-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-02-15netfilter: nat: restore default DNAT behaviorKyle Swenson1-1/+4
When a DNAT rule is configured via iptables with different port ranges, iptables -t nat -A PREROUTING -p tcp -d 10.0.0.2 -m tcp --dport 32000:32010 -j DNAT --to-destination 192.168.0.10:21000-21010 we seem to be DNATing to some random port on the LAN side. While this is expected if --random is passed to the iptables command, it is not expected without passing --random. The expected behavior (and the observed behavior prior to the commit in the "Fixes" tag) is the traffic will be DNAT'd to 192.168.0.10:21000 unless there is a tuple collision with that destination. In that case, we expect the traffic to be instead DNAT'd to 192.168.0.10:21001, so on so forth until the end of the range. This patch intends to restore the behavior observed prior to the "Fixes" tag. Fixes: 6ed5943f8735 ("netfilter: nat: remove l4 protocol port rovers") Signed-off-by: Kyle Swenson <kyle.swenson@est.tech> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-02-15netfilter: nft_set_pipapo: fix missing : in kdocPablo Neira Ayuso1-2/+2
Add missing : in kdoc field names. Fixes: 8683f4b9950d ("nft_set_pipapo: Prepare for vectorised implementation: helpers") Reported-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-02-15modpost: trim leading spaces when processing source files listRadek Krejci1-1/+6
get_line() does not trim the leading spaces, but the parse_source_files() expects to get lines with source files paths where the first space occurs after the file path. Fixes: 70f30cfe5b89 ("modpost: use read_text_file() and get_line() for reading text files") Signed-off-by: Radek Krejci <radek.krejci@oracle.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-15gen_compile_commands: fix invalid escape sequence warningAndrew Ballance1-1/+1
With python 3.12, '\#' results in this warning SyntaxWarning: invalid escape sequence '\#' Signed-off-by: Andrew Ballance <andrewjballance@gmail.com> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-15kbuild: Fix changing ELF file type for output of gen_btf for big endianNathan Chancellor1-2/+7
Commit 90ceddcb4950 ("bpf: Support llvm-objcopy for vmlinux BTF") changed the ELF type of .btf.vmlinux.bin.o to ET_REL via dd, which works fine for little endian platforms: 00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............| -00000010 03 00 b7 00 01 00 00 00 00 00 00 80 00 80 ff ff |................| +00000010 01 00 b7 00 01 00 00 00 00 00 00 80 00 80 ff ff |................| However, for big endian platforms, it changes the wrong byte, resulting in an invalid ELF file type, which ld.lld rejects: 00000000 7f 45 4c 46 02 02 01 00 00 00 00 00 00 00 00 00 |.ELF............| -00000010 00 03 00 16 00 00 00 01 00 00 00 00 00 10 00 00 |................| +00000010 01 03 00 16 00 00 00 01 00 00 00 00 00 10 00 00 |................| Type: <unknown>: 103 ld.lld: error: .btf.vmlinux.bin.o: unknown file type Fix this by updating the entire 16-bit e_type field rather than just a single byte, so that everything works correctly for all platforms and linkers. 00000000 7f 45 4c 46 02 02 01 00 00 00 00 00 00 00 00 00 |.ELF............| -00000010 00 03 00 16 00 00 00 01 00 00 00 00 00 10 00 00 |................| +00000010 00 01 00 16 00 00 00 01 00 00 00 00 00 10 00 00 |................| Type: REL (Relocatable file) While in the area, update the comment to mention that binutils 2.35+ matches LLD's behavior of rejecting an ET_EXEC input, which occurred after the comment was added. Cc: stable@vger.kernel.org Fixes: 90ceddcb4950 ("bpf: Support llvm-objcopy for vmlinux BTF") Link: https://github.com/llvm/llvm-project/pull/75643 Suggested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Fangrui Song <maskray@google.com> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-15docs: kconfig: Fix grammar and formattingThorsten Blum1-3/+3
- Remove unnecessary spaces - Fix grammar s/to solution/solution/ Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-15i2c: i801: Fix block process call transactionsJean Delvare1-2/+2
According to the Intel datasheets, software must reset the block buffer index twice for block process call transactions: once before writing the outgoing data to the buffer, and once again before reading the incoming data from the buffer. The driver is currently missing the second reset, causing the wrong portion of the block buffer to be read. Signed-off-by: Jean Delvare <jdelvare@suse.de> Reported-by: Piotr Zakowski <piotr.zakowski@intel.com> Closes: https://lore.kernel.org/linux-i2c/20240213120553.7b0ab120@endymion.delvare/ Fixes: 315cd67c9453 ("i2c: i801: Add Block Write-Block Read Process Call support") Reviewed-by: Alexander Sverdlin <alexander.sverdlin@gmail.com> Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
2024-02-15i2c: pasemi: split driver into two separate modulesArnd Bergmann2-4/+8
On powerpc, it is possible to compile test both the new apple (arm) and old pasemi (powerpc) drivers for the i2c hardware at the same time, which leads to a warning about linking the same object file twice: scripts/Makefile.build:244: drivers/i2c/busses/Makefile: i2c-pasemi-core.o is added to multiple modules: i2c-apple i2c-pasemi Rework the driver to have an explicit helper module, letting Kbuild take care of whether this should be built-in or a loadable driver. Fixes: 9bc5f4f660ff ("i2c: pasemi: Split pci driver to its own file") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Sven Peter <sven@svenpeter.dev> Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
2024-02-15kbuild: use 4-space indentation when followed by conditionalsMasahiro Yamada4-14/+14
GNU Make manual [1] clearly forbids a tab at the beginning of the conditional directive line: "Extra spaces are allowed and ignored at the beginning of the conditional directive line, but a tab is not allowed." This will not work for the next release of GNU Make, hence commit 82175d1f9430 ("kbuild: Replace tabs with spaces when followed by conditionals") replaced the inappropriate tabs with 8 spaces. However, the 8-space indentation cannot be visually distinguished. Linus suggested 2-4 spaces for those nested if-statements. [2] This commit redoes the replacement with 4 spaces. [1]: https://www.gnu.org/software/make/manual/make.html#Conditional-Syntax [2]: https://lore.kernel.org/all/CAHk-=whJKZNZWsa-VNDKafS_VfY4a5dAjG-r8BZgWk_a-xSepw@mail.gmail.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-14lsm: fix integer overflow in lsm_set_self_attr() syscallJann Horn1-2/+5
security_setselfattr() has an integer overflow bug that leads to out-of-bounds access when userspace provides bogus input: `lctx->ctx_len + sizeof(*lctx)` is checked against `lctx->len` (and, redundantly, also against `size`), but there are no checks on `lctx->ctx_len`. Therefore, userspace can provide an `lsm_ctx` with `->ctx_len` set to a value between `-sizeof(struct lsm_ctx)` and -1, and this bogus `->ctx_len` will then be passed to an LSM module as a buffer length, causing LSM modules to perform out-of-bounds accesses. The following reproducer will demonstrate this under ASAN (if AppArmor is loaded as an LSM): ``` struct lsm_ctx { uint64_t id; uint64_t flags; uint64_t len; uint64_t ctx_len; char ctx[]; }; int main(void) { size_t size = sizeof(struct lsm_ctx); struct lsm_ctx *ctx = malloc(size); ctx->id = 104/*LSM_ID_APPARMOR*/; ctx->flags = 0; ctx->len = size; ctx->ctx_len = -sizeof(struct lsm_ctx); syscall( 460/*__NR_lsm_set_self_attr*/, /*attr=*/ 100/*LSM_ATTR_CURRENT*/, /*ctx=*/ ctx, /*size=*/ size, /*flags=*/ 0 ); } ``` Fixes: a04a1198088a ("LSM: syscalls for current process attributes") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subj tweak, removed ref to ASAN splat that isn't included] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-02-14igc: Remove temporary workaroundSasha Neftin1-5/+1
PHY_CONTROL register works as defined in the IEEE 802.3 specification (IEEE 802.3-2008 22.2.4.1). Tidy up the temporary workaround. User impact: PHY can now be powered down when the ethernet link is down. Testing hints: ip link set down <device> (or just disconnect the ethernet cable). Oldest tested NVM version is: 1045:740. Fixes: 5586838fe9ce ("igc: Add code for PHY support") Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-02-14igb: Fix string truncation warnings in igb_set_fw_versionKunwu Chan2-18/+19
Commit 1978d3ead82c ("intel: fix string truncation warnings") fixes '-Wformat-truncation=' warnings in igb_main.c by using kasprintf. drivers/net/ethernet/intel/igb/igb_main.c:3092:53: warning:‘%d’ directive output may be truncated writing between 1 and 5 bytes into a region of size between 1 and 13 [-Wformat-truncation=] 3092 | "%d.%d, 0x%08x, %d.%d.%d", | ^~ drivers/net/ethernet/intel/igb/igb_main.c:3092:34: note:directive argument in the range [0, 65535] 3092 | "%d.%d, 0x%08x, %d.%d.%d", | ^~~~~~~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/intel/igb/igb_main.c:3092:34: note:directive argument in the range [0, 65535] drivers/net/ethernet/intel/igb/igb_main.c:3090:25: note:‘snprintf’ output between 23 and 43 bytes into a destination of size 32 kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Fix this warning by using a larger space for adapter->fw_version, and then fall back and continue to use snprintf. Fixes: 1978d3ead82c ("intel: fix string truncation warnings") Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Cc: Kunwu Chan <kunwu.chan@hotmail.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-02-14tracing: Inform kmemleak of saved_cmdlines allocationSteven Rostedt (Google)1-0/+3
The allocation of the struct saved_cmdlines_buffer structure changed from: s = kmalloc(sizeof(*s), GFP_KERNEL); s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL); to: orig_size = sizeof(*s) + val * TASK_COMM_LEN; order = get_order(orig_size); size = 1 << (order + PAGE_SHIFT); page = alloc_pages(GFP_KERNEL, order); if (!page) return NULL; s = page_address(page); memset(s, 0, sizeof(*s)); s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL); Where that s->saved_cmdlines allocation looks to be a dangling allocation to kmemleak. That's because kmemleak only keeps track of kmalloc() allocations. For allocations that use page_alloc() directly, the kmemleak needs to be explicitly informed about it. Add kmemleak_alloc() and kmemleak_free() around the page allocation so that it doesn't give the following false positive: unreferenced object 0xffff8881010c8000 (size 32760): comm "swapper", pid 0, jiffies 4294667296 hex dump (first 32 bytes): ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ backtrace (crc ae6ec1b9): [<ffffffff86722405>] kmemleak_alloc+0x45/0x80 [<ffffffff8414028d>] __kmalloc_large_node+0x10d/0x190 [<ffffffff84146ab1>] __kmalloc+0x3b1/0x4c0 [<ffffffff83ed7103>] allocate_cmdlines_buffer+0x113/0x230 [<ffffffff88649c34>] tracer_alloc_buffers.isra.0+0x124/0x460 [<ffffffff8864a174>] early_trace_init+0x14/0xa0 [<ffffffff885dd5ae>] start_kernel+0x12e/0x3c0 [<ffffffff885f5758>] x86_64_start_reservations+0x18/0x30 [<ffffffff885f582b>] x86_64_start_kernel+0x7b/0x80 [<ffffffff83a001c3>] secondary_startup_64_no_verify+0x15e/0x16b Link: https://lore.kernel.org/linux-trace-kernel/87r0hfnr9r.fsf@kernel.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240214112046.09a322d6@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic") Reported-by: Kalle Valo <kvalo@kernel.org> Tested-by: Kalle Valo <kvalo@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-14Merge tag 'kvm-riscv-fixes-6.8-1' of https://github.com/kvm-riscv/linux into ↵Paolo Bonzini2-11/+15
HEAD KVM/riscv fixes for 6.8, take #1 - Fix steal-time related sparse warnings
2024-02-14Merge tag 'kvm-x86-selftests-6.8-rcN' of https://github.com/kvm-x86/linux ↵Paolo Bonzini59-240/+237
into HEAD KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8: - Remove redundant newlines from error messages. - Delete an unused variable in the AMX test (which causes build failures when compiling with -Werror). - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE, and the test eventually got skipped). - Fix TSC related bugs in several Hyper-V selftests. - Fix a bug in the dirty ring logging test where a sem_post() could be left pending across multiple runs, resulting in incorrect synchronization between the main thread and the vCPU worker thread. - Relax the dirty log split test's assertions on 4KiB mappings to fix false positives due to the number of mappings for memslot 0 (used for code and data that is NOT being dirty logged) changing, e.g. due to NUMA balancing. - Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the function generates boolean values, and all callers treat the return value as a bool).
2024-02-14Merge tag 'kvm-x86-fixes-6.8-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2-12/+8
KVM x86 fixes for 6.8: - Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only if the incoming events->nmi.pending is non-zero. If the target vCPU is in the UNITIALIZED state, the spurious request will result in KVM exiting to userspace, which in turn causes QEMU to constantly acquire and release QEMU's global mutex, to the point where the BSP is unable to make forward progress. - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being incorrectly truncated, and ultimately causes KVM to think a fixed counter has already been disabled (KVM thinks the old value is '0'). - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace that is ultimately ignored due to ignore_msrs=true doesn't zero the output as intended.
2024-02-14bpf, scripts: Correct GPL license nameGianmarco Lusvardi1-1/+1
The bpf_doc script refers to the GPL as the "GNU Privacy License". I strongly suspect that the author wanted to refer to the GNU General Public License, under which the Linux kernel is released, as, to the best of my knowledge, there is no license named "GNU Privacy License". This patch corrects the license name in the script accordingly. Fixes: 56a092c89505 ("bpf: add script and prepare bpf.h for new helpers documentation") Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20240213230544.930018-3-glusvardi@posteo.net
2024-02-14nvmem: include bit index in cell sysfs file nameArnd Bergmann2-10/+11
Creating sysfs files for all Cells caused a boot failure for linux-6.8-rc1 on Apple M1, which (in downstream dts files) has multiple nvmem cells that use the same byte address. This causes the device probe to fail with [ 0.605336] sysfs: cannot create duplicate filename '/devices/platform/soc@200000000/2922bc000.efuse/apple_efuses_nvmem0/cells/efuse@a10' [ 0.605347] CPU: 7 PID: 1 Comm: swapper/0 Tainted: G S 6.8.0-rc1-arnd-5+ #133 [ 0.605355] Hardware name: Apple Mac Studio (M1 Ultra, 2022) (DT) [ 0.605362] Call trace: [ 0.605365] show_stack+0x18/0x2c [ 0.605374] dump_stack_lvl+0x60/0x80 [ 0.605383] dump_stack+0x18/0x24 [ 0.605388] sysfs_warn_dup+0x64/0x80 [ 0.605395] sysfs_add_bin_file_mode_ns+0xb0/0xd4 [ 0.605402] internal_create_group+0x268/0x404 [ 0.605409] sysfs_create_groups+0x38/0x94 [ 0.605415] devm_device_add_groups+0x50/0x94 [ 0.605572] nvmem_populate_sysfs_cells+0x180/0x1b0 [ 0.605682] nvmem_register+0x38c/0x470 [ 0.605789] devm_nvmem_register+0x1c/0x6c [ 0.605895] apple_efuses_probe+0xe4/0x120 [ 0.606000] platform_probe+0xa8/0xd0 As far as I can tell, this is a problem for any device with multiple cells on different bits of the same address. Avoid the issue by changing the file name to include the first bit number. Fixes: 0331c611949f ("nvmem: core: Expose cells through sysfs") Link: https://github.com/AsahiLinux/linux/blob/bd0a1a7d4/arch/arm64/boot/dts/apple/t600x-dieX.dtsi#L156 Cc: <regressions@lists.linux.dev> Cc: Miquel Raynal <miquel.raynal@bootlin.com> Cc: Rafał Miłecki <rafal@milecki.pl> Cc: Chen-Yu Tsai <wenst@chromium.org> Cc: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <asahi@lists.linux.dev> Cc: Sven Peter <sven@svenpeter.dev> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Reviewed-by: Eric Curtin <ecurtin@redhat.com> Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/r/20240209163454.98051-1-srinivas.kandagatla@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-14Merge tag 'iio-fixes-for-6.8a' of ↵Greg Kroah-Hartman17-16/+58
http://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into char-misc-linus Jonathan writes: IIO: 1st set of fixes for the 6.8 cycle Usual mixed bag of issues introduced this cycle and fixes for long term issues that have been identified recently + one case where I messed up a merge resolution and dropped the build file changes. Most important is the userspace ABI fix for the iio_modifier enum where we accidentally added new entries in the middle rather than at the end. IIO Core - Close a memory leak in an error path. - Move LIGHT_UVA and LIGHT_UVB definitions to end of the iio_modifier enum to avoid breaking older userspace. (not yet in a released kernel thankfully). adi,adis - Fix a DMA buffer alignment issue that was missing in series that fixed these across IIO. adi,ad-sigma-delta - Fix a DMA buffer alignment issue that was missing in series that fixed these across IIO. adi,ad4130 - Zero init remaining fields of clock init data. - Only set GPIO control bits on pins that aren't in use for anything else. adi,ad5933 - Fix an old bug due to type mismatch. This is a rare device so good to get some new test coverage. adi,ad7091r - Use right variable for an error return code. bosch,bma400 - Add missing CONFIG_REGMAP_I2C dependency. bosch,bmp280: - Add missing bmp085 ID to the SPI table to avoid mismatch with the of_device_id table. hid-sensors: - Avoid returning an error for timestamp read back that succeeds. pni,rm3100 - Check value read from RM31000_REG_TMRC register is valid before using it. Hardening to avoid a real world issue seen on some faulty hardware. st,st-sensors - Fix a DMA buffer alignment issue that was missing in series that fixed these across IIO. ti,hdc3020 - Add missing Kconfig and Makefile entrees accidentally dropped when patches were applied. - Fix wrong temperature offset (negated) * tag 'iio-fixes-for-6.8a' of http://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio: iio: adc: ad4130: only set GPIO_CTRL if pin is unused iio: adc: ad4130: zero-initialize clock init data iio: accel: bma400: Fix a compilation problem iio: commom: st_sensors: ensure proper DMA alignment iio: hid-sensor-als: Return 0 for HID_USAGE_SENSOR_TIME_TIMESTAMP iio: move LIGHT_UVA and LIGHT_UVB to the end of iio_modifier staging: iio: ad5933: fix type mismatch regression iio: humidity: hdc3020: fix temperature offset iio: adc: ad7091r8: Fix error code in ad7091r8_gpio_setup() iio: adc: ad_sigma_delta: ensure proper DMA alignment iio: imu: adis: ensure proper DMA alignment iio: humidity: hdc3020: Add Makefile, Kconfig and MAINTAINERS entry iio: imu: bno055: serdev requires REGMAP iio: magnetometer: rm3100: add boundary check for the value read from RM3100_REG_TMRC iio: pressure: bmp280: Add missing bmp085 to SPI id table iio: core: fix memleak in iio_device_register_sysfs
2024-02-14drm/tests/drm_buddy: add alloc_contiguous testMatthew Auld1-0/+89
Sanity check DRM_BUDDY_CONTIGUOUS_ALLOCATION. v2: Fix checkpatch warnings. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240214131853.5934-2-Arunpravin.PaneerSelvam@amd.com Signed-off-by: Christian König <christian.koenig@amd.com>
2024-02-14drm/buddy: Fix alloc_range() error handling codeArunpravin Paneer Selvam1-0/+6
Few users have observed display corruption when they boot the machine to KDE Plasma or playing games. We have root caused the problem that whenever alloc_range() couldn't find the required memory blocks the function was returning SUCCESS in some of the corner cases. The right approach would be if the total allocated size is less than the required size, the function should return -ENOSPC. Cc: <stable@vger.kernel.org> # 6.7+ Fixes: 0a1844bf0b53 ("drm/buddy: Improve contiguous memory allocation") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3097 Tested-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patchwork.kernel.org/project/dri-devel/patch/20240207174456.341121-1-Arunpravin.PaneerSelvam@amd.com/ Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240214131853.5934-1-Arunpravin.PaneerSelvam@amd.com Signed-off-by: Christian König <christian.koenig@amd.com>
2024-02-14powerpc/iommu: Fix the missing iommu_group_put() during platform domain attachShivaprasad G Bhat1-1/+3
The function spapr_tce_platform_iommu_attach_dev() is missing to call iommu_group_put() when the domain is already set. This refcount leak shows up with BUG_ON() during DLPAR remove operation as: KernelBug: Kernel bug in state 'None': kernel BUG at arch/powerpc/platforms/pseries/iommu.c:100! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries <snip> Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_016) hv:phyp pSeries NIP: c0000000000ff4d4 LR: c0000000000ff4cc CTR: 0000000000000000 REGS: c0000013aed5f840 TRAP: 0700 Tainted: G I (6.8.0-rc3-autotest-g99bd3cb0d12e) MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 44002402 XER: 20040000 CFAR: c000000000a0d170 IRQMASK: 0 ... NIP iommu_reconfig_notifier+0x94/0x200 LR iommu_reconfig_notifier+0x8c/0x200 Call Trace: iommu_reconfig_notifier+0x8c/0x200 (unreliable) notifier_call_chain+0xb8/0x19c blocking_notifier_call_chain+0x64/0x98 of_reconfig_notify+0x44/0xdc of_detach_node+0x78/0xb0 ofdt_write.part.0+0x86c/0xbb8 proc_reg_write+0xf4/0x150 vfs_write+0xf8/0x488 ksys_write+0x84/0x140 system_call_exception+0x138/0x330 system_call_vectored_common+0x15c/0x2ec The patch adds the missing iommu_group_put() call. Fixes: a8ca9fc9134c ("powerpc/iommu: Do not do platform domain attach atctions after probe") Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Closes: https://lore.kernel.org/all/274e0d2b-b5cc-475e-94e6-8427e88e271d@linux.vnet.ibm.com/ Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/170784021983.6249.10039296655906636112.stgit@linux.ibm.com
2024-02-14can: netlink: Fix TDCO calculation using the old data bittimingMaxime Jayat1-1/+1
The TDCO calculation was done using the currently applied data bittiming, instead of the newly computed data bittiming, which means that the TDCO had an invalid value unless setting the same data bittiming twice. Fixes: d99755f71a80 ("can: netlink: add interface for CAN-FD Transmitter Delay Compensation (TDC)") Signed-off-by: Maxime Jayat <maxime.jayat@mobile-devices.fr> Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Link: https://lore.kernel.org/all/40579c18-63c0-43a4-8d4c-f3a6c1c0b417@munic.io Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2024-02-14can: j1939: Fix UAF in j1939_sk_match_filter during setsockopt(SO_J1939_FILTER)Oleksij Rempel2-4/+19
Lock jsk->sk to prevent UAF when setsockopt(..., SO_J1939_FILTER, ...) modifies jsk->filters while receiving packets. Following trace was seen on affected system: ================================================================== BUG: KASAN: slab-use-after-free in j1939_sk_recv_match_one+0x1af/0x2d0 [can_j1939] Read of size 4 at addr ffff888012144014 by task j1939/350 CPU: 0 PID: 350 Comm: j1939 Tainted: G W OE 6.5.0-rc5 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: print_report+0xd3/0x620 ? kasan_complete_mode_report_info+0x7d/0x200 ? j1939_sk_recv_match_one+0x1af/0x2d0 [can_j1939] kasan_report+0xc2/0x100 ? j1939_sk_recv_match_one+0x1af/0x2d0 [can_j1939] __asan_load4+0x84/0xb0 j1939_sk_recv_match_one+0x1af/0x2d0 [can_j1939] j1939_sk_recv+0x20b/0x320 [can_j1939] ? __kasan_check_write+0x18/0x20 ? __pfx_j1939_sk_recv+0x10/0x10 [can_j1939] ? j1939_simple_recv+0x69/0x280 [can_j1939] ? j1939_ac_recv+0x5e/0x310 [can_j1939] j1939_can_recv+0x43f/0x580 [can_j1939] ? __pfx_j1939_can_recv+0x10/0x10 [can_j1939] ? raw_rcv+0x42/0x3c0 [can_raw] ? __pfx_j1939_can_recv+0x10/0x10 [can_j1939] can_rcv_filter+0x11f/0x350 [can] can_receive+0x12f/0x190 [can] ? __pfx_can_rcv+0x10/0x10 [can] can_rcv+0xdd/0x130 [can] ? __pfx_can_rcv+0x10/0x10 [can] __netif_receive_skb_one_core+0x13d/0x150 ? __pfx___netif_receive_skb_one_core+0x10/0x10 ? __kasan_check_write+0x18/0x20 ? _raw_spin_lock_irq+0x8c/0xe0 __netif_receive_skb+0x23/0xb0 process_backlog+0x107/0x260 __napi_poll+0x69/0x310 net_rx_action+0x2a1/0x580 ? __pfx_net_rx_action+0x10/0x10 ? __pfx__raw_spin_lock+0x10/0x10 ? handle_irq_event+0x7d/0xa0 __do_softirq+0xf3/0x3f8 do_softirq+0x53/0x80 </IRQ> <TASK> __local_bh_enable_ip+0x6e/0x70 netif_rx+0x16b/0x180 can_send+0x32b/0x520 [can] ? __pfx_can_send+0x10/0x10 [can] ? __check_object_size+0x299/0x410 raw_sendmsg+0x572/0x6d0 [can_raw] ? __pfx_raw_sendmsg+0x10/0x10 [can_raw] ? apparmor_socket_sendmsg+0x2f/0x40 ? __pfx_raw_sendmsg+0x10/0x10 [can_raw] sock_sendmsg+0xef/0x100 sock_write_iter+0x162/0x220 ? __pfx_sock_write_iter+0x10/0x10 ? __rtnl_unlock+0x47/0x80 ? security_file_permission+0x54/0x320 vfs_write+0x6ba/0x750 ? __pfx_vfs_write+0x10/0x10 ? __fget_light+0x1ca/0x1f0 ? __rcu_read_unlock+0x5b/0x280 ksys_write+0x143/0x170 ? __pfx_ksys_write+0x10/0x10 ? __kasan_check_read+0x15/0x20 ? fpregs_assert_state_consistent+0x62/0x70 __x64_sys_write+0x47/0x60 do_syscall_64+0x60/0x90 ? do_syscall_64+0x6d/0x90 ? irqentry_exit+0x3f/0x50 ? exc_page_fault+0x79/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Allocated by task 348: kasan_save_stack+0x2a/0x50 kasan_set_track+0x29/0x40 kasan_save_alloc_info+0x1f/0x30 __kasan_kmalloc+0xb5/0xc0 __kmalloc_node_track_caller+0x67/0x160 j1939_sk_setsockopt+0x284/0x450 [can_j1939] __sys_setsockopt+0x15c/0x2f0 __x64_sys_setsockopt+0x6b/0x80 do_syscall_64+0x60/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Freed by task 349: kasan_save_stack+0x2a/0x50 kasan_set_track+0x29/0x40 kasan_save_free_info+0x2f/0x50 __kasan_slab_free+0x12e/0x1c0 __kmem_cache_free+0x1b9/0x380 kfree+0x7a/0x120 j1939_sk_setsockopt+0x3b2/0x450 [can_j1939] __sys_setsockopt+0x15c/0x2f0 __x64_sys_setsockopt+0x6b/0x80 do_syscall_64+0x60/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fixes: 9d71dd0c70099 ("can: add support of SAE J1939 protocol") Reported-by: Sili Luo <rootlab@huawei.com> Suggested-by: Sili Luo <rootlab@huawei.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/all/20231020133814.383996-1-o.rempel@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2024-02-14can: j1939: prevent deadlock by changing j1939_socks_lock to rwlockZiqi Zhao3-14/+14
The following 3 locks would race against each other, causing the deadlock situation in the Syzbot bug report: - j1939_socks_lock - active_session_list_lock - sk_session_queue_lock A reasonable fix is to change j1939_socks_lock to an rwlock, since in the rare situations where a write lock is required for the linked list that j1939_socks_lock is protecting, the code does not attempt to acquire any more locks. This would break the circular lock dependency, where, for example, the current thread already locks j1939_socks_lock and attempts to acquire sk_session_queue_lock, and at the same time, another thread attempts to acquire j1939_socks_lock while holding sk_session_queue_lock. NOTE: This patch along does not fix the unregister_netdevice bug reported by Syzbot; instead, it solves a deadlock situation to prepare for one or more further patches to actually fix the Syzbot bug, which appears to be a reference counting problem within the j1939 codebase. Reported-by: <syzbot+1591462f226d9cbf0564@syzkaller.appspotmail.com> Signed-off-by: Ziqi Zhao <astrajoan@yahoo.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/all/20230721162226.8639-1-astrajoan@yahoo.com [mkl: remove unrelated newline change] Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2024-02-14ethernet: cpts: fix function pointer cast warningsArnd Bergmann1-5/+12
clang-16 warns about the mismatched prototypes for the devm_* callbacks: drivers/net/ethernet/ti/cpts.c:691:12: error: cast from 'void (*)(struct clk_hw *)' to 'void (*)(void *)' converts to incompatible function type [-Werror,-Wcast-function-type-strict] 691 | (void(*)(void *))clk_hw_unregister_mux, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ include/linux/device.h:406:34: note: expanded from macro 'devm_add_action_or_reset' 406 | __devm_add_action_or_reset(dev, action, data, #action) | ^~~~~~ drivers/net/ethernet/ti/cpts.c:703:12: error: cast from 'void (*)(struct device_node *)' to 'void (*)(void *)' converts to incompatible function type [-Werror,-Wcast-function-type-strict] 703 | (void(*)(void *))of_clk_del_provider, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ include/linux/device.h:406:34: note: expanded from macro 'devm_add_action_or_reset' 406 | __devm_add_action_or_reset(dev, action, data, #action) Use separate helper functions for this instead, using the expected prototypes with a void* argument. Fixes: a3047a81ba13 ("net: ethernet: ti: cpts: add support for ext rftclk selection") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14bnad: fix work_queue type mismatchArnd Bergmann1-7/+5
clang-16 warns about a function pointer cast: drivers/net/ethernet/brocade/bna/bnad.c:1995:4: error: cast from 'void (*)(struct delayed_work *)' to 'work_func_t' (aka 'void (*)(struct work_struct *)') converts to incompatible function type [-Werror,-Wcast-function-type-strict] 1995 | (work_func_t)bnad_tx_cleanup); drivers/net/ethernet/brocade/bna/bnad.c:2252:4: error: cast from 'void (*)(void *)' to 'work_func_t' (aka 'void (*)(struct work_struct *)') converts to incompatible function type [-Werror,-Wcast-function-type-strict] 2252 | (work_func_t)(bnad_rx_cleanup)); The problem here is mixing up work_struct and delayed_work, which relies the former being the first member of the latter. Change the code to use consistent types here to address the warning and make it more robust against workqueue interface changes. Side note: the use of a delayed workqueue for cleaning up TX descriptors is probably a bad idea since this introduces a noticeable delay. The driver currently does not appear to use BQL, but if one wanted to add that, this would have to be changed as well. Fixes: 01b54b145185 ("bna: tx rx cleanup fix") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14Merge patch series "can: tcan4x5x: support resume upon rx can frame"Marc Kleine-Budde6-6/+55
Martin Hundebøll <martin@geanix.com> says: This is the third iteration of the previous submitted patches [0] and [1]. This revision replaces the "wake_source" function parameters to a flag in the class device structure, and adds a patch to document the "wakeup-source" device tree property. Also, the previous revisions forgot to mention that the patches are based on Markus' coalescing patches [2]. Those implements caching of the enabled interrupts, which is handy when restoring the set of interrupts in the resume path. [0] https://lore.kernel.org/linux-can/20230912093807.1383720-1-martin@geanix.com [1] https://lore.kernel.org/linux-can/20230919122841.3803289-1-martin@geanix.com [2] https://lore.kernel.org/linux-can/20230929141304.3934380-1-msp@baylibre.com Link: https://lore.kernel.org/all/20231113131452.214961-1-martin@geanix.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2024-02-14can: tcan4x5x: support resuming from rx interrupt signalMartin Hundebøll1-2/+32
Implement the "wakeup-source" device tree property, so the chip is left running when suspending, and its rx interrupt is used as a wakeup source to resume operation. Signed-off-by: Martin Hundebøll <martin@geanix.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2024-02-14can: m_can: allow keeping the transceiver running in suspendMartin Hundebøll5-5/+21
Add a flag to the device class structure that leaves the chip in a running state with rx interrupt enabled, so that an m_can device driver can configure and use the interrupt as a wakeup source. Signed-off-by: Martin Hundebøll <martin@geanix.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2024-02-14dt-bindings: can: tcan4x5x: Document the wakeup-source flagMartin Hundebøll1-0/+3
Let it be known that the tcan4x5x device can now be configured to wake the host from suspend when a can frame is received. Signed-off-by: Martin Hundebøll <martin@geanix.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> [mkl: make first the first patch] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2024-02-14net: phy: dp83826: support TX data voltage tuningCatalin Popescu1-4/+126
DP83826 offers the possibility to tune the voltage of logical levels of the MLT-3 encoded TX data. This is useful when there is a voltage drop in between the PHY and the connector and we want to increase the voltage levels to compensate for that drop. Prior to PHY configuration, the driver SW resets the PHY which has the same effect as the HW reset pin according to the datasheet. Hence, there's no need to force update the VOD_CFG registers to make sure they hold their reset values. VOD_CFG registers need to be updated only if the DT has been configured with values other than the reset ones. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Catalin Popescu <catalin.popescu@leica-geosystems.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14dt-bindings: net: dp83826: support TX data voltage tuningCatalin Popescu1-0/+18
Add properties ti,cfg-dac-minus-one-bp/ti,cfg-dac-plus-one-bp to support voltage tuning of logical levels -1/+1 of the MLT-3 encoded TX data. Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Catalin Popescu <catalin.popescu@leica-geosystems.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14Merge branch 'dev_base_lock-remove'David S. Miller17-145/+112
Eric Dumazet says: ==================== net: complete dev_base_lock removal Back in 2009 we started an effort to get rid of dev_base_lock in favor of RCU. It is time to finish this work. Say goodbye to dev_base_lock ! v4: rebase, and move dev_addr_sem to net/core/dev.h in patch 06/13 (Jakub) v3: I misread kbot reports, the issue was with dev->operstate (patch 10/13) So dev->reg_state is back to u8, and dev->operstate becomes an u32. Sorry for the noise. v2: dev->reg_state must be a standard enum, some arches do not support cmpxchg() on u8. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net: remove dev_base_lockEric Dumazet2-37/+4
dev_base_lock is not needed anymore, all remaining users also hold RTNL. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net: remove dev_base_lock from register_netdevice() and friends.Eric Dumazet1-13/+7
RTNL already protects writes to dev->reg_state, we no longer need to hold dev_base_lock to protect the readers. unlist_netdevice() second argument can be removed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net: remove dev_base_lock from do_setlink()Eric Dumazet1-2/+0
We hold RTNL here, and dev->link_mode readers already are using READ_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net: add netdev_set_operstate() helperEric Dumazet5-31/+26
dev_base_lock is going away, add netdev_set_operstate() helper so that hsr does not have to know core internals. Remove dev_base_lock acquisition from rfc2863_policy() v3: use an "unsigned int" for dev->operstate, so that try_cmpxchg() can work on all arches. ( https://lore.kernel.org/oe-kbuild-all/202402081918.OLyGaea3-lkp@intel.com/ ) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net: remove stale mentions of dev_base_lock in commentsEric Dumazet6-8/+8
Change comments incorrectly mentioning dev_base_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net-sysfs: convert netstat_show() to RCUEric Dumazet1-3/+3
dev_get_stats() can be called from RCU, there is no need to acquire dev_base_lock. Change dev_isalive() comment to reflect we no longer use dev_base_lock from net/core/net-sysfs.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net-sysfs: convert dev->operstate reads to lockless onesEric Dumazet6-14/+13
operstate_show() can omit dev_base_lock acquisition only to read dev->operstate. Annotate accesses to dev->operstate. Writers still acquire dev_base_lock for mutual exclusion. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net-sysfs: use dev_addr_sem to remove races in address_show()Eric Dumazet3-4/+11
Using dev_base_lock is not preventing from reading garbage. Use dev_addr_sem instead. v4: place dev_addr_sem extern in net/core/dev.h (Jakub Kicinski) Link: https://lore.kernel.org/netdev/20240212175845.10f6680a@kernel.org/ Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net-sysfs: convert netdev_show() to RCUEric Dumazet1-7/+10
Make clear dev_isalive() can be called with RCU protection. Then convert netdev_show() to RCU, to remove dev_base_lock dependency. Also add RCU to broadcast_show(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net: convert dev->reg_state to u8Eric Dumazet2-13/+18
Prepares things so that dev->reg_state reads can be lockless, by adding WRITE_ONCE() on write side. READ_ONCE()/WRITE_ONCE() do not support bitfields. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14dev: annotate accesses to dev->linkEric Dumazet1-1/+1
Following patch will read dev->link locklessly, annotate the write from do_setlink(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14ip_tunnel: annotate data-races around t->parms.linkEric Dumazet1-14/+13
t->parms.link is read locklessly, annotate these reads and opposite writes accordingly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net: annotate data-races around dev->name_assign_typeEric Dumazet2-5/+5
name_assign_type_show() runs locklessly, we should annotate accesses to dev->name_assign_type. Alternative would be to grab devnet_rename_sem semaphore from name_assign_type_show(), but this would not bring more accuracy. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14Merge branch 'per-epoll-context-busy-poll'David S. Miller3-7/+138
Joe Damato says: ==================== Per epoll context busy poll support Greetings: Welcome to v8. TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to epoll with socket fds.") by allowing user applications to enable epoll-based busy polling, set a busy poll packet budget, and enable or disable prefer busy poll on a per epoll context basis. This makes epoll-based busy polling much more usable for user applications than the current system-wide sysctl and hardcoded budget. To allow for this, two ioctls have been added for epoll contexts for getting and setting a new struct, struct epoll_params. ioctl was chosen vs a new syscall after reviewing a suggestion by Willem de Bruijn [1]. I am open to using a new syscall instead of an ioctl, but it seemed that: - Busy poll affects all existing epoll_wait and epoll_pwait variants in the same way, so new verions of many syscalls might be needed. It seems much simpler for users to use the correct epoll_wait/epoll_pwait for their app and add a call to ioctl to enable or disable busy poll as needed. This also probably means less work to get an existing epoll app using busy poll. - previously added epoll_pwait2 helped to bring epoll closer to existing syscalls (like pselect and ppoll) and this busy poll change reflected as a new syscall would not have the same effect. Note: patch 1/4 as of v4 uses an or (||) instead of an xor. I thought about it some more and I realized that if the user enables both the per-epoll context setting and the system wide sysctl, then busy poll should be enabled and not disabled. Using xor doesn't seem to make much sense after thinking through this a bit. Longer explanation: Presently epoll has support for a very useful form of busy poll based on the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [2]). This form of busy poll allows epoll_wait to drive NAPI packet processing which allows for a few interesting user application designs which can reduce latency and also potentially improve L2/L3 cache hit rates by deferring NAPI until userland has finished its work. The documentation available on this is, IMHO, a bit confusing so please allow me to explain how one might use this: 1. Ensure each application thread has its own epoll instance mapping 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to direct connections with specific dest ports to these queues. 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy polling will occur. This can help avoid the userland app from being pre-empted by a hard IRQ while userland is running. Note this means that userland must take care to call epoll_wait and not take too long in userland since it now drives NAPI via epoll_wait. 3. Optionally: Consider using napi_defer_hard_irqs and gro_flush_timeout to further restrict IRQ generation from the NIC. These settings are system-wide so their impact must be carefully weighed against the running applications. 4. Ensure that all incoming connections added to an epoll instance have the same NAPI ID. This can be done with a BPF filter when SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single accept thread is used which dispatches incoming connections to threads. 5. Lastly, busy poll must be enabled via a sysctl (/proc/sys/net/core/busy_poll). Please see Eric Dumazet's paper about busy polling [3] and a recent academic paper about measured performance improvements of busy polling [4] (albeit with a modification that is not currently present in the kernel) for additional context. The unfortunate part about step 5 above is that this enables busy poll system-wide which affects all user applications on the system, including epoll-based network applications which were not intended to be used this way or applications where increased CPU usage for lower latency network processing is unnecessary or not desirable. If the user wants to run one low latency epoll-based server application with epoll-based busy poll, but would like to run the rest of the applications on the system (which may also use epoll) without busy poll, this system-wide sysctl presents a significant problem. This change preserves the system-wide sysctl, but adds a mechanism (via ioctl) to enable or disable busy poll for epoll contexts as needed by individual applications, making epoll-based busy poll more usable. Note that this change includes an or (as of v4) instead of an xor. If the user has enabled both the system-wide sysctl and also the per epoll-context busy poll settings, then epoll should probably busy poll (vs being disabled). Thanks, Joe v7 -> v8: - Reviewed-by tag from Eric Dumazet applied to commit message of patch 1/4. - patch 4/4: - EPIOCSPARAMS and EPIOCGPARAMS updated to use WRITE_ONCE and READ_ONCE, as requested by Eric Dumazet - Wrapped a long line (via netdev/checkpatch) v6 -> v7: - Acked-by tags from Stanislav Fomichev applied to commit messages of all patches. - Reviewed-by tags from Jakub Kicinski, Eric Dumazet applied to commit messages of patches 2 and 3. Jiri Slaby's Reviewed-by applied to patch 4. - patch 1/4: - busy_poll_usecs reduced from u64 to u32. - Unnecessary parens removed (via netdev/checkpatch) - Wrapped long line (via netdev/checkpatch) - Remove inline from busy_loop_ep_timeout as objdump suggests the function is already inlined - Moved struct eventpoll assignment to declaration - busy_loop_ep_timeout is moved within CONFIG_NET_RX_BUSY_POLL and the ifdefs internally have been removed as per Eric Dumazet's review - Removed ep_busy_loop_on from the !defined CONFIG_NET_RX_BUSY_POLL section as it is only called when CONFIG_NET_RX_BUSY_POLL is defined - patch 3/4: - Fix whitespace alignment issue (via netdev/checkpatch) - patch 4/4: - epoll_params.busy_poll_usecs has been reduced to u32 - epoll_params.busy_poll_usecs is now checked to ensure it is <= S32_MAX - __pad has been reduced to a single u8 - memchr_inv has been dropped and replaced with a simple check for the single __pad byte - Removed space after cast (via netdev/checkpatch) - Wrap long line (via netdev/checkpatch) - Move struct eventpoll *ep assignment to declaration as per Jiri Slaby's review - Remove unnecessary !! as per Jiri Slaby's review - Reorganized variables to be reverse christmas tree order v5 -> v6: - patch 1/3 no functional change, but commit message corrected to explain that an or (||) is being used instead of xor. - patch 3/4 is a new patch which adds support for per epoll context prefer busy poll setting. - patch 4/4 updated to allow getting/setting per epoll context prefer busy poll setting; this setting is limited to either 0 or 1. v4 -> v5: - patch 3/3 updated to use memchr_inv to ensure that __pad is zero for the EPIOCSPARAMS ioctl. Recommended by Greg K-H [5], Dave Chinner [6], and Jiri Slaby [7]. v3 -> v4: - patch 1/3 was updated to include an important functional change: ep_busy_loop_on was updated to use or (||) instead of xor (^). After thinking about it a bit more, I thought xor didn't make much sense. Enabling both the per-epoll context and the system-wide sysctl should probably enable busy poll, not disable it. So, or (||) makes more sense, I think. - patch 3/3 was updated: - to change the epoll_params fields to be __u64, __u16, and __u8 and to pad the struct to a multiple of 64bits. Suggested by Greg K-H [8] and Arnd Bergmann [9]. - remove an unused pr_fmt, left over from the previous revision. - ioctl now returns -EINVAL when epoll_params.busy_poll_usecs > U32_MAX. v2 -> v3: - cover letter updated to mention why ioctl seems (to me) like a better choice vs a new syscall. - patch 3/4 was modified in 3 ways: - when an unknown ioctl is received, -ENOIOCTLCMD is returned instead of -EINVAL as the ioctl documentation requires. - epoll_params.busy_poll_budget can only be set to a value larger than NAPI_POLL_WEIGHT if code is run by privileged (CAP_NET_ADMIN) users. Otherwise, -EPERM is returned. - busy poll specific ioctl code moved out to its own function. On kernels without busy poll support, -EOPNOTSUPP is returned. This also makes the kernel build robot happier without littering the code with more #ifdefs. - dropped patch 4/4 after Eric Dumazet's review of it when it was sent independently to the list [10]. v1 -> v2: - cover letter updated to make a mention of napi_defer_hard_irqs and gro_flush_timeout as an added step 3 and to cite both Eric Dumazet's busy polling paper and a paper from University of Waterloo for additional context. Specifically calling out the xor in patch 1/4 incase it is missed by reviewers. - Patch 2/4 has its commit message updated, but no functional changes. Commit message now describes that allowing for a settable budget helps to improve throughput and is more consistent with other busy poll mechanisms that allow a settable budget via SO_BUSY_POLL_BUDGET. - Patch 3/4 was modified to check if the epoll_params.busy_poll_budget exceeds NAPI_POLL_WEIGHT. The larger value is allowed, but an error is printed. This was done for consistency with netif_napi_add_weight, which does the same. - Patch 3/4 the struct epoll_params was updated to fix the type of the data field; it was uint8_t and was changed to u8. - Patch 4/4 added to check if SO_BUSY_POLL_BUDGET exceeds NAPI_POLL_WEIGHT. The larger value is allowed, but an error is printed. This was done for consistency with netif_napi_add_weight, which does the same. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>