path: root/include/linux
Age | Commit message | Author | Files/Lines
2023-07-19 | usb: common: usb-conn-gpio: Set last role to unknown before initial detection | Prashanth K | 1 file, -0/+1

    [ Upstream commit edd60d24bd858cef165274e4cd6cab43bdc58d15 ]

    Currently, if we boot up a device without a cable connected,
    usb-conn-gpio won't call set_role() since last_role is the same as the
    current role. This happens because last_role gets initialised to zero
    during probe. To avoid this, add a new constant to enum usb_role;
    last_role is set to USB_ROLE_UNKNOWN before performing the initial
    detection.

    While at it, also handle the default case for the usb_role switch in
    cdns3, intel-xhci-usb-role-switch and musb/jz4740 to avoid build
    warnings.

    Fixes: 4602f3bff266 ("usb: common: add USB GPIO based connection detection driver")
    Signed-off-by: Prashanth K <quic_prashk@quicinc.com>
    Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
    Message-ID: <1685544074-17337-1-git-send-email-quic_prashk@quicinc.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
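
    A minimal sketch of the idea, not the exact upstream diff (the
    placement and value of the new constant in the enum may differ):

        enum usb_role {
                USB_ROLE_NONE,
                USB_ROLE_HOST,
                USB_ROLE_DEVICE,
                USB_ROLE_UNKNOWN,   /* sentinel: never reported by detection */
        };

        /* in the driver's probe(), before the initial detection: */
        info->last_role = USB_ROLE_UNKNOWN;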
2023-07-19 | sh: Avoid using IRQ0 on SH3 and SH4 | Sergey Shtylyov | 1 file, -3/+3

    [ Upstream commit a8ac2961148e8c720dc760f2e06627cd5c55a154 ]

    IRQ0 is no longer returned by platform_get_irq() and its ilk -- they
    now return -EINVAL instead. However, the kernel code supporting
    SH3/4-based SoCs still maps the IRQ #s starting at 0 -- modify that
    code to start the IRQ #s from 16 instead.

    The patch should mostly affect the AP-SH4A-3A/AP-SH4AD-0A boards, as
    they indeed are using IRQ0 for the SMSC911x compatible Ethernet chip.

    Fixes: ce753ad1549c ("platform: finally disallow IRQ0 in platform_get_irq() and its ilk")
    Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
    Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Link: https://lore.kernel.org/r/71105dbf-cdb0-72e1-f9eb-eeda8e321696@omp.ru
    Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19 | PCI: Add pci_clear_master() stub for non-CONFIG_PCI | Sui Jingfeng | 1 file, -0/+1

    [ Upstream commit 2aa5ac633259843f656eb6ecff4cf01e8e810c5e ]

    Add a pci_clear_master() stub when CONFIG_PCI is not set so drivers
    that support both PCI and platform devices don't need #ifdefs or extra
    Kconfig symbols for the PCI parts.

    [bhelgaas: commit log]
    Fixes: 6a479079c072 ("PCI: Add pci_clear_master() as opposite of pci_set_master()")
    Link: https://lore.kernel.org/r/20230531102744.2354313-1-suijingfeng@loongson.cn
    Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
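
    The stub follows the usual pattern for PCI helpers in <linux/pci.h>; a
    sketch of what such a no-op looks like:

        #ifdef CONFIG_PCI
        void pci_clear_master(struct pci_dev *dev);
        #else
        /* No-op when PCI support is compiled out. */
        static inline void pci_clear_master(struct pci_dev *dev) { }
        #endif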
2023-07-19 | bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page | Liu Shixin | 1 file, -0/+2

    commit 028725e73375a1ff080bbdf9fb503306d0116f28 upstream.

    commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak
    in put_page_bootmem") fixed an existing overlap problem in kmemleak.
    But the problem still exists when HAVE_BOOTMEM_INFO_NODE is disabled,
    because in this case free_bootmem_page() will call free_reserved_page()
    directly.

    Fix the problem by adding kmemleak_free_part() in free_bootmem_page()
    when HAVE_BOOTMEM_INFO_NODE is disabled.

    Link: https://lkml.kernel.org/r/20230704101942.2819426-1-liushixin2@huawei.com
    Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-19 | can: length: fix bitstuffing count | Vincent Mailhol | 1 file, -6/+8

    [ Upstream commit 9fde4c557f78ee2f3626e92b4089ce9d54a2573a ]

    The Stuff Bit Count is always coded on 4 bits [1]. Update the Stuff Bit
    Count size accordingly.

    In addition, the CRC fields of CAN FD frames contain stuff bits at
    fixed positions called fixed stuff bits [2]. The CRC field starts with
    a fixed stuff bit and then has another fixed stuff bit after each
    fourth bit [2], which allows us to derive this formula:

        FSB count = 1 + round_down(len(CRC field)/4)

    The length of the CRC field is [1]:

        len(CRC field) = len(Stuff Bit Count) + len(CRC)
                       = 4 + len(CRC)

    with len(CRC) either 17 or 21 bits depending on the payload length.

    In conclusion, for CRC17:

        FSB count = 1 + round_down((4 + 17)/4) = 6

    and for CRC21:

        FSB count = 1 + round_down((4 + 21)/4) = 7

    Add a fixed stuff bits (FSB) field with the above values and update
    CANFD_FRAME_OVERHEAD_SFF and CANFD_FRAME_OVERHEAD_EFF accordingly.

    [1] ISO 11898-1:2015 section 10.4.2.6 "CRC field":

        The CRC field shall contain the CRC sequence followed by a
        recessive CRC delimiter. For FD Frames, the CRC field shall also
        contain the stuff count.

        Stuff count

        If FD Frames, the stuff count shall be at the beginning of the CRC
        field. It shall consist of the stuff bit count modulo 8 in a 3-bit
        gray code followed by a parity bit [...]

    [2] ISO 11898-1:2015 paragraph 10.5 "Frame coding":

        In the CRC field of FD Frames, the stuff bits shall be inserted at
        fixed positions; they are called fixed stuff bits. There shall be
        a fixed stuff bit before the first bit of the stuff count, even if
        the last bits of the preceding field are a sequence of five
        consecutive bits of identical value, there shall be only the fixed
        stuff bit, there shall not be two consecutive stuff bits. A
        further fixed stuff bit shall be inserted after each fourth bit of
        the CRC field [...]

    Fixes: 85d99c3e2a13 ("can: length: can_skb_get_frame_len(): introduce function to get data length of frame in data link layer")
    Suggested-by: Thomas Kopp <Thomas.Kopp@microchip.com>
    Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
    Reviewed-by: Thomas Kopp <Thomas.Kopp@microchip.com>
    Link: https://lore.kernel.org/all/20230611025728.450837-2-mailhol.vincent@wanadoo.fr
    Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
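
    The arithmetic above can be captured in a hypothetical helper macro
    (the real header encodes the results directly in its overhead
    constants); C integer division already performs the round_down:

        #define CANFD_FSB_COUNT(crc_len)  (1 + (4 + (crc_len)) / 4)

        /* CANFD_FSB_COUNT(17) == 6, CANFD_FSB_COUNT(21) == 7 */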
2023-07-19 | bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings | Gilad Sever | 1 file, -0/+9

    [ Upstream commit 9a5cb79762e0eda17ca15c2a6eaca4622383c21c ]

    When calling bpf_sk_lookup_tcp(), bpf_sk_lookup_udp() or
    bpf_skc_lookup_tcp() from tc/xdp ingress, VRF socket bindings aren't
    respected, i.e. unbound sockets are returned, and bound sockets aren't
    found.

    VRF binding is determined by the sdif argument to sk_lookup(); however,
    when called from tc the IP SKB control block isn't initialized and
    thus inet{,6}_sdif() always returns 0.

    Fix by calculating sdif for the tc/xdp flows by observing the device's
    l3 enslaved state. The cg/sk_skb hooking points, which are expected to
    support inet{,6}_sdif(), pass sdif=-1, which makes __bpf_skc_lookup()
    use the existing logic.

    Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
    Signed-off-by: Gilad Sever <gilad9366@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
    Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Cc: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/bpf/20230621104211.301902-4-gilad9366@gmail.com
    Signed-off-by: Sasha Levin <sashal@kernel.org>
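
    A sketch of how such an sdif calculation could look, assuming the
    netif_is_l3_slave() helper (the exact upstream helper name and
    placement may differ):

        static int dev_sdif(const struct net_device *dev)
        {
        #ifdef CONFIG_NET_L3_MASTER_DEV
                /* an L3-enslaved device implies a VRF: use its ifindex */
                if (netif_is_l3_slave(dev))
                        return dev->ifindex;
        #endif
                return 0;
        }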
2023-07-19 | mmc: Add MMC_QUIRK_BROKEN_SD_CACHE for Kingston Canvas Go Plus from 11/2019 | Marek Vasut | 1 file, -0/+1

    [ Upstream commit c467c8f081859d4f4ca4eee4fba54bb5d85d6c97 ]

    This microSD card never clears the Flush Cache bit after a cache flush
    has been started in sd_flush_cache(). This leads e.g. to failure to
    mount the file system. Add a quirk which disables the SD cache for
    this specific card from the specific manufacturing date of 11/2019,
    since on newer cards dated 05/2023 the cache flush works correctly.

    Fixes: 08ebf903af57 ("mmc: core: Fixup support for writeback-cache for eMMC and SD")
    Signed-off-by: Marek Vasut <marex@denx.de>
    Link: https://lore.kernel.org/r/20230620102713.7701-1-marex@denx.de
    Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19 | wifi: ieee80211: Fix the common size calculation for reconfiguration ML | Ilan Peer | 1 file, -4/+1

    [ Upstream commit ce6e1f600b0cfc563a7d607de702262a58cd835d ]

    The common information length is found in the first octet of the
    common information.

    Fixes: 0f48b8b88aa9 ("wifi: ieee80211: add definitions for multi-link element")
    Signed-off-by: Ilan Peer <ilan.peer@intel.com>
    Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
    Link: https://lore.kernel.org/r/20230618214435.3c7ed4817338.I42ef706cb827b4dade6e4ffbb6e7f341eaccd398@changeid
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19 | wifi: cfg80211/mac80211: Fix ML element common size calculation | Ilan Peer | 1 file, -6/+5

    [ Upstream commit 1403b109c9a5244dc6ab79154f04eecc209ef3d2 ]

    The common size is part of the length in the data, so don't add it
    again.

    Signed-off-by: Ilan Peer <ilan.peer@intel.com>
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    Stable-dep-of: ce6e1f600b0c ("wifi: ieee80211: Fix the common size calculation for reconfiguration ML")
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19 | watchdog/perf: define dummy watchdog_update_hrtimer_threshold() on correct config | Douglas Anderson | 1 file, -1/+1

    [ Upstream commit 5e008df11c55228a86a1bae692cc2002503572c9 ]

    Patch series "watchdog/hardlockup: Add the buddy hardlockup detector", v5.

    This patch series adds the "buddy" hardlockup detector. In brief, the
    buddy hardlockup detector can detect hardlockups without arch-level
    support by having CPUs check up on a "buddy" CPU periodically.

    Given the new design of this patch series, testing all combinations is
    fairly difficult. I've attempted to make sure that all combinations of
    CONFIG_ options are good, but it wouldn't surprise me if I missed
    something. I apologize in advance and I'll do my best to fix any
    problems that are found.

    This patch (of 18):

    The real watchdog_update_hrtimer_threshold() is defined in
    kernel/watchdog_hld.c. That file is included if
    CONFIG_HARDLOCKUP_DETECTOR_PERF and the function is defined in that
    file if CONFIG_HARDLOCKUP_CHECK_TIMESTAMP.

    The dummy version of the function in "nmi.h" didn't get that quite
    right. While this doesn't appear to be a huge deal, it's nice to make
    it consistent. It doesn't break builds because CHECK_TIMESTAMP is only
    defined by x86, so others don't get a double definition, and x86 uses
    the perf lockup detector, so it gets the out-of-line version.

    Link: https://lkml.kernel.org/r/20230519101840.v5.18.Ia44852044cdcb074f387e80df6b45e892965d4a1@changeid
    Link: https://lkml.kernel.org/r/20230519101840.v5.1.I8cbb2f4fa740528fcfade4f5439b6cdcdd059251@changeid
    Fixes: 7edaeb6841df ("kernel/watchdog: Prevent false positives with turbo modes")
    Signed-off-by: Douglas Anderson <dianders@chromium.org>
    Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Chen-Yu Tsai <wens@csie.org>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Daniel Thompson <daniel.thompson@linaro.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Guenter Roeck <groeck@chromium.org>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
    Cc: Marc Zyngier <maz@kernel.org>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
    Cc: Matthias Kaehlcke <mka@chromium.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Pingfan Liu <kernelfans@gmail.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
    Cc: Ricardo Neri <ricardo.neri@intel.com>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Stephen Boyd <swboyd@chromium.org>
    Cc: Sumit Garg <sumit.garg@linaro.org>
    Cc: Tzung-Bi Shih <tzungbi@chromium.org>
    Cc: Will Deacon <will@kernel.org>
    Cc: Colin Cross <ccross@android.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
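
    A sketch of the corrected guard for the dummy, matching the conditions
    under which the real function is built (assuming the signature from
    kernel/watchdog_hld.c):

        #if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF) && \
            defined(CONFIG_HARDLOCKUP_CHECK_TIMESTAMP)
        void watchdog_update_hrtimer_threshold(u64 period);
        #else
        static inline void watchdog_update_hrtimer_threshold(u64 period) { }
        #endif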
2023-07-19 | bpf: Remove bpf trampoline selector | Yafang Shao | 1 file, -1/+0

    [ Upstream commit 47e79cbeea4b3891ad476047f4c68543eb51c8e0 ]

    After commit e21aa341785c ("bpf: Fix fexit trampoline."), the selector
    only indicates how many times the bpf trampoline image has been
    updated, and it is displayed in the trampoline ksym name. After the
    trampoline is freed, the selector will start from 0 again. So the
    selector is a useless value to the user. We can remove it.

    If the user wants to check whether the bpf trampoline image has been
    updated or not, the user can compare the address. Each time the
    trampoline image is updated, the address will change consequently.

    Jiri also pointed out another issue: perf is still using the old name
    "bpf_trampoline_%lu", so this change can fix the issue in perf.

    Fixes: e21aa341785c ("bpf: Fix fexit trampoline.")
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Song Liu <song@kernel.org>
    Cc: Jiri Olsa <olsajiri@gmail.com>
    Link: https://lore.kernel.org/bpf/ZFvOOlrmHiY9AgXE@krava
    Link: https://lore.kernel.org/bpf/20230515130849.57502-3-laoar.shao@gmail.com
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19 | blk-mq: fix potential io hang by wrong 'wake_batch' | Yu Kuai | 1 file, -2/+1

    [ Upstream commit 4f1731df60f9033669f024d06ae26a6301260b55 ]

    In __blk_mq_tag_busy/idle(), updating 'active_queues' and calculating
    'wake_batch' is not atomic:

        t1:                             t2:
        _blk_mq_tag_busy                blk_mq_tag_busy
        inc active_queues
        // assume 1->2
                                        inc active_queues
                                        // 2 -> 3
                                        blk_mq_update_wake_batch
                                        // calculate based on 3
        blk_mq_update_wake_batch
        /* calculate based on 2, while active_queues is actually 3. */

    Fix this problem by protecting them with 'tags->lock'; this is not a
    hot path, so performance should not be a concern. And now that all
    writers are inside the lock, switch 'active_queues' from atomic to
    unsigned int.

    Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20230610023043.2559121-1-yukuai1@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
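
    A sketch of the locked update on the busy side (field and helper names
    follow the message above; details may differ from the actual patch):

        unsigned int users;

        spin_lock_irq(&tags->lock);
        users = tags->active_queues + 1;
        WRITE_ONCE(tags->active_queues, users);
        blk_mq_update_wake_batch(tags, users);  /* sees a consistent count */
        spin_unlock_irq(&tags->lock);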
2023-07-19 | block: Fix the type of the second bdev_op_is_zoned_write() argument | Bart Van Assche | 1 file, -1/+1

    [ Upstream commit 3ddbe2a7e0d4a155a805f69c906c9beed30d4cc4 ]

    Change the type of the second argument of bdev_op_is_zoned_write()
    from blk_opf_t into enum req_op, because this function expects an
    operation without flags as its second argument.

    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Cc: Ming Lei <ming.lei@redhat.com>
    Fixes: 8cafdb5ab94c ("block: adapt blk_mq_plug() to not plug for writes that require a zone lock")
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20230517174230.897144-4-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19 | fs: pipe: reveal missing function prototypes | Arnd Bergmann | 1 file, -4/+0

    [ Upstream commit 247c8d2f9837a3e29e3b6b7a4aa9c36c37659dd4 ]

    A couple of functions from fs/pipe.c are used both internally and for
    the watch queue code, but the declarations are only visible when the
    latter is enabled:

        fs/pipe.c:1254:5: error: no previous prototype for 'pipe_resize_ring'
        fs/pipe.c:758:15: error: no previous prototype for 'account_pipe_buffers'
        fs/pipe.c:764:6: error: no previous prototype for 'too_many_pipe_buffers_soft'
        fs/pipe.c:771:6: error: no previous prototype for 'too_many_pipe_buffers_hard'
        fs/pipe.c:777:6: error: no previous prototype for 'pipe_is_unprivileged_user'

    Make them visible unconditionally to avoid these warnings.

    Fixes: c73be61cede5 ("pipe: Add general notification queue support")
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Message-Id: <20230516195629.551602-1-arnd@kernel.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
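
    In other words, the prototypes move out of the #ifdef
    CONFIG_WATCH_QUEUE block in <linux/pipe_fs_i.h>. A sketch with
    signatures inferred from the fs/pipe.c definitions (treat them as
    assumptions, not the verbatim header):

        int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots);
        unsigned long account_pipe_buffers(struct user_struct *user,
                                           unsigned long old, unsigned long new);
        bool too_many_pipe_buffers_soft(unsigned long user_bufs);
        bool too_many_pipe_buffers_hard(unsigned long user_bufs);
        bool pipe_is_unprivileged_user(void);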
2023-07-05 | execve: always mark stack as growing down during early stack setup | Linus Torvalds | 1 file, -1/+3

    commit f66066bc5136f25e36a2daff4896c768f18c211e upstream.

    While our user stacks can grow either down (all common architectures)
    or up (parisc and the ia64 register stack), the initial stack setup,
    when we copy the argument and environment strings to the new stack at
    execve() time, is always done by extending the stack downwards.

    But it turns out that in commit 8d7071af8907 ("mm: always expand the
    stack with the mmap write lock held"), as part of making the stack
    growing code more robust, 'expand_downwards()' was now made to
    actually check the vma flags:

        if (!(vma->vm_flags & VM_GROWSDOWN))
                return -EFAULT;

    and that meant that this execve-time stack expansion started failing
    on parisc, because on that architecture, the stack flags do not
    contain the VM_GROWSDOWN bit.

    At the same time the new check in expand_downwards() is clearly
    correct, and simplified the callers, so let's not remove it.

    The solution is instead to just codify the fact that yes, during
    execve(), the stack grows down. This not only matches reality, it ends
    up being particularly simple: we already have special execve-time
    flags for the stack (VM_STACK_INCOMPLETE_SETUP) and use those flags to
    avoid page migration during this setup time (see
    vma_is_temporary_stack() and invalid_migration_vma()).

    So just add VM_GROWSDOWN to that set of temporary flags, and now our
    stack flags automatically match reality, and the parisc stack
    expansion works again.

    Note that the VM_STACK_INCOMPLETE_SETUP bits will be cleared when the
    stack is finalized, so we only add the extra VM_GROWSDOWN bit on
    CONFIG_STACK_GROWSUP architectures (i.e. parisc) rather than adding it
    in general.

    Link: https://lore.kernel.org/all/612eaa53-6904-6e16-67fc-394f4faa0e16@bell.net/
    Link: https://lore.kernel.org/all/5fd98a09-4792-1433-752d-029ae3545168@gmx.de/
    Fixes: 8d7071af8907 ("mm: always expand the stack with the mmap write lock held")
    Reported-by: John David Anglin <dave.anglin@bell.net>
    Reported-and-tested-by: Helge Deller <deller@gmx.de>
    Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
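
    A sketch of how the flag set could look in <linux/mm.h> (the upstream
    patch may structure the conditionals differently):

        #ifdef CONFIG_STACK_GROWSUP
        #define VM_STACK_EARLY  VM_GROWSDOWN    /* execve copies strings downwards */
        #else
        #define VM_STACK_EARLY  0
        #endif

        #define VM_STACK_INCOMPLETE_SETUP \
                (VM_RAND_READ | VM_RAND_WRITE | VM_STACK_EARLY)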
2023-07-01 | xtensa: fix NOMMU build with lock_mm_and_find_vma() conversion | Linus Torvalds | 1 file, -2/+3

    commit d85a143b69abb4d7544227e26d12c4c7735ab27d upstream.

    It turns out that xtensa has a really odd configuration situation: you
    can do a no-MMU config, but still have the page fault code enabled.
    Which doesn't sound all that sensible, but it turns out that xtensa
    can have protection faults even without the MMU, and we have this:

        config PFAULT
                bool "Handle protection faults" if EXPERT && !MMU
                default y
                help
                  Handle protection faults. MMU configurations must enable it.
                  noMMU configurations may disable it if used memory map never
                  generates protection faults or faults are always fatal.

                  If unsure, say Y.

    which completely violated my expectations of the page fault handling.

    End result: Guenter reports that the xtensa no-MMU builds all fail
    with

        arch/xtensa/mm/fault.c: In function ‘do_page_fault’:
        arch/xtensa/mm/fault.c:133:8: error: implicit declaration of function ‘lock_mm_and_find_vma’

    because I never exposed the new lock_mm_and_find_vma() function for
    the no-MMU case. Doing so is simple enough, and fixes the problem.

    Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
    Fixes: a050ba1e7422 ("mm/fault: convert remaining simple cases to lock_mm_and_find_vma()")
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-01 | mm: always expand the stack with the mmap write lock held | Linus Torvalds | 1 file, -12/+4

    commit 8d7071af890768438c14db6172cc8f9f4d04e184 upstream

    This finishes the job of always holding the mmap write lock when
    extending the user stack vma, and removes the 'write_locked' argument
    from the vm helper functions again.

    For some cases, we just avoid expanding the stack at all: drivers and
    page pinning really shouldn't be extending any stacks. Let's see if
    any strange users really wanted that.

    It's worth noting that architectures that weren't converted to the new
    lock_mm_and_find_vma() helper function are left using the legacy
    "expand_stack()" function, but it has been changed to drop the
    mmap_lock and take it for writing while expanding the vma. This makes
    it fairly straightforward to convert the remaining architectures.

    As a result of dropping and re-taking the lock, the calling
    conventions for this function have also changed, since the old vma may
    no longer be valid. So it will now return the new vma if successful,
    and NULL - and the lock dropped - if the area could not be extended.

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    [6.1: Patch drivers/iommu/io-pgfault.c instead]
    Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
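
    A sketch of the new calling convention from a caller's point of view
    (error handling here is illustrative):

        struct vm_area_struct *vma;

        vma = expand_stack(mm, addr);   /* may drop mmap_lock and retake it */
        if (!vma)
                return -EFAULT;         /* failed: the lock was dropped */
        /* success: 'vma' is the (possibly re-looked-up) stack vma */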
2023-07-01 | mm: make find_extend_vma() fail if write lock not held | Liam R. Howlett | 1 file, -3/+7

    commit f440fa1ac955e2898893f9301568435eb5cdfc4b upstream.

    Make calls to extend_vma() and find_extend_vma() fail if the write
    lock is required.

    To avoid making this a flag-day event, this still allows the old
    read-locking case for the trivial situations, and passes in a flag to
    say "is it write-locked". That way write-lockers can say "yes, I'm
    being careful", and legacy users will continue to work in all the
    common cases until they have been fully converted to the new world
    order.

    Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-01 | mm: introduce new 'lock_mm_and_find_vma()' page fault helper | Linus Torvalds | 1 file, -0/+2

    commit c2508ec5a58db67093f4fb8bf89a9a7c53a109e9 upstream.

    .. and make x86 use it.

    This basically extracts the existing x86 "find and expand faulting
    vma" code, but extends it to also take the mmap lock for writing in
    case we actually do need to expand the vma.

    We've historically short-circuited that case, and have some rather
    ugly special logic to serialize the stack segment expansion (since we
    only hold the mmap lock for reading) that doesn't match the normal VM
    locking.

    That slight violation of locking worked well, right up until it
    didn't: the maple tree code really does want proper locking even for
    simple extension of an existing vma.

    So extract the code for "look up the vma of the fault" from x86, fix
    it up to do the necessary write locking, and make it available as a
    helper function for other architectures that can use the common
    helper.

    Note: I say "common helper", but it really only handles the normal
    stack-grows-down case. Which is all architectures except for PA-RISC
    and IA64. So some rare architectures can't use the helper, but if they
    care they'll just need to open-code this logic.

    It's also worth pointing out that this code really would like to have
    an optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
    read-lock (for the common case) to taking the write lock (for having
    to extend the vma) in the normal single-threaded situation where there
    is no other locking activity.

    But that _is_ all the very uncommon special case, so while it would be
    nice to have such an operation, it probably doesn't matter in reality.
    I did put in the skeleton code for such a possible future expansion,
    even if it only acts as pseudo-documentation for what we're doing.

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    [6.1: Ignore CONFIG_PER_VMA_LOCK context]
    Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-01 | mm, hwpoison: when copy-on-write hits poison, take page offline | Tony Luck | 1 file, -1/+4

    commit d302c2398ba269e788a4f37ae57c07a7fcabaa42 upstream.

    We cannot call memory_failure() directly from the fault handler
    because mmap_lock (and others) are held.

    It is important, but not urgent, to mark the source page as h/w
    poisoned and unmap it from other tasks.

    Use memory_failure_queue() to request a call to memory_failure() for
    the page with the error.

    Also provide a stub version for CONFIG_MEMORY_FAILURE=n.

    Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com
    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Shuai Xue <xueshuai@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    [ Due to missing commits
        e591ef7d96d6e ("mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage")
        5033091de814a ("mm/hwpoison: introduce per-memory_block hwpoison counter")
      The impact of e591ef7d96d6e is its introduction of an additional
      flag in __get_huge_page_for_hwpoison() that serves as an indication
      that a hwpoisoned hugetlb page should have its migratable bit
      cleared. The impact of 5033091de814a is contextual. Resolve by
      ignoring both missing commits. - jane]
    Signed-off-by: Jane Chu <jane.chu@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-01 | mm, hwpoison: try to recover from copy-on write faults | Tony Luck | 1 file, -0/+26

    commit a873dfe1032a132bf89f9e19a6ac44f5a0b78754 upstream.

    Patch series "Copy-on-write poison recovery", v3.

    Part 1 deals with the process that triggered the copy on write fault
    with a store to a shared read-only page. That process is sent a SIGBUS
    with the usual machine check decoration to specify the virtual address
    of the lost page, together with the scope.

    Part 2 sets up to asynchronously take the page with the uncorrected
    error offline to prevent additional machine check faults. H/t to
    Miaohe Lin <linmiaohe@huawei.com> and Shuai Xue
    <xueshuai@linux.alibaba.com> for pointing me to the existing function
    to queue a call to memory_failure().

    On x86 there is some duplicate reporting (because the error is also
    signalled by the memory controller as well as by the core that
    triggered the machine check). Console logs look like this:

    This patch (of 2):

    If the kernel is copying a page as the result of a copy-on-write fault
    and runs into an uncorrectable error, Linux will crash because it does
    not have recovery code for this case where poison is consumed by the
    kernel.

    It is easy to set up a test case. Just inject an error into a private
    page, fork(2), and have the child process write to the page. I wrapped
    that neatly into a test at:

        git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

    just enable ACPI error injection and run:

        # ./einj_mem-uc -f copy-on-write

    Add a new copy_user_highpage_mc() function that uses
    copy_mc_to_kernel() on architectures where that is available
    (currently x86 and powerpc). When an error is detected during the page
    copy, return VM_FAULT_HWPOISON to the caller of wp_page_copy(). This
    propagates up the call stack. Both x86 and powerpc have code in their
    fault handlers to deal with this code by sending a SIGBUS to the
    application.

    Note that this patch avoids a system crash and signals the process
    that triggered the copy-on-write action. It does not take any action
    for the memory error that is still in the shared page. To handle that,
    a call to memory_failure() is needed. But this cannot be done from
    wp_page_copy() because it holds mmap_lock(). Perhaps the architecture
    fault handlers can deal with this loose end in a subsequent patch?

    On Intel/x86 this loose end will often be handled automatically
    because the memory controller provides an additional notification of
    the h/w poison in memory; the handler for this will call
    memory_failure(). This isn't a 100% solution. If there are multiple
    errors, not all may be logged in this way.

    [tony.luck@intel.com: add call to kmsan_unpoison_memory(), per Miaohe Lin]
    Link: https://lkml.kernel.org/r/20221031201029.102123-2-tony.luck@intel.com
    Link: https://lkml.kernel.org/r/20221021200120.175753-1-tony.luck@intel.com
    Link: https://lkml.kernel.org/r/20221021200120.175753-2-tony.luck@intel.com
    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Jane Chu <jane.chu@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
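
    A sketch of the recovering copy helper described above (the message
    calls it copy_user_highpage_mc(); the merged name and details may
    differ). copy_mc_to_kernel() returns the number of bytes left
    uncopied, so a non-zero return means poison was hit:

        static inline int copy_user_highpage_mc(struct page *to,
                                                struct page *from,
                                                unsigned long vaddr,
                                                struct vm_area_struct *vma)
        {
                unsigned long ret;
                char *vfrom = kmap_local_page(from);
                char *vto = kmap_local_page(to);

                ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
                if (!ret)
                        kmsan_unpoison_memory(page_address(to), PAGE_SIZE);
                kunmap_local(vto);
                kunmap_local(vfrom);

                return ret;     /* caller maps non-zero to VM_FAULT_HWPOISON */
        }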
2023-06-28 | gpiolib: Fix irq_domain resource tracking for gpiochip_irqchip_add_domain() | Michael Walle | 1 file, -0/+8

    [ Upstream commit ff7a1790fbf92f1bdd0966d3f0da3ea808ede876 ]

    Up until commit 6a45b0e2589f ("gpiolib: Introduce
    gpiochip_irqchip_add_domain()") all irq_domains were allocated by
    gpiolib itself, and thus gpiolib also takes care of freeing them.

    With gpiochip_irqchip_add_domain() a user of gpiolib can associate an
    irq_domain with the gpio_chip. This irq_domain is not managed by
    gpiolib and therefore must not be freed by gpiolib.

    Fixes: 6a45b0e2589f ("gpiolib: Introduce gpiochip_irqchip_add_domain()")
    Reported-by: Jiawen Wu <jiawenwu@trustnetic.com>
    Signed-off-by: Michael Walle <mwalle@kernel.org>
    Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
    Reviewed-by: Andy Shevchenko <andy@kernel.org>
    Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-28 | regulator: pca9450: Fix LDO3OUT and LDO4OUT MASK | Teresa Remmet | 1 file, -2/+2

    [ Upstream commit 7257d930aadcd62d1c7971ab14f3b1126356abdc ]

    The L3_OUT and L4_OUT bit fields range from bit 0 to bit 4, and thus
    the mask should be 0x1F instead of 0x0F.

    Fixes: 0935ff5f1f0a ("regulator: pca9450: add pca9450 pmic driver")
    Signed-off-by: Teresa Remmet <t.remmet@phytec.de>
    Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
    Link: https://lore.kernel.org/r/20230614125240.3946519-1-t.remmet@phytec.de
    Signed-off-by: Mark Brown <broonie@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-28 | ata: libata-scsi: Avoid deadlock on rescan after device resume | Damien Le Moal | 1 file, -1/+1

    [ Upstream commit 6aa0365a3c8512587fffd42fe438768709ddef8e ]

    When an ATA port is resumed from sleep, a power management request is
    issued to libata EH to reset the port and rescan the device(s)
    attached to it. Device rescanning is done by scheduling an
    ata_scsi_dev_rescan() work, which will execute scsi_rescan_device().

    However, scsi_rescan_device() takes the generic device lock, which is
    also taken by dpm_resume() when the SCSI device is resumed as well. If
    a device rescan execution starts before the completion of the SCSI
    device resume, the rcu locking used to refresh the cached VPD pages of
    the device, combined with the generic device locking from
    scsi_rescan_device() and from dpm_resume(), can cause a deadlock.

    Avoid this situation by changing struct ata_port scsi_rescan_task to
    be a delayed work instead of a simple work_struct.
    ata_scsi_dev_rescan() is modified to check that the SCSI device
    associated with the ATA device that must be rescanned is not
    suspended. If the SCSI device is still suspended, ata_scsi_dev_rescan()
    returns early and reschedules itself for execution after an arbitrary
    delay of 5ms.

    Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
    Reported-by: Joe Breuer <linux-kernel@jmbreuer.net>
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217530
    Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
    Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
    Tested-by: Joe Breuer <linux-kernel@jmbreuer.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-21 | RDMA/mlx5: Fix affinity assignment | Mark Bloch | 1 file, -0/+12

    [ Upstream commit 617f5db1a626f18d5cbb7c7faf7bf8f9ea12be78 ]

    The cited commit aimed to ensure that Virtual Functions (VFs) assign a
    queue affinity to a Queue Pair (QP) to distribute traffic when the LAG
    master creates a hardware LAG. If the affinity was set while the
    hardware was not in LAG, the firmware would ignore the affinity value.

    However, this commit unintentionally assigned an affinity to QPs on
    the LAG master's VPORT even if the RDMA device was not marked as
    LAG-enabled. In most cases, this was not an issue because when the
    hardware entered hardware LAG configuration, the RDMA device of the
    LAG master would be destroyed and a new one would be created, marked
    as LAG-enabled.

    The problem arises when a user configures Equal-Cost Multipath (ECMP).
    In ECMP mode, traffic can be directed to different physical ports
    based on the queue affinity, which is intended for use by VPORTS other
    than the E-Switch manager. ECMP mode is supported only if both
    E-Switch managers are in switchdev mode and the appropriate route is
    configured via IP. In this configuration, the RDMA device is not
    destroyed, and we retain the RDMA device that is not marked as
    LAG-enabled.

    To ensure correct behavior, Send Queues (SQs) opened by the E-Switch
    manager through verbs should be assigned strict affinity. This means
    they will only be able to communicate through the native physical port
    associated with the E-Switch manager. This will prevent the firmware
    from assigning affinity and will not allow the SQs to be remapped in
    case of failover.

    Fixes: 802dcc7fc5ec ("RDMA/mlx5: Support TX port affinity for VF drivers in LAG mode")
    Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
    Signed-off-by: Mark Bloch <mbloch@nvidia.com>
    Link: https://lore.kernel.org/r/425b05f4da840bc684b0f7e8ebf61aeb5cef09b0.1685960567.git.leon@kernel.org
    Signed-off-by: Leon Romanovsky <leon@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-21 | EDAC/qcom: Get rid of hardcoded register offsets | Manivannan Sadhasivam | 1 file, -6/+0

    [ Upstream commit cbd77119b6355872cd308a60e99f9ca678435d15 ]

    The LLCC EDAC register offsets vary between SoCs. Hardcoding the
    register offsets won't work and will often result in a crash due to
    accessing the wrong locations. Hence, get the register offsets from
    the LLCC driver matching the individual SoCs.

    Cc: <stable@vger.kernel.org> # 6.0: 5365cea199c7 ("soc: qcom: llcc: Rename reg_offset structs to reflect LLCC version")
    Cc: <stable@vger.kernel.org> # 6.0: c13d7d261e36 ("soc: qcom: llcc: Pass LLCC version based register offsets to EDAC driver")
    Cc: <stable@vger.kernel.org> # 6.0
    Fixes: a6e9d7ef252c ("soc: qcom: llcc: Add configuration data for SM8450 SoC")
    Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
    Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
    Signed-off-by: Bjorn Andersson <andersson@kernel.org>
    Link: https://lore.kernel.org/r/20230517114635.76358-3-manivannan.sadhasivam@linaro.org
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-21 | qcom: llcc/edac: Fix the base address used for accessing LLCC banks | Manivannan Sadhasivam | 1 file, -4/+2

    [ Upstream commit ee13b5008707948d3052c1b5aab485c6cd53658e ]

    The Qualcomm LLCC/EDAC drivers were using a fixed register stride for
    accessing the Control and Status Registers (CSRs) of each LLCC bank.
    This stride only works for some SoCs like SDM845, for which driver
    support was initially added.

    But later SoCs use a different register stride that varies between the
    banks, with holes in between. So it is not possible to use a single
    register stride for accessing the CSRs of each bank. Doing so could
    result in a crash.

    For fixing this issue, let's obtain the base address of each LLCC bank
    from devicetree and get rid of the fixed stride. This also means there
    is no need to rely on the reg-names property, and the base addresses
    can be obtained using the index. The first index is LLCC bank 0 and
    the last index is LLCC broadcast. If the SoC supports more than one
    bank, then those need to be defined in devicetree for indexes from
    1..N-1.

    Reported-by: Parikshit Pareek <quic_ppareek@quicinc.com>
    Tested-by: Luca Weiss <luca.weiss@fairphone.com>
    Tested-by: Steev Klimaszewski <steev@kali.org> # Thinkpad X13s
    Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8540p-ride
    Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
    Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
    Signed-off-by: Bjorn Andersson <andersson@kernel.org>
    Link: https://lore.kernel.org/r/20230314080443.64635-13-manivannan.sadhasivam@linaro.org
    Stable-dep-of: cbd77119b635 ("EDAC/qcom: Get rid of hardcoded register offsets")
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-14 | mm: page_table_check: Ensure user pages are not slab pages | Ruihan Li | 1 file, -0/+6

    commit 44d0fb387b53e56c8a050bac5c7d460e21eb226f upstream.

    The current uses of PageAnon in page table check functions can lead to
    type confusion bugs between struct page and slab [1], if slab pages
    are accidentally mapped into the user space. This is because slab
    reuses the bits in struct page to store its internal states, which
    renders PageAnon ineffective on slab pages.

    Since slab pages are not expected to be mapped into the user space,
    this patch adds BUG_ON(PageSlab(page)) checks to make sure that slab
    pages are not inadvertently mapped. Otherwise, there must be some bugs
    in the kernel.

    Reported-by: syzbot+fcf1a817ceb50935ce99@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/lkml/000000000000258e5e05fae79fc1@google.com/ [1]
    Fixes: df4e817b7108 ("mm: page table check")
    Cc: <stable@vger.kernel.org> # 5.17
    Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
    Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com>
    Link: https://lore.kernel.org/r/20230515130958.32471-5-lrh2000@pku.edu.cn
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
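
    A minimal sketch of where such a check lands; the helper name here is
    hypothetical, and the real code iterates over page ranges:

        static void page_table_check_track(struct page *page)
        {
                /* slab pages must never be mapped into user space */
                BUG_ON(PageSlab(page));

                if (PageAnon(page)) {
                        /* ... account the anonymous mapping ... */
                }
        }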
2023-06-14 | usb: usbfs: Enforce page requirements for mmap | Ruihan Li | 1 file, -0/+5

    commit 0143d148d1e882fb1538dc9974c94d63961719b9 upstream.

    The current implementation of usbdev_mmap uses usb_alloc_coherent to
    allocate memory pages that will later be mapped into the user space.
    Meanwhile, usb_alloc_coherent employs three different methods to
    allocate memory, as outlined below:

     * If hcd->localmem_pool is non-null, it uses gen_pool_dma_alloc to
       allocate memory;
     * If DMA is not available, it uses kmalloc to allocate memory;
     * Otherwise, it uses dma_alloc_coherent.

    However, it should be noted that gen_pool_dma_alloc does not guarantee
    that the resulting memory will be page-aligned. Furthermore, trying to
    map slab pages (i.e., memory allocated by kmalloc) into the user space
    is not reasonable and can lead to problems, such as a type confusion
    bug when PAGE_TABLE_CHECK=y [1].

    To address these issues, this patch introduces
    hcd_alloc_coherent_pages, which addresses the above two problems.
    Specifically, hcd_alloc_coherent_pages uses gen_pool_dma_alloc_align
    instead of gen_pool_dma_alloc to ensure that the memory is
    page-aligned. To replace kmalloc, hcd_alloc_coherent_pages directly
    allocates pages by calling __get_free_pages.

    Reported-by: syzbot+fcf1a817ceb50935ce99@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/lkml/000000000000258e5e05fae79fc1@google.com/ [1]
    Fixes: f7d34b445abc ("USB: Add support for usbfs zerocopy.")
    Fixes: ff2437befd8f ("usb: host: Fix excessive alignment restriction for local memory allocations")
    Cc: stable@vger.kernel.org
    Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
    Acked-by: Alan Stern <stern@rowland.harvard.edu>
    Link: https://lore.kernel.org/r/20230515130958.32471-2-lrh2000@pku.edu.cn
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-14 | net: sched: add rcu annotations around qdisc->qdisc_sleeping | Eric Dumazet | 1 file, -1/+1

    [ Upstream commit d636fc5dd692c8f4e00ae6e0359c0eceeb5d9bdb ]

    syzbot reported a race around qdisc->qdisc_sleeping [1]

    It is time we add proper annotations to reads and writes to/from
    qdisc->qdisc_sleeping.

    [1]
    BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu

    read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1:
     qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331
     __tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174
     tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547
     rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386
     netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
     rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
     netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
     netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
     netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
     sock_sendmsg_nosec net/socket.c:724 [inline]
     sock_sendmsg net/socket.c:747 [inline]
     ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
     ___sys_sendmsg net/socket.c:2557 [inline]
     __sys_sendmsg+0x1e3/0x270 net/socket.c:2586
     __do_sys_sendmsg net/socket.c:2595 [inline]
     __se_sys_sendmsg net/socket.c:2593 [inline]
     __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd

    write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0:
     dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115
     qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103
     tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693
     rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395
     netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
     rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
     netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
     netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
     netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
     sock_sendmsg_nosec net/socket.c:724 [inline]
     sock_sendmsg net/socket.c:747 [inline]
     ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
     ___sys_sendmsg net/socket.c:2557 [inline]
     __sys_sendmsg+0x1e3/0x270 net/socket.c:2586
     __do_sys_sendmsg net/socket.c:2595 [inline]
     __se_sys_sendmsg net/socket.c:2593 [inline]
     __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023

    Fixes: 3a7d0d07a386 ("net: sched: extend Qdisc with rcu")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Vlad Buslov <vladbu@nvidia.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-14 | rfs: annotate lockless accesses to RFS sock flow table | Eric Dumazet | 1 file, -2/+5

    [ Upstream commit 5c3b74a92aa285a3df722bf6329ba7ccf70346d6 ]

    Add READ_ONCE()/WRITE_ONCE() on accesses to the sock flow table.

    This also prevents a (smart ?) compiler from removing the condition
    in:

        if (table->ents[index] != newval)
                table->ents[index] = newval;

    We need the condition to avoid dirtying a shared cache line.

    Fixes: fec5e652e58f ("rfs: Receive Flow Steering")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
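
    With the annotations applied, the update could read as the following
    sketch (names as in the snippet above):

        u32 val = READ_ONCE(table->ents[index]);

        /* only dirty the shared cache line when the value changes */
        if (val != newval)
                WRITE_ONCE(table->ents[index], newval);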
2023-06-09 | nfsd: fix double fget() bug in __write_ports_addfd() | Dan Carpenter | 1 file, -4/+3

    [ Upstream commit c034203b6a9dae6751ef4371c18cb77983e30c28 ]

    The bug here is that you cannot rely on getting the same socket from
    multiple calls to fget() because userspace can influence that. This is
    a kind of double fetch bug.

    The fix is to delete the svc_alien_sock() function and instead do the
    checking inside the svc_addsock() function.

    Fixes: 3064639423c4 ("nfsd: check passed socket's net matches NFSd superblock's one")
    Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
    Reviewed-by: NeilBrown <neilb@suse.com>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-05 | bpf, sockmap: Improved check for empty queue | John Fastabend | 1 file, -1/+0

    [ Upstream commit 405df89dd52cbcd69a3cd7d9a10d64de38f854b2 ]

    We noticed some rare sk_buffs were stepping past the queue when the
    system was under memory pressure. The general theory is to skip
    enqueueing sk_buffs when it's not necessary, which is the normal case
    with a system that is properly provisioned for the task: no memory
    pressure and enough CPU assigned.

    But, if we can't allocate memory due to an ENOMEM error when
    enqueueing the sk_buff into the sockmap receive queue, we push it onto
    a delayed workqueue to retry later. When a new sk_buff is received we
    then check if that queue is empty. However, there is a problem with
    simply checking the queue length. When a sk_buff is being processed
    from the ingress queue but is not yet on the sockmap msg receive
    queue, it's possible to also receive a sk_buff through the normal
    path. It will check the ingress queue, see that it is zero, and then
    skip ahead of the packet being processed.

    Previously we used the sock lock from both contexts, which made the
    problem harder to hit, but not impossible.

    To fix, instead of popping the skb from the queue entirely, we peek
    the skb from the queue and do the copy there. This ensures checks to
    the queue length are non-zero while the skb is being processed. Then,
    finally, when the entire skb has been copied to the user space queue
    or another socket, we pop it off the queue. This way the queue length
    check allows bypassing the queue only after the list has been
    completely processed.

    To reproduce the issue we run the NGINX compliance test with sockmap
    running and observe some flakes in our testing that we attributed to
    this issue.

    Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
    Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: William Findlay <will@isovalent.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-05 | bpf, sockmap: Convert schedule_work into delayed_work | John Fastabend | 1 file, -1/+1

    [ Upstream commit 29173d07f79883ac94f5570294f98af3d4287382 ]

    Sk_buffs are fed into sockmap verdict programs either from a strparser
    (when the user might want to decide how framing of the skb is done by
    attaching another parser program) or directly through tcp_read_sock.
    The tcp_read_sock is the preferred method for performance when the
    BPF logic is a stream parser.

    The flow for Cilium's common use case with a stream parser is:

        tcp_read_sock()
         sk_psock_verdict_recv
           ret = bpf_prog_run_pin_on_cpu()
           sk_psock_verdict_apply(sock, skb, ret)
            // if system is under memory pressure or app is slow we may
            // need to queue skb. Do this queuing through ingress_skb and
            // then kick timer to wake up handler
            skb_queue_tail(ingress_skb, skb)
            schedule_work(work);

    The work queue is wired up to sk_psock_backlog(). This will then walk
    the ingress_skb skb list that holds our sk_buffs that could not be
    handled, but should be OK to run at some later point. However, it's
    possible that the workqueue doing this work still hits an error when
    sending the skb. When this happens the skbuff is requeued on a
    temporary 'state' struct kept with the workqueue. This is necessary
    because it's possible to partially send an skbuff before hitting an
    error and we need to know how and where to restart when the workqueue
    runs next.

    Now for the trouble: we don't re-kick the workqueue. This can cause a
    stall where the skbuff we just cached on the state variable might
    never be sent. This happens when it's the last packet in a flow and no
    further packets come along that would cause the system to kick the
    workqueue from that side.

    To fix we could do a simple schedule_work(), but while under memory
    pressure it makes sense to back off some instead of continuing to
    retry repeatedly. So instead, to fix, convert schedule_work to
    schedule_delayed_work and add backoff logic to reschedule from the
    backlog queue on errors. It's not obvious though what a good backoff
    is, so use '1'.

    To test, we observed some flakes while running the NGINX compliance
    test with sockmap; we attributed these failed tests to this bug and a
    subsequent issue.

    From on-list discussion: commit bec217197b41 ("skmsg: Schedule psock
    work if the cached skb exists on the psock") was intended to address a
    similar race, but had a couple of cases it missed. Most obviously, it
    only accounted for receiving traffic on the local socket, so if
    redirecting into another socket we could still get an sk_buff stuck
    here. Next, it missed the case where copied=0 in the recv() handler,
    and then we wouldn't kick the scheduler. Also, it's sub-optimal to
    require userspace to kick the internal mechanisms of sockmap to wake
    it up and copy data to user. It results in an extra syscall and
    requires the app to actually handle the EAGAIN correctly.

    Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: William Findlay <will@isovalent.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
    Signed-off-by: Sasha Levin <sashal@kernel.org>
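
    A before/after sketch of the conversion (member and call sites as
    described above; the exact upstream diff may differ):

        /* struct sk_psock: the backlog work becomes delayed work */
        -       struct work_struct              work;
        +       struct delayed_work             work;

        /* requeue with a one-jiffy backoff instead of immediately */
        -       schedule_work(&psock->work);
        +       schedule_delayed_work(&psock->work, 1);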
2023-06-05 | tls: rx: strp: force mixed decrypted records into copy mode | Jakub Kicinski | 1 file, -0/+10

    [ Upstream commit 14c4be92ebb3e36e392aa9dd8f314038a9f96f3c ]

    If a record is partially decrypted we'll have to CoW it, anyway, so go
    into copy mode and allocate a writable skb right away.

    This will make the subsequent fix simpler because we won't have to
    teach tls_strp_msg_make_copy() how to copy skbs while preserving
    decrypt status.

    Tested-by: Shai Amiram <samiram@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Stable-dep-of: eca9bfafee3a ("tls: rx: strp: preserve decryption status of skbs when needed")
    Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-30 | net/mlx5: DR, Check force-loopback RC QP capability independently from RoCE | Yevgeny Kliteynik | 1 file, -1/+3

    commit c7dd225bc224726c22db08e680bf787f60ebdee3 upstream.

    SW Steering uses an RC QP for writing STEs to ICM. This writing is
    done in LB (loopback), and an FL (force-loopback) QP is preferred for
    performance. FL is available when RoCE is enabled or disabled, based
    on RoCE caps.

    This patch adds reading of the FL capability from HCA caps in addition
    to the existing reading from RoCE caps, thus fixing the case where we
    didn't have loopback enabled when RoCE was disabled.

    Fixes: 7304d603a57a ("net/mlx5: DR, Add support for force-loopback QP")
    Signed-off-by: Itamar Gozlan <igozlan@nvidia.com>
    Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
    Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-30 | x86/pci/xen: populate MSI sysfs entries | Maximilian Heyne | 1 file, -1/+8

    commit 335b4223466dd75f9f3ea4918187afbadd22e5c8 upstream.

    Commit bf5e758f02fc ("genirq/msi: Simplify sysfs handling") reworked
    the creation of sysfs entries for MSI IRQs. The creation used to be in
    msi_domain_alloc_irqs_descs_locked after calling
    ops->domain_alloc_irqs. Then it moved into __msi_domain_alloc_irqs,
    which is an implementation of domain_alloc_irqs. However, Xen comes
    with the only other implementation of domain_alloc_irqs and hence
    doesn't run the sysfs population code anymore.

    Commit 6c796996ee70 ("x86/pci/xen: Fixup fallout from the PCI/MSI
    overhaul") set the flag MSI_FLAG_DEV_SYSFS for the xen
    msi_domain_info, but that doesn't actually have an effect because Xen
    uses its own domain_alloc_irqs implementation.

    Fix this by making use of the fallback functions for sysfs population.

    Fixes: bf5e758f02fc ("genirq/msi: Simplify sysfs handling")
    Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
    Reviewed-by: Juergen Gross <jgross@suse.com>
    Link: https://lore.kernel.org/r/20230503131656.15928-1-mheyne@amazon.de
    Signed-off-by: Juergen Gross <jgross@suse.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-30 | fs: fix undefined behavior in bit shift for SB_NOUSER | Hao Ge | 1 file, -21/+21

    commit f15afbd34d8fadbd375f1212e97837e32bc170cc upstream.

    Shifting a signed 32-bit value by 31 bits is undefined, so change the
    significant bit to unsigned. It was spotted by UBSAN. So let's just
    fix this by using the BIT() helper for all SB_* flags.

    Fixes: e462ec50cb5f ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
    Signed-off-by: Hao Ge <gehao@kylinos.cn>
    Message-Id: <20230424051835.374204-1-gehao@kylinos.cn>
    [brauner@kernel.org: use BIT() for all SB_* flags]
    Signed-off-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
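
    The before/after for the flag that triggers the UBSAN report, as a
    sketch (BIT(31) expands to an unsigned value, so the shift no longer
    happens on a signed int):

        /* before: undefined behavior, shifting signed 1 into the sign bit */
        #define SB_NOUSER       (1<<31)

        /* after: */
        #define SB_NOUSER       BIT(31)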
2023-05-30 | firmware: arm_ffa: Fix FFA device names for logical partitions | Sudeep Holla | 1 file, -0/+1

    commit 19b8766459c41c6f318f8a548cc1c66dffd18363 upstream.

    Each physical partition can provide multiple services, each with its
    own UUID. Each such service can be presented as a logical partition
    with a unique combination of VM ID and UUID. The number of distinct
    UUIDs in a system will be less than or equal to the number of logical
    partitions.

    However, currently it fails to register more than one logical
    partition or service within a physical partition, as the device name
    contains only the VM ID while both VM ID and UUID are maintained in
    the partition information. The kernel complains with the below
    message:

        | sysfs: cannot create duplicate filename '/devices/arm-ffa-8001'
        | CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc7 #8
        | Hardware name: FVP Base RevC (DT)
        | Call trace:
        |  dump_backtrace+0xf8/0x118
        |  show_stack+0x18/0x24
        |  dump_stack_lvl+0x50/0x68
        |  dump_stack+0x18/0x24
        |  sysfs_create_dir_ns+0xe0/0x13c
        |  kobject_add_internal+0x220/0x3d4
        |  kobject_add+0x94/0x100
        |  device_add+0x144/0x5d8
        |  device_register+0x20/0x30
        |  ffa_device_register+0x88/0xd8
        |  ffa_setup_partitions+0x108/0x1b8
        |  ffa_init+0x2ec/0x3a4
        |  do_one_initcall+0xcc/0x240
        |  do_initcall_level+0x8c/0xac
        |  do_initcalls+0x54/0x94
        |  do_basic_setup+0x1c/0x28
        |  kernel_init_freeable+0x100/0x16c
        |  kernel_init+0x20/0x1a0
        |  ret_from_fork+0x10/0x20
        | kobject_add_internal failed for arm-ffa-8001 with -EEXIST, don't try to
        | register things with the same name in the same directory.
        | arm_ffa arm-ffa: unable to register device arm-ffa-8001 err=-17
        | ARM FF-A: ffa_setup_partitions: failed to register partition ID 0x8001

    By virtue of being random enough to avoid collisions when generated in
    a distributed system, there is no way to compress UUID keys to the
    number of bits required to identify each. We can eliminate '-' in the
    name, but it is not worth eliminating 4 bytes and adding unnecessary
    logic for doing that. Also, v1.0 doesn't provide the UUID of the
    partitions, which makes it hard to use the same for the device name.

    So to keep it simple, let us allocate an ID using ida_alloc() and
    append the same to "arm-ffa" to make up a unique device name. Also
    stash the id value in ffa_dev to help freeing the ID later when the
    device is destroyed.

    Fixes: e781858488b9 ("firmware: arm_ffa: Add initial FFA bus support for device enumeration")
    Reported-by: Lucian Paul-Trifu <lucian.paul-trifu@arm.com>
    Link: https://lore.kernel.org/r/20230419-ffa_fixes_6-4-v2-3-d9108e43a176@arm.com
    Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-30power: supply: bq27xxx: Ensure power_supply_changed() is called on current sign changesHans de Goede1-0/+3
commit 939a116142012926e25de0ea6b7e2f8d86a5f1b6 upstream. On gauges where the current register is signed, there is no charging flag in the flags register. So only checking flags will not result in power_supply_changed() getting called when e.g. a charger is plugged in and the current sign changes from negative (discharging) to positive (charging). This causes userspace's notion of the status to lag until userspace does a poll. And when a power_supply_leds.c LED trigger is used to indicate charging status with a LED, this LED will lag until the capacity percentage changes, which may take many minutes (because the LED trigger is only updated on power_supply_changed() calls). Fix this by calling bq27xxx_battery_current_and_status() on gauges with a signed current register and checking if the status has changed. Fixes: 297a533b3e62 ("bq27x00: Cache battery registers") Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
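Roughly, the fix amounts to something like the sketch below; this is not the literal upstream diff, and the 'last_status' cache field and the signed-register predicate are assumptions made for illustration:

    #include <linux/power_supply.h>

    static void bq27xxx_battery_update(struct bq27xxx_device_info *di)
    {
            union power_supply_propval status;

            /* ... existing flag/capacity handling ... */

            if (!(di->opts & BQ27XXX_O_ZERO)) {  /* assumed: signed current reg */
                    bq27xxx_battery_current_and_status(di, NULL, &status, NULL);
                    if (di->last_status != status.intval) {  /* assumed field */
                            di->last_status = status.intval;
                            power_supply_changed(di->bat);   /* notify now */
                    }
            }
    }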
2023-05-30power: supply: bq27xxx: Fix poll_interval handling and races on removeHans de Goede1-0/+1
commit c00bc80462afc7963f449d7f21d896d2f629cacc upstream. Before this patch bq27xxx_battery_teardown() was setting poll_interval = 0 to avoid bq27xxx_battery_update() requeuing the delayed_work item. There are 2 problems with this:
1. If the driver is unbound through sysfs, rather than the module being rmmod-ed, this changes poll_interval unexpectedly.
2. This is racy: after it has been set, poll_interval could be changed before bq27xxx_battery_update() checks it, through /sys/module/bq27xxx_battery/parameters/poll_interval.
Fix this by adding a "removed" attribute to struct bq27xxx_device_info and using that instead of setting poll_interval to 0. There is also another poll_interval-related race on remove(): writing /sys/module/bq27xxx_battery/parameters/poll_interval will requeue the delayed_work item for all devices on the bq27xxx_battery_devices list, and the device being removed was only removed from that list after cancelling the delayed_work item. Fix this by moving the removal from the bq27xxx_battery_devices list to before cancelling the delayed_work item. Fixes: 8cfaaa811894 ("bq27x00_battery: Fix OOPS caused by unregistring bq27x00 driver") Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
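A minimal sketch of the "removed" guard; where exactly the check sits in the driver is illustrative, but the per-device flag replacing the poll_interval = 0 trick is as described above:

    static void bq27xxx_battery_poll(struct work_struct *work)
    {
            struct bq27xxx_device_info *di =
                    container_of(work, struct bq27xxx_device_info, work.work);

            bq27xxx_battery_update(di);

            /* Requeue based on the per-device flag, not poll_interval = 0,
             * so unbind neither clobbers the module parameter nor races it. */
            if (!di->removed && poll_interval > 0)
                    schedule_delayed_work(&di->work, poll_interval * HZ);
    }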
2023-05-30USB: core: Add routines for endpoint checks in old driversAlan Stern1-0/+5
commit 13890626501ffda22b18213ddaf7930473da5792 upstream. Many of the older USB drivers in the Linux USB stack were written based simply on a vendor's device specification. They use the endpoint information in the spec and assume these endpoints will always be present, with the properties listed, in any device matching the given vendor and product IDs. While that may have been true back then, with spoofing and fuzzing it is not true any more. More and more we are finding that those old drivers need to perform at least a minimum of checking before they try to use any endpoint other than ep0. To make this checking as simple as possible, we now add a couple of utility routines to the USB core. usb_check_bulk_endpoints() and usb_check_int_endpoints() take an interface pointer together with a list of endpoint addresses (numbers and directions). They check that the interface's current alternate setting includes endpoints with those addresses and that each of these endpoints has the right type: bulk or interrupt, respectively. Although we already have usb_find_common_endpoints() and related routines meant for a similar purpose, they are not well suited for this kind of checking. Those routines find endpoints of various kinds, but only one (either the first or the last) of each kind, and they don't verify that the endpoints' addresses agree with what the caller expects. In theory the new routines could be more general: They could take a particular altsetting as their argument instead of always using the interface's current altsetting. In practice I think this won't matter too much; multiple altsettings tend to be used for transferring media (audio or visual) over isochronous endpoints, not bulk or interrupt. Drivers for such devices will generally require more sophisticated checking than these simplistic routines provide. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Link: https://lore.kernel.org/r/dd2c8e8c-2c87-44ea-ba17-c64b97e201c9@rowland.harvard.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
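For example, a hypothetical driver that hardcodes one bulk-OUT and one bulk-IN endpoint might guard its probe as sketched below (driver name and endpoint addresses are made up; the zero-terminated address list follows the upstream helpers):

    #include <linux/usb.h>

    /* Hypothetical endpoint layout: 0x01 bulk OUT, 0x81 bulk IN. */
    static const u8 foo_bulk_eps[] = { 0x01, 0x81, 0 };

    static int foo_probe(struct usb_interface *intf,
                         const struct usb_device_id *id)
    {
            /* Spoofed or malformed device: bail out before touching
             * any endpoint other than ep0. */
            if (!usb_check_bulk_endpoints(intf, foo_bulk_eps))
                    return -ENODEV;

            /* ... normal probe work ... */
            return 0;
    }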
2023-05-30net: fix stack overflow when LRO is disabled for virtual interfacesTaehee Yoo1-0/+1
commit ae9b15fbe63447bc1d3bba3769f409d17ca6fdf6 upstream. When a virtual interface's features are updated, it synchronizes the updated features to its own lower interfaces. This propagation logic should work as an iteration, not recursion, but due to the netdev notification mechanism it unexpectedly works recursively. The problem occurs when LRO is disabled on team and bonding interface types.

                 team0
                   |
        +------+------+-----+-----+
        |      |      |     |     |
      team1  team2  team3  ...  team200

If team0's LRO feature is updated, a NETDEV_FEAT_CHANGE event is generated for each of its own lower interfaces (team1 ~ team200). This is done by netdev_sync_lower_features(), so the NETDEV_FEAT_CHANGE notification logic of each lower interface runs iteratively. But the generated NETDEV_FEAT_CHANGE event is also sent to the upper interface, and the upper interface (team0) then generates NETDEV_FEAT_CHANGE events for its own lower interfaces again. Lower and upper interfaces receive and regenerate this event again and again, so the stack overflows. It is not an infinite loop, because netdev_sync_lower_features() updates features before generating the NETDEV_FEAT_CHANGE event, and already-synchronized lower interfaces skip the notification logic; the problem is simply that the intended iteration unexpectedly turns into recursion through the notification mechanism.

Reproducer:
    ip link add team0 type team
    ethtool -K team0 lro on
    for i in {1..200}
    do
        ip link add team$i master team0 type team
        ethtool -K team$i lro on
    done
    ethtool -K team0 lro off

In order to fix it, the notifier_ctx member of bonding/team is introduced. Reported-by: syzbot+60748c96cf5c6df8e581@syzkaller.appspotmail.com Fixes: fd867d51f889 ("net/core: generic support for disabling netdev features down stack") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230517143010.3596250-1-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
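A hedged sketch of how the notifier_ctx guard breaks the notify -> sync -> notify recursion in the team driver (the bonding change is analogous); the surrounding notifier skeleton is illustrative and helper names are from memory:

    static int team_device_event(struct notifier_block *unused,
                                 unsigned long event, void *ptr)
    {
            struct net_device *dev = netdev_notifier_info_to_dev(ptr);
            struct team_port *port = team_port_get_rtnl(dev);

            if (!port)
                    return NOTIFY_DONE;

            switch (event) {
            case NETDEV_FEAT_CHANGE:
                    /* Skip re-propagation while a feature sync triggered
                     * by this team device is already in flight. */
                    if (!port->team->notifier_ctx) {
                            port->team->notifier_ctx = true;
                            team_compute_features(port->team);
                            port->team->notifier_ctx = false;
                    }
                    break;
            }
            return NOTIFY_DONE;
    }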
2023-05-30tpm: Prevent hwrng from activating during resumeJarkko Sakkinen1-0/+1
[ Upstream commit 99d46450625590d410f86fe4660a5eff7d3b8343 ] Set TPM_CHIP_FLAG_SUSPENDED in tpm_pm_suspend() and reset it in tpm_pm_resume(). While the flag is set, tpm_hwrng_read() gives back zero bytes. This prevents the hwrng from racing during resume. Cc: stable@vger.kernel.org Fixes: 6e592a065d51 ("tpm: Move Linux RNG connection to hwrng") Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
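A hedged sketch of the gating; error handling and the state-save logic are elided, and the surrounding bodies are illustrative rather than the exact upstream code:

    int tpm_pm_suspend(struct device *dev)
    {
            struct tpm_chip *chip = dev_get_drvdata(dev);

            /* ... save TPM state as before ... */
            chip->flags |= TPM_CHIP_FLAG_SUSPENDED;
            return 0;
    }

    int tpm_pm_resume(struct device *dev)
    {
            struct tpm_chip *chip = dev_get_drvdata(dev);

            chip->flags &= ~TPM_CHIP_FLAG_SUSPENDED;
            return 0;
    }

    static int tpm_hwrng_read(struct hwrng *rng, void *data,
                              size_t max, bool wait)
    {
            struct tpm_chip *chip = container_of(rng, struct tpm_chip, hwrng);

            /* While suspended, report zero bytes instead of touching the
             * chip, so hwrng cannot race the resume path. */
            if (chip->flags & TPM_CHIP_FLAG_SUSPENDED)
                    return 0;

            return tpm_get_random(chip, data, max);
    }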
2023-05-30tpm: Re-enable TPM chip bootstrapping for non-tpm_tis TPM driversJarkko Sakkinen1-6/+7
[ Upstream commit 0c8862de05c1a087795ee0a87bf61a6394306cc0 ] TPM chip bootstrapping was removed from tpm_chip_register(), and it was relocated to tpm_tis_core. This breaks all drivers which are not based on tpm_tis because the chip will not get properly initialized. Take the corrective steps:
1. Rename tpm_chip_startup() as tpm_chip_bootstrap() and make it one-shot.
2. Call tpm_chip_bootstrap() in tpm_chip_register(), which reverts things to how they used to be.
Cc: Lino Sanfilippo <l.sanfilippo@kunbus.com> Fixes: 548eb516ec0f ("tpm, tpm_tis: startup chip before testing for interrupts") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Link: https://lore.kernel.org/all/ZEjqhwHWBnxcaRV5@xpf.sh.intel.com/ Tested-by: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Stable-dep-of: 99d464506255 ("tpm: Prevent hwrng from activating during resume") Signed-off-by: Sasha Levin <sashal@kernel.org>
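A hedged sketch of the one-shot behavior from step 1; the guard flag name and the start/stop helpers shown are assumptions for illustration:

    int tpm_chip_bootstrap(struct tpm_chip *chip)
    {
            int rc;

            if (chip->flags & TPM_CHIP_FLAG_BOOTSTRAPPED)  /* assumed flag */
                    return 0;

            rc = tpm_chip_start(chip);
            if (rc)
                    return rc;

            rc = tpm_auto_startup(chip);
            tpm_chip_stop(chip);

            if (!rc)
                    chip->flags |= TPM_CHIP_FLAG_BOOTSTRAPPED;
            return rc;
    }

    int tpm_chip_register(struct tpm_chip *chip)
    {
            int rc = tpm_chip_bootstrap(chip);  /* restored call, step 2 */

            if (rc)
                    return rc;
            /* ... rest of registration ... */
            return 0;
    }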
2023-05-24SUNRPC: always free ctxt when freeing deferred requestNeilBrown2-2/+2
[ Upstream commit 948f072ada23e0a504c5e4d7d71d4c83bd0785ec ] Since the ->xprt_ctxt pointer was added to svc_deferred_req, it has not been sufficient to use kfree() to free a deferred request. We may need to free the ctxt as well. As freeing the ctxt is all that ->xpo_release_rqst() does, we repurpose it to do that explicitly even when the ctxt is not stored in an rqst. So we now have ->xpo_release_ctxt() which is given an xprt and a ctxt, which may have been taken either from an rqst or from a dreq. The caller is now responsible for clearing that pointer after the call to ->xpo_release_ctxt. We also clear dr->xprt_ctxt when the ctxt is moved into a new rqst when revisiting a deferred request. This ensures there is only one pointer to the ctxt, so the risk of double freeing in future is reduced. The new code in svc_xprt_release which releases both the ctxt and any rq_deferred depends on this. Fixes: 773f91b2cf3f ("SUNRPC: Fix NFSD's request deferral on RDMA transports") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
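A sketch of the reshaped callback and the caller-side contract; only the relevant op is shown, and the freeing helper below is a hypothetical framing of the pattern described above:

    struct svc_xprt_ops {
            /* was: void (*xpo_release_rqst)(struct svc_rqst *rqstp); */
            void (*xpo_release_ctxt)(struct svc_xprt *xprt, void *ctxt);
            /* ... other ops unchanged ... */
    };

    /* Hypothetical helper: freeing a deferred request now releases the
     * ctxt explicitly, and the caller clears the pointer afterwards. */
    static void svc_deferred_free(struct svc_xprt *xprt,
                                  struct svc_deferred_req *dr)
    {
            if (dr->xprt_ctxt)
                    xprt->xpt_ops->xpo_release_ctxt(xprt, dr->xprt_ctxt);
            dr->xprt_ctxt = NULL;  /* keep exactly one owner of the ctxt */
            kfree(dr);
    }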
2023-05-24platform: Provide a remove callback that returns no valueUwe Kleine-König1-0/+11
[ Upstream commit 5c5a7680e67ba6fbbb5f4d79fa41485450c1985c ] struct platform_driver::remove returning an integer made driver authors expect that returning an error code was proper error handling. However, the driver core ignores the error and continues to remove the device, because there is nothing the core could do anyhow and reentering the remove callback again is only calling for trouble. So this is a source of errors, typically yielding resource leaks in the error path. As there are too many platform drivers to neatly convert them all to return void in a single go, do it in several steps after this patch:
a) Convert all drivers to implement .remove_new() returning void instead of .remove() returning int;
b) Change struct platform_driver::remove() to return void and so make it identical to .remove_new();
c) Change all drivers back to .remove() now with the better prototype;
d) Drop struct platform_driver::remove_new().
While this touches all drivers eventually twice, steps a) and c) can be done one driver after another and so reduce coordination efforts immensely and simplify review. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Link: https://lore.kernel.org/r/20221209150914.3557650-1-u.kleine-koenig@pengutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 17955aba7877 ("ASoC: fsl_micfil: Fix error handler with pm_runtime_enable") Signed-off-by: Sasha Levin <sashal@kernel.org>
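Step a) in practice, sketched for a made-up driver (the "foo" names are hypothetical; .remove_new is the callback this patch introduces):

    #include <linux/platform_device.h>

    static void foo_remove(struct platform_device *pdev)
    {
            /* Release resources; there is nothing to return -- the device
             * is going away whether or not cleanup succeeds. */
    }

    static struct platform_driver foo_driver = {
            .driver = {
                    .name = "foo",
            },
            .remove_new = foo_remove,
    };
    module_platform_driver(foo_driver);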
2023-05-24sched: Fix KCSAN noinstr violationJosh Poimboeuf1-1/+1
[ Upstream commit e0b081d17a9f4e5c0cbb0e5fbeb1abe3de0f7e4e ] With KCSAN enabled, end_of_stack() can get out-of-lined. Force it inline. Fixes the following warnings: vmlinux.o: warning: objtool: check_stackleak_irqoff+0x2b: call to end_of_stack() leaves .noinstr.text section Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/cc1b4d73d3a428a00d206242a68fdf99a934ca7b.1681320026.git.jpoimboe@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
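The fix is essentially a one-word change: force inlining so KCSAN cannot out-of-line the helper from noinstr code. A sketch of one variant of the accessor (the exact body differs per configuration; shown from memory, not the literal diff):

    static __always_inline unsigned long *end_of_stack(struct task_struct *p)
    {
            /* was: static inline -- KCSAN could emit an out-of-line copy */
            return (unsigned long *)(task_thread_info(p) + 1);
    }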
2023-05-24netdev: Enforce index cap in netdev_get_tx_queueNick Child1-0/+1
[ Upstream commit 1cc6571f562774f1d928dc8b3cff50829b86e970 ] When requesting a TX queue at a given index, warn on out-of-bounds referencing if the index is greater than the allocated number of queues. Specifically, since this function is used heavily in the networking stack, use DEBUG_NET_WARN_ON_ONCE to avoid executing a new branch on every packet. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://lore.kernel.org/r/20230321150725.127229-2-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
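The guarded accessor looks essentially like the sketch below (reconstructed from the description; the real definition lives in include/linux/netdevice.h):

    static inline struct netdev_queue *
    netdev_get_tx_queue(const struct net_device *dev, unsigned int index)
    {
            /* Compiled out unless CONFIG_DEBUG_NET: no per-packet cost */
            DEBUG_NET_WARN_ON_ONCE(index >= dev->num_tx_queues);
            return &dev->_tx[index];
    }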
2023-05-24irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4Shanker Donthineni1-0/+18
[ Upstream commit 35727af2b15d98a2dd2811d631d3a3886111312e ] The T241 platform suffers from the T241-FABRIC-4 erratum which causes unexpected behavior in the GIC when multiple transactions are received simultaneously from different sources. This hardware issue impacts NVIDIA server platforms that use more than two T241 chips interconnected. Each chip has support for 320 {E}SPIs. This issue occurs when multiple packets from different GICs are incorrectly interleaved at the target chip. The erratum text below specifies exactly what can cause multiple transfer packets susceptible to interleaving and GIC state corruption. GIC state corruption can lead to a range of problems, including kernel panics and unexpected behavior. From the erratum text:
"In some cases, inter-socket AXI4 Stream packets with multiple transfers, may be interleaved by the fabric when presented to ARM Generic Interrupt Controller. GIC expects all transfers of a packet to be delivered without any interleaving. The following GICv3 commands may result in multiple transfer packets over inter-socket AXI4 Stream interface:
- Register reads from GICD_I* and GICD_N*
- Register writes to 64-bit GICD registers other than GICD_IROUTERn*
- ITS command MOVALL
Multiple commands in GICv4+ utilize multiple transfer packets, including VMOVP, VMOVI, VMAPP, and 64-bit register accesses.
This issue impacts system configurations with more than 2 sockets, that require multi-transfer packets to be sent over inter-socket AXI4 Stream interface between GIC instances on different sockets. GICv4 cannot be supported. GICv3 SW model can only be supported with the workaround. Single and Dual socket configurations are not impacted by this issue and support GICv3 and GICv4."
Link: https://developer.nvidia.com/docs/t241-fabric-4/nvidia-t241-fabric-4-errata.pdf
Writing to the chip alias region of the GICD_In{E} registers except GICD_ICENABLERn has an equivalent effect as writing to the global distributor. The SPI interrupt deactivate path is not impacted by the erratum. To fix this problem, implement a workaround that ensures read accesses to the GICD_In{E} registers are directed to the chip that owns the SPI, and disable GICv4.x features. To simplify code changes, the gic_configure_irq() function uses the same alias region for both read and write operations to GICD_ICFGR. Co-developed-by: Vikram Sethi <vsethi@nvidia.com> Signed-off-by: Vikram Sethi <vsethi@nvidia.com> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> (for SMCCC/SOC ID bits) Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230319024314.3540573-2-sdonthineni@nvidia.com Signed-off-by: Sasha Levin <sashal@kernel.org>
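A very rough sketch of the workaround's core idea -- routing GICD_In{E} reads through the alias region of the chip that owns the SPI. All helper and array names here are hypothetical; only GICD_ISPENDR is a real GICv3 register offset:

    #include <linux/irqchip/arm-gic-v3.h>

    /* Hypothetical: per-chip alias bases discovered at probe time. */
    static void __iomem *t241_dist_alias_base[4];

    static void __iomem *gicd_base_for_spi(u32 spi)
    {
            unsigned int chip = t241_spi_owner(spi);  /* hypothetical */

            /* Reads must target the owning chip's alias region so the
             * fabric never interleaves multi-transfer packets. */
            return t241_dist_alias_base[chip];
    }

    static u32 gic_read_ispendr(u32 spi)
    {
            void __iomem *base = gicd_base_for_spi(spi);

            return readl_relaxed(base + GICD_ISPENDR + (spi / 32) * 4);
    }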