summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2023-08-30netfilter: nf_tables: fix kdoc warnings after gc reworkFlorian Westphal1-0/+1
commit 08713cb006b6f07434f276c5ee214fb20c7fd965 upstream. Jakub Kicinski says: We've got some new kdoc warnings here: net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc' net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc' include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set' Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30clk: Fix undefined reference to `clk_rate_exclusive_{get,put}'Biju Das1-40/+40
[ Upstream commit 2746f13f6f1df7999001d6595b16f789ecc28ad1 ] The COMMON_CLK config is not enabled in some of the architectures. This causes below build issues: pwm-rz-mtu3.c:(.text+0x114): undefined reference to `clk_rate_exclusive_put' pwm-rz-mtu3.c:(.text+0x32c): undefined reference to `clk_rate_exclusive_get' Fix these issues by moving clk_rate_exclusive_{get,put} inside COMMON_CLK code block, as clk.c is enabled by COMMON_CLK. Fixes: 55e9b8b7b806 ("clk: add clk_rate_exclusive api") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/all/202307251752.vLfmmhYm-lkp@intel.com/ Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Link: https://lore.kernel.org/r/20230725175140.361479-1-biju.das.jz@bp.renesas.com Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30scsi: core: raid_class: Remove raid_component_add()Zhu Wang1-4/+0
commit 60c5fd2e8f3c42a5abc565ba9876ead1da5ad2b7 upstream. The raid_component_add() function was added to the kernel tree via patch "[SCSI] embryonic RAID class" (2005). Remove this function since it never has had any callers in the Linux kernel. And also raid_component_release() is only used in raid_component_add(), so it is also removed. Signed-off-by: Zhu Wang <wangzhu9@huawei.com> Link: https://lore.kernel.org/r/20230822015254.184270-1-wangzhu9@huawei.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Fixes: 04b5b5cb0136 ("scsi: core: Fix possible memory leak if device_add() fails") Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30cgroup/cpuset: Free DL BW in case can_attach() failsDietmar Eggemann1-1/+1
commit 2ef269ef1ac006acf974793d975539244d77b28f upstream. cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks have been checked. DL BW is not allocated per-task but as a sum over all DL tasks migrating. If multiple controllers are attached to the cgroup next to the cpuset controller a non-cpuset can_attach() can fail. In this case free DL BW in cpuset_cancel_attach(). Finally, update cpuset DL task count (nr_deadline_tasks) only in cpuset_attach(). Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30sched/deadline: Create DL BW alloc, free & check overflow interfaceDietmar Eggemann1-0/+2
commit 85989106feb734437e2d598b639991b9185a43a6 upstream. While moving a set of tasks between exclusive cpusets, cpuset_can_attach() -> task_can_attach() calls dl_cpu_busy(..., p) for DL BW overflow checking and per-task DL BW allocation on the destination root_domain for the DL tasks in this set. This approach has the issue of not freeing already allocated DL BW in the following error cases: (1) The set of tasks includes multiple DL tasks and DL BW overflow checking fails for one of the subsequent DL tasks. (2) Another controller next to the cpuset controller which is attached to the same cgroup fails in its can_attach(). To address this problem rework dl_cpu_busy(): (1) Split it into dl_bw_check_overflow() & dl_bw_alloc() and add a dedicated dl_bw_free(). (2) dl_bw_alloc() & dl_bw_free() take a `u64 dl_bw` parameter instead of a `struct task_struct *p` used in dl_cpu_busy(). This allows to allocate DL BW for a set of tasks too rather than only for a single task. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30sched/cpuset: Keep track of SCHED_DEADLINE task in cpusetsJuri Lelli1-0/+4
commit 6c24849f5515e4966d94fa5279bdff4acf2e9489 upstream. Qais reported that iterating over all tasks when rebuilding root domains for finding out which ones are DEADLINE and need their bandwidth correctly restored on such root domains can be a costly operation (10+ ms delays on suspend-resume). To fix the problem keep track of the number of DEADLINE tasks belonging to each cpuset and then use this information (followup patch) to only perform the above iteration if DEADLINE tasks are actually present in the cpuset for which a corresponding root domain is being rebuilt. Reported-by: Qais Yousef (Google) <qyousef@layalina.io> Link: https://lore.kernel.org/lkml/20230206221428.2125324-1-qyousef@layalina.io/ Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30sched/cpuset: Bring back cpuset_mutexJuri Lelli1-4/+4
commit 111cd11bbc54850f24191c52ff217da88a5e639b upstream. Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea, as it has been reported to cause slowdowns in workloads that need to change cpuset configuration frequently and it is also not implementing priority inheritance (which causes troubles with realtime workloads). Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it only for SCHED_DEADLINE tasks (other policies don't care about stable cpusets anyway). Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30drm/display/dp: Fix the DP DSC Receiver cap sizeAnkit Nautiyal1-1/+1
commit 5ad1ab30ac0809d2963ddcf39ac34317a24a2f17 upstream. DP DSC Receiver Capabilities are exposed via DPCD 60h-6Fh. Fix the DSC RECEIVER CAP SIZE accordingly. Fixes: ffddc4363c28 ("drm/dp: Add DP DSC DPCD receiver capability size define and missing SHIFT") Cc: Anusha Srivatsa <anusha.srivatsa@intel.com> Cc: Manasi Navare <manasi.d.navare@intel.com> Cc: <stable@vger.kernel.org> # v5.0+ Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com> Reviewed-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230818044436.177806-1-ankit.k.nautiyal@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30drm: Add an HPD poll helper to reschedule the poll workImre Deak1-0/+1
commit a94e7ccfc400c024976f3c2f31689ed843498b7c upstream. Add a helper to reschedule drm_mode_config::output_poll_work after polling has been enabled for a connector (and needing a reschedule, since previously polling was disabled for all connectors and hence output_poll_work was not running). This is needed by the next patch fixing HPD polling on i915. CC: stable@vger.kernel.org # 6.4+ Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: dri-devel@lists.freedesktop.org Reviewed-by: Jouni Högander <jouni.hogander@intel.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Signed-off-by: Imre Deak <imre.deak@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230822113015.41224-1-imre.deak@intel.com (cherry picked from commit fe2352fd64029918174de4b460dfe6df0c6911cd) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULTDavid Hildenbrand2-6/+24
commit d74943a2f3cdade34e471b36f55f7979be656867 upstream. Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting or due to inaccessible (PROT_NONE) VMAs. As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from gup/gup_fast"): "Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages." liubo reported [1] that smaps_rollup results are imprecise, because they miss accounting of pages that are mapped PROT_NONE. Further, it's easy to reproduce that KSM no longer works on inaccessible VMAs on x86-64, because pte_protnone()/pmd_protnone() also indictaes "true" in inaccessible VMAs, and follow_page() refuses to return such pages right now. As KVM really depends on these NUMA hinting faults, removing the pte_protnone()/pmd_protnone() handling in GUP code completely is not really an option. To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT to restore the original behavior for now and add better comments. Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in is_valid_gup_args(), to add that flag for all external GUP users. Note that there are three GUP-internal __get_user_pages() users that don't end up calling is_valid_gup_args() and consequently won't get FOLL_HONOR_NUMA_FAULT set. 1) get_dump_page(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE and wouldn't have honored NUMA hinting faults already. 2) populate_vma_page_range(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have honored NUMA hinting faults already. 3) faultin_vma_page_range(): we similarly don't want to handle NUMA hinting faults. To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in inaccessible VMAs properly, we have to perform VMA accessibility checks in gup_can_follow_protnone(). As GUP-fast should reject such pages either way in pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and arm64 that both implement pte_protnone() -- let's just always fallback to ordinary GUP when stumbling over pte_protnone()/pmd_protnone(). As Linus notes [2], honoring NUMA faults might only make sense for selected GUP users. So we should really see if we can instead let relevant GUP callers specify it manually, and not trigger NUMA hinting faults from GUP as default. Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and adding appropriate documenation. While at it, remove a stale comment from follow_trans_huge_pmd(): That comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against THP migration"), which noted: THP does not unmap pages due to a lack of support for migration entries at a PMD level. This allows races with get_user_pages Nowadays, we do have PMD migration entries, so the comment no longer applies. Let's drop it. [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com Link: https://lkml.kernel.org/r/20230803143208.383663-2-david@redhat.com Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: liubo <liubo254@huawei.com> Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com Reported-by: Peter Xu <peterx@redhat.com> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/ Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30mm: enable page walking API to lock vmas during the walkSuren Baghdasaryan1-0/+11
commit 49b0638502da097c15d46cd4e871dbaa022caf7c upstream. walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock, operate correctly after introduction of per-vma locks. With per-vma locks page faults can be handled under vma lock without taking mmap_lock at all, so write locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change a concurrent page can be faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Suggested-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30wifi: mac80211: limit reorder_buf_filtered to avoid UBSAN warningPing-Ke Shih1-0/+1
commit b98c16107cc1647242abbd11f234c05a3a5864f6 upstream. The commit 06470f7468c8 ("mac80211: add API to allow filtering frames in BA sessions") added reorder_buf_filtered to mark frames filtered by firmware, and it can only work correctly if hw.max_rx_aggregation_subframes <= 64 since it stores the bitmap in a u64 variable. However, new HE or EHT devices can support BlockAck number up to 256 or 1024, and then using a higher subframe index leads UBSAN warning: UBSAN: shift-out-of-bounds in net/mac80211/rx.c:1129:39 shift exponent 215 is too large for 64-bit type 'long long unsigned int' Call Trace: <IRQ> dump_stack_lvl+0x48/0x70 dump_stack+0x10/0x20 __ubsan_handle_shift_out_of_bounds+0x1ac/0x360 ieee80211_release_reorder_frame.constprop.0.cold+0x64/0x69 [mac80211] ieee80211_sta_reorder_release+0x9c/0x400 [mac80211] ieee80211_prepare_and_rx_handle+0x1234/0x1420 [mac80211] ieee80211_rx_list+0xaef/0xf60 [mac80211] ieee80211_rx_napi+0x53/0xd0 [mac80211] Since only old hardware that supports <=64 BlockAck uses ieee80211_mark_rx_ba_filtered_frames(), limit the use as it is, so add a WARN_ONCE() and comment to note to avoid using this function if hardware capability is not suitable. Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/r/20230818014004.16177-1-pkshih@realtek.com [edit commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30bonding: fix macvlan over alb bond supportHangbin Liu1-10/+1
[ Upstream commit e74216b8def3803e98ae536de78733e9d7f3b109 ] The commit 14af9963ba1e ("bonding: Support macvlans on top of tlb/rlb mode bonds") aims to enable the use of macvlans on top of rlb bond mode. However, the current rlb bond mode only handles ARP packets to update remote neighbor entries. This causes an issue when a macvlan is on top of the bond, and remote devices send packets to the macvlan using the bond's MAC address as the destination. After delivering the packets to the macvlan, the macvlan will rejects them as the MAC address is incorrect. Consequently, this commit makes macvlan over bond non-functional. To address this problem, one potential solution is to check for the presence of a macvlan port on the bond device using netif_is_macvlan_port(bond->dev) and return NULL in the rlb_arp_xmit() function. However, this approach doesn't fully resolve the situation when a VLAN exists between the bond and macvlan. So let's just do a partial revert for commit 14af9963ba1e in rlb_arp_xmit(). As the comment said, Don't modify or load balance ARPs that do not originate locally. Fixes: 14af9963ba1e ("bonding: Support macvlans on top of tlb/rlb mode bonds") Reported-by: susan.zheng@veritas.com Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2117816 Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30netfilter: nf_tables: defer gc run if previous batch is still pendingFlorian Westphal1-0/+5
[ Upstream commit 8e51830e29e12670b4c10df070a4ea4c9593e961 ] Don't queue more gc work, else we may queue the same elements multiple times. If an element is flagged as dead, this can mean that either the previous gc request was invalidated/discarded by a transaction or that the previous request is still pending in the system work queue. The latter will happen if the gc interval is set to a very low value, e.g. 1ms, and system work queue is backlogged. The sets refcount is 1 if no previous gc requeusts are queued, so add a helper for this and skip gc run if old requests are pending. Add a helper for this and skip the gc run in this case. Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30netfilter: nf_tables: validate all pending tablesFlorian Westphal1-0/+1
[ Upstream commit 4b80ced971b0d118f9a11dd503a5833a5016de92 ] We have to validate all tables in the transaction that are in VALIDATE_DO state, the blamed commit below did not move the break statement to its right location so we only validate one table. Moreover, we can't init table->validate to _SKIP when a table object is allocated. If we do, then if a transcaction creates a new table and then fails the transaction, nfnetlink will loop and nft will hang until user cancels the command. Add back the pernet state as a place to stash the last state encountered. This is either _DO (we hit an error during commit validation) or _SKIP (transaction passed all checks). Fixes: 00c320f9b755 ("netfilter: nf_tables: make validation state per table") Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30ipv4: fix data-races around inet->inet_idEric Dumazet2-3/+14
[ Upstream commit f866fbc842de5976e41ba874b76ce31710b634b5 ] UDP sendmsg() is lockless, so ip_select_ident_segs() can very well be run from multiple cpus [1] Convert inet->inet_id to an atomic_t, but implement a dedicated path for TCP, avoiding cost of a locked instruction (atomic_add_return()) Note that this patch will cause a trivial merge conflict because we added inet->flags in net-next tree. v2: added missing change in drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c (David Ahern) [1] BUG: KCSAN: data-race in __ip_make_skb / __ip_make_skb read-write to 0xffff888145af952a of 2 bytes by task 7803 on cpu 1: ip_select_ident_segs include/net/ip.h:542 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x844/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff888145af952a of 2 bytes by task 7804 on cpu 0: ip_select_ident_segs include/net/ip.h:541 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x817/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x184d -> 0x184e Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 7804 Comm: syz-executor.1 Not tainted 6.5.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 ================================================================== Fixes: 23f57406b82d ("ipv4: avoid using shared IP generator for connected sockets") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30net: validate veth and vxcan peer ifindexesJakub Kicinski1-2/+2
[ Upstream commit f534f6581ec084fe94d6759f7672bd009794b07e ] veth and vxcan need to make sure the ifindexes of the peer are not negative, core does not validate this. Using iproute2 with user-space-level checking removed: Before: # ./ip link add index 10 type veth peer index -1 # ip link show 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 52:54:00:74:b2:03 brd ff:ff:ff:ff:ff:ff 10: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 8a:90:ff:57:6d:5d brd ff:ff:ff:ff:ff:ff -1: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether ae:ed:18:e6:fa:7f brd ff:ff:ff:ff:ff:ff Now: $ ./ip link add index 10 type veth peer index -1 Error: ifindex can't be negative. This problem surfaced in net-next because an explicit WARN() was added, the root cause is older. Fixes: e6f8f1a739b6 ("veth: Allow to create peer link with given ifindex") Fixes: a8f820a380a2 ("can: add Virtual CAN Tunnel driver (vxcan)") Reported-by: syzbot+5ba06978f34abb058571@syzkaller.appspotmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30sock: annotate data-races around prot->memory_pressureEric Dumazet1-3/+4
[ Upstream commit 76f33296d2e09f63118db78125c95ef56df438e9 ] *prot->memory_pressure is read/writen locklessly, we need to add proper annotations. A recent commit added a new race, it is time to audit all accesses. Fixes: 2d0c88e84e48 ("sock: Fix misuse of sk_under_memory_pressure()") Fixes: 4d93df0abd50 ("[SCTP]: Rewrite of sctp buffer management code") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Abel Wu <wuyun.abel@bytedance.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Link: https://lore.kernel.org/r/20230818015132.2699348-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30tracing/synthetic: Use union instead of castsSven Schnelle1-0/+11
[ Upstream commit ddeea494a16f32522bce16ee65f191d05d4b8282 ] The current code uses a lot of casts to access the fields member in struct synth_trace_events with different sizes. This makes the code hard to read, and had already introduced an endianness bug. Use a union and struct instead. Link: https://lkml.kernel.org/r/20230816154928.4171614-2-svens@linux.ibm.com Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 00cf3d672a9dd ("tracing: Allow synthetic events to pass around stacktraces") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: 887f92e09ef3 ("tracing/synthetic: Skip first entry for stack traces") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30jbd2: fix a race when checking checkpoint buffer busyZhang Yi1-0/+1
[ Upstream commit 46f881b5b1758dc4a35fba4a643c10717d0cf427 ] Before removing checkpoint buffer from the t_checkpoint_list, we have to check both BH_Dirty and BH_Lock bits together to distinguish buffers have not been or were being written back. But __cp_buffer_busy() checks them separately, it first check lock state and then check dirty, the window between these two checks could be raced by writing back procedure, which locks buffer and clears buffer dirty before I/O completes. So it cannot guarantee checkpointing buffers been written back to disk if some error happens later. Finally, it may clean checkpoint transactions and lead to inconsistent filesystem. jbd2_journal_forget() and __journal_try_to_free_buffer() also have the same problem (journal_unmap_buffer() escape from this issue since it's running under the buffer lock), so fix them through introducing a new helper to try holding the buffer lock and remove really clean buffer. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490 Cc: stable@vger.kernel.org Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230606135928.434610-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30jbd2: remove journal_clean_one_cp_list()Zhang Yi1-8/+4
[ Upstream commit b98dba273a0e47dbfade89c9af73c5b012a4eabb ] journal_clean_one_cp_list() and journal_shrink_one_cp_list() are almost the same, so merge them into journal_shrink_one_cp_list(), remove the nr_to_scan parameter, always scan and try to free the whole checkpoint list. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230606135928.434610-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 46f881b5b175 ("jbd2: fix a race when checking checkpoint buffer busy") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30jbd2: remove t_checkpoint_io_listZhang Yi1-6/+0
[ Upstream commit be22255360f80d3af789daad00025171a65424a5 ] Since t_checkpoint_io_list was stop using in jbd2_log_do_checkpoint() now, it's time to remove the whole t_checkpoint_io_list logic. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230606135928.434610-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 46f881b5b175 ("jbd2: fix a race when checking checkpoint buffer busy") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-23Revert "drm/edid: Fix csync detailed mode parsing"Jani Nikula1-9/+3
commit 50b6f2c8297793f7f3315623db78dcff85158e96 upstream. This reverts commit ca62297b2085b5b3168bd891ca24862242c635a1. Commit ca62297b2085 ("drm/edid: Fix csync detailed mode parsing") fixed EDID detailed mode sync parsing. Unfortunately, there are quite a few displays out there that have bogus (zero) sync field that are broken by the change. Zero means analog composite sync, which is not right for digital displays, and the modes get rejected. Regardless, it used to work, and it needs to continue to work. Revert the change. Rejecting modes with analog composite sync was the part that fixed the gitlab issue 8146 [1]. We'll need to get back to the drawing board with that. [1] https://gitlab.freedesktop.org/drm/intel/-/issues/8146 Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8789 Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8930 Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/9044 Fixes: ca62297b2085 ("drm/edid: Fix csync detailed mode parsing") Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: <stable@vger.kernel.org> # v6.4+ Signed-off-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230815101907.2900768-1-jani.nikula@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-23net: do not allow gso_size to be set to GSO_BY_FRAGSEric Dumazet1-0/+4
[ Upstream commit b616be6b97688f2f2bd7c4a47ab32f27f94fb2a9 ] One missing check in virtio_net_hdr_to_skb() allowed syzbot to crash kernels again [1] Do not allow gso_size to be set to GSO_BY_FRAGS (0xffff), because this magic value is used by the kernel. [1] general protection fault, probably for non-canonical address 0xdffffc000000000e: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077] CPU: 0 PID: 5039 Comm: syz-executor401 Not tainted 6.5.0-rc5-next-20230809-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 RIP: 0010:skb_segment+0x1a52/0x3ef0 net/core/skbuff.c:4500 Code: 00 00 00 e9 ab eb ff ff e8 6b 96 5d f9 48 8b 84 24 00 01 00 00 48 8d 78 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e ea 21 00 00 48 8b 84 24 00 01 RSP: 0018:ffffc90003d3f1c8 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 000000000001fffe RCX: 0000000000000000 RDX: 000000000000000e RSI: ffffffff882a3115 RDI: 0000000000000070 RBP: ffffc90003d3f378 R08: 0000000000000005 R09: 000000000000ffff R10: 000000000000ffff R11: 5ee4a93e456187d6 R12: 000000000001ffc6 R13: dffffc0000000000 R14: 0000000000000008 R15: 000000000000ffff FS: 00005555563f2380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020020000 CR3: 000000001626d000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> udp6_ufo_fragment+0x9d2/0xd50 net/ipv6/udp_offload.c:109 ipv6_gso_segment+0x5c4/0x17b0 net/ipv6/ip6_offload.c:120 skb_mac_gso_segment+0x292/0x610 net/core/gso.c:53 __skb_gso_segment+0x339/0x710 net/core/gso.c:124 skb_gso_segment include/net/gso.h:83 [inline] validate_xmit_skb+0x3a5/0xf10 net/core/dev.c:3625 __dev_queue_xmit+0x8f0/0x3d60 net/core/dev.c:4329 dev_queue_xmit include/linux/netdevice.h:3082 [inline] packet_xmit+0x257/0x380 net/packet/af_packet.c:276 packet_snd net/packet/af_packet.c:3087 [inline] packet_sendmsg+0x24c7/0x5570 net/packet/af_packet.c:3119 sock_sendmsg_nosec net/socket.c:727 [inline] sock_sendmsg+0xd9/0x180 net/socket.c:750 ____sys_sendmsg+0x6ac/0x940 net/socket.c:2496 ___sys_sendmsg+0x135/0x1d0 net/socket.c:2550 __sys_sendmsg+0x117/0x1e0 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7ff27cdb34d9 Fixes: 3953c46c3ac7 ("sk_buff: allow segmenting based on frag sizes") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://lore.kernel.org/r/20230816142158.1779798-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-23sock: Fix misuse of sk_under_memory_pressure()Abel Wu1-0/+6
[ Upstream commit 2d0c88e84e483982067a82073f6125490ddf3614 ] The status of global socket memory pressure is updated when: a) __sk_mem_raise_allocated(): enter: sk_memory_allocated(sk) > sysctl_mem[1] leave: sk_memory_allocated(sk) <= sysctl_mem[0] b) __sk_mem_reduce_allocated(): leave: sk_under_memory_pressure(sk) && sk_memory_allocated(sk) < sysctl_mem[0] So the conditions of leaving global pressure are inconstant, which may lead to the situation that one pressured net-memcg prevents the global pressure from being cleared when there is indeed no global pressure, thus the global constrains are still in effect unexpectedly on the other sockets. This patch fixes this by ignoring the net-memcg's pressure when deciding whether should leave global memory pressure. Fixes: e1aab161e013 ("socket: initial cgroup code.") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Link: https://lore.kernel.org/r/20230816091226.1542-1-wuyun.abel@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-23xfrm: don't skip free of empty state in acquire policyLeon Romanovsky1-0/+1
[ Upstream commit f3ec2b5d879ef5bbcb24678914641343cb6399a2 ] In destruction flow, the assignment of NULL to xso->dev caused to skip of xfrm_dev_state_free() call, which was called in xfrm_state_put(to_put) routine. Instead of open-coded variant of xfrm_dev_state_delete() and xfrm_dev_state_free(), let's use them directly. Fixes: f8a70afafc17 ("xfrm: add TX datapath support for IPsec packet offload mode") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-23media: v4l2-mem2mem: add lock to protect parameter num_rdyYunfei Dong1-2/+16
[ Upstream commit 56b5c3e67b0f9af3f45cf393be048ee8d8a92694 ] Getting below error when using KCSAN to check the driver. Adding lock to protect parameter num_rdy when getting the value with function: v4l2_m2m_num_src_bufs_ready/v4l2_m2m_num_dst_bufs_ready. kworker/u16:3: [name:report&]BUG: KCSAN: data-race in v4l2_m2m_buf_queue kworker/u16:3: [name:report&] kworker/u16:3: [name:report&]read-write to 0xffffff8105f35b94 of 1 bytes by task 20865 on cpu 7: kworker/u16:3:  v4l2_m2m_buf_queue+0xd8/0x10c Signed-off-by: Pina Chen <pina.chen@mediatek.com> Signed-off-by: Yunfei Dong <yunfei.dong@mediatek.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-23iopoll: Call cpu_relax() in busy loopsGeert Uytterhoeven1-0/+2
[ Upstream commit b407460ee99033503993ac7437d593451fcdfe44 ] It is considered good practice to call cpu_relax() in busy loops, see Documentation/process/volatile-considered-harmful.rst. This can not only lower CPU power consumption or yield to a hyperthreaded twin processor, but also allows an architecture to mitigate hardware issues (e.g. ARM Erratum 754327 for Cortex-A9 prior to r2p0) in the architecture-specific cpu_relax() implementation. In addition, cpu_relax() is also a compiler barrier. It is not immediately obvious that the @op argument "function" will result in an actual function call (e.g. in case of inlining). Where a function call is a C sequence point, this is lost on inlining. Therefore, with agressive enough optimization it might be possible for the compiler to hoist the: (val) = op(args); "load" out of the loop because it doesn't see the value changing. The addition of cpu_relax() would inhibit this. As the iopoll helpers lack calls to cpu_relax(), people are sometimes reluctant to use them, and may fall back to open-coded polling loops (including cpu_relax() calls) instead. Fix this by adding calls to cpu_relax() to the iopoll helpers: - For the non-atomic case, it is sufficient to call cpu_relax() in case of a zero sleep-between-reads value, as a call to usleep_range() is a safe barrier otherwise. However, it doesn't hurt to add the call regardless, for simplicity, and for similarity with the atomic case below. - For the atomic case, cpu_relax() must be called regardless of the sleep-between-reads value, as there is no guarantee all architecture-specific implementations of udelay() handle this. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Tony Lindgren <tony@atomide.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/45c87bec3397fdd704376807f0eec5cc71be440f.1685692810.git.geert+renesas@glider.be Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-23RDMA/mana_ib: Use v2 version of cfg_rx_steer_req to enable RX coalescingLong Li1-1/+3
[ Upstream commit 2145328515c8fa9b8a9f7889250bc6c032f2a0e6 ] With RX coalescing, one CQE entry can be used to indicate multiple packets on the receive queue. This saves processing time and PCI bandwidth over the CQ. The MANA Ethernet driver also uses the v2 version of the protocol. It doesn't use RX coalescing and its behavior is not changed. Link: https://lore.kernel.org/r/1684045095-31228-1-git-send-email-longli@linuxonhyperv.com Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-16wifi: cfg80211: fix sband iftype data lookup for AP_VLANFelix Fietkau1-0/+3
commit 5fb9a9fb71a33be61d7d8e8ba4597bfb18d604d0 upstream. AP_VLAN interfaces are virtual, so doesn't really exist as a type for capabilities. When passed in as a type, AP is the one that's really intended. Fixes: c4cbaf7973a7 ("cfg80211: Add support for HE") Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20230622165919.46841-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-16tcp: add missing family to tcp_set_ca_state() tracepointEric Dumazet1-1/+4
commit 8a70ed9520c5fafaac91053cacdd44625c39e188 upstream. Before this code is copied, add the missing family, as we did in commit 3dd344ea84e1 ("net: tracepoint: exposing sk_family in all tcp:tracepoints") Fixes: 15fcdf6ae116 ("tcp: Add tracepoint for tcp_set_ca_state") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ping Gan <jacky_gam_2001@163.com> Cc: Manjusaka <me@manjusaka.me> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230808084923.2239142-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-16bpf, sockmap: Fix bug that strp_done cannot be calledXu Kuohai1-0/+1
commit 809e4dc71a0f2b8d2836035d98603694fff11d5d upstream. strp_done is only called when psock->progs.stream_parser is not NULL, but stream_parser was set to NULL by sk_psock_stop_strp(), called by sk_psock_drop() earlier. So, strp_done can never be called. Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock. Change the condition for calling strp_done from judging whether stream_parser is set to judging whether this flag is set. This flag is only set once when strp_init() succeeds, and will never be cleared later. Fixes: c0d95d3380ee ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230804073740.194770-3-xukuohai@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-16netfilter: nf_tables: GC transaction API to avoid race with control planePablo Neira Ayuso1-1/+63
commit 5f68718b34a531a556f2f50300ead2862278da26 upstream. The set types rhashtable and rbtree use a GC worker to reclaim memory. From system work queue, in periodic intervals, a scan of the table is done. The major caveat here is that the nft transaction mutex is not held. This causes a race between control plane and GC when they attempt to delete the same element. We cannot grab the netlink mutex from the work queue, because the control plane has to wait for the GC work queue in case the set is to be removed, so we get following deadlock: cpu 1 cpu2 GC work transaction comes in , lock nft mutex `acquire nft mutex // BLOCKS transaction asks to remove the set set destruction calls cancel_work_sync() cancel_work_sync will now block forever, because it is waiting for the mutex the caller already owns. This patch adds a new API that deals with garbage collection in two steps: 1) Lockless GC of expired elements sets on the NFT_SET_ELEM_DEAD_BIT so they are not visible via lookup. Annotate current GC sequence in the GC transaction. Enqueue GC transaction work as soon as it is full. If ruleset is updated, then GC transaction is aborted and retried later. 2) GC work grabs the mutex. If GC sequence has changed then this GC transaction lost race with control plane, abort it as it contains stale references to objects and let GC try again later. If the ruleset is intact, then this GC transaction deactivates and removes the elements and it uses call_rcu() to destroy elements. Note that no elements are removed from GC lockless path, the _DEAD bit is set and pointers are collected. GC catchall does not remove the elements anymore too. There is a new set->dead flag that is set on to abort the GC transaction to deal with set->ops->destroy() path which removes the remaining elements in the set from commit_release, where no mutex is held. To deal with GC when mutex is held, which allows safe deactivate and removal, add sync GC API which releases the set element object via call_rcu(). This is used by rbtree and pipapo backends which also perform garbage collection from control plane path. Since element removal from sets can happen from control plane and element garbage collection/timeout, it is necessary to keep the set structure alive until all elements have been deactivated and destroyed. We cannot do a cancel_work_sync or flush_work in nft_set_destroy because its called with the transaction mutex held, but the aforementioned async work queue might be blocked on the very mutex that nft_set_destroy() callchain is sitting on. This gives us the choice of ABBA deadlock or UaF. To avoid both, add set->refs refcount_t member. The GC API can then increment the set refcount and release it once the elements have been free'd. Set backends are adapted to use the GC transaction API in a follow up patch entitled: ("netfilter: nf_tables: use gc transaction API in set backends") This is joint work with Florian Westphal. Fixes: cfed7e1b1f8e ("netfilter: nf_tables: add set garbage collection helpers") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-16x86/speculation: Add cpu_show_gds() prototypeArnd Bergmann1-0/+2
commit a57c27c7ad85c420b7de44c6ee56692d51709dda upstream. The newly added function has two definitions but no prototypes: drivers/base/cpu.c:605:16: error: no previous prototype for 'cpu_show_gds' [-Werror=missing-prototypes] Add a declaration next to the other ones for this file to avoid the warning. Fixes: 8974eb588283b ("x86/speculation: Add Gather Data Sampling mitigation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Cc: stable@kernel.org Link: https://lore.kernel.org/all/20230809130530.1913368-1-arnd%40kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-16tpm: Disable RNG for all AMD fTPMsMario Limonciello1-0/+1
commit 554b841d470338a3b1d6335b14ee1cd0c8f5d754 upstream. The TPM RNG functionality is not necessary for entropy when the CPU already supports the RDRAND instruction. The TPM RNG functionality was previously disabled on a subset of AMD fTPM series, but reports continue to show problems on some systems causing stutter root caused to TPM RNG functionality. Expand disabling TPM RNG use for all AMD fTPMs whether they have versions that claim to have fixed or not. To accomplish this, move the detection into part of the TPM CRB registration and add a flag indicating that the TPM should opt-out of registration to hwrng. Cc: stable@vger.kernel.org # 6.1.y+ Fixes: b006c439d58d ("hwrng: core - start hwrng kthread also for untrusted sources") Fixes: f1324bbc4011 ("tpm: disable hwrng for fTPM on some AMD designs") Reported-by: daniil.stas@posteo.net Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217719 Reported-by: bitlord0xff@gmail.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217212 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11mtd: spi-nor: avoid holes in struct spi_mem_opArnd Bergmann1-0/+4
[ Upstream commit 71c8f9cf2623d0db79665f876b95afcdd8214aec ] gcc gets confused when -ftrivial-auto-var-init=pattern is used on sparse bit fields such as 'struct spi_mem_op', which caused the previous false positive warning about an uninitialized variable: drivers/mtd/spi-nor/spansion.c: error: 'op' is used uninitialized [-Werror=uninitialized] In fact, the variable is fully initialized and gcc does not see it being used, so the warning is entirely bogus. The problem appears to be a misoptimization in the initialization of single bit fields when the rest of the bytes are not initialized. A previous workaround added another initialization, which ended up shutting up the warning in spansion.c, though it apparently still happens in other files as reported by Peter Foley in the gcc bugzilla. The workaround of adding a fake initialization seems particularly bad because it would set values that can never be correct but prevent the compiler from warning about actually missing initializations. Revert the broken workaround and instead pad the structure to only have bitfields that add up to full bytes, which should avoid this behavior in all drivers. I also filed a new bug against gcc with what I found, so this can hopefully be addressed in future gcc releases. At the moment, only gcc-12 and gcc-13 are affected. Cc: Peter Foley <pefoley2@pefoley.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110743 Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108402 Link: https://godbolt.org/z/efMMsG1Kx Fixes: 420c4495b5e56 ("mtd: spi-nor: spansion: make sure local struct does not contain garbage") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Mark Brown <broonie@kernel.org> Acked-by: Tudor Ambarus <tudor.ambarus@linaro.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/linux-mtd/20230719190045.4007391-1-arnd@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11f2fs: fix to do sanity check on direct node in truncate_dnode()Chao Yu1-0/+1
commit a6ec83786ab9f13f25fb18166dee908845713a95 upstream. syzbot reports below bug: BUG: KASAN: slab-use-after-free in f2fs_truncate_data_blocks_range+0x122a/0x14c0 fs/f2fs/file.c:574 Read of size 4 at addr ffff88802a25c000 by task syz-executor148/5000 CPU: 1 PID: 5000 Comm: syz-executor148 Not tainted 6.4.0-rc7-syzkaller-00041-ge660abd551f1 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:351 print_report mm/kasan/report.c:462 [inline] kasan_report+0x11c/0x130 mm/kasan/report.c:572 f2fs_truncate_data_blocks_range+0x122a/0x14c0 fs/f2fs/file.c:574 truncate_dnode+0x229/0x2e0 fs/f2fs/node.c:944 f2fs_truncate_inode_blocks+0x64b/0xde0 fs/f2fs/node.c:1154 f2fs_do_truncate_blocks+0x4ac/0xf30 fs/f2fs/file.c:721 f2fs_truncate_blocks+0x7b/0x300 fs/f2fs/file.c:749 f2fs_truncate.part.0+0x4a5/0x630 fs/f2fs/file.c:799 f2fs_truncate include/linux/fs.h:825 [inline] f2fs_setattr+0x1738/0x2090 fs/f2fs/file.c:1006 notify_change+0xb2c/0x1180 fs/attr.c:483 do_truncate+0x143/0x200 fs/open.c:66 handle_truncate fs/namei.c:3295 [inline] do_open fs/namei.c:3640 [inline] path_openat+0x2083/0x2750 fs/namei.c:3791 do_filp_open+0x1ba/0x410 fs/namei.c:3818 do_sys_openat2+0x16d/0x4c0 fs/open.c:1356 do_sys_open fs/open.c:1372 [inline] __do_sys_creat fs/open.c:1448 [inline] __se_sys_creat fs/open.c:1442 [inline] __x64_sys_creat+0xcd/0x120 fs/open.c:1442 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The root cause is, inodeA references inodeB via inodeB's ino, once inodeA is truncated, it calls truncate_dnode() to truncate data blocks in inodeB's node page, it traverse mapping data from node->i.i_addr[0] to node->i.i_addr[ADDRS_PER_BLOCK() - 1], result in out-of-boundary access. This patch fixes to add sanity check on dnode page in truncate_dnode(), so that, it can help to avoid triggering such issue, and once it encounters such issue, it will record newly introduced ERROR_INVALID_NODE_REFERENCE error into superblock, later fsck can detect such issue and try repairing. Also, it removes f2fs_truncate_data_blocks() for cleanup due to the function has only one caller, and uses f2fs_truncate_data_blocks_range() instead. Reported-and-tested-by: syzbot+12cb4425b22169b52036@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/000000000000f3038a05fef867f8@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11vxlan: Fix nexthop hash sizeBenjamin Poirier1-2/+2
[ Upstream commit 0756384fb1bd38adb2ebcfd1307422f433a1d772 ] The nexthop code expects a 31 bit hash, such as what is returned by fib_multipath_hash() and rt6_multipath_hash(). Passing the 32 bit hash returned by skb_get_hash() can lead to problems related to the fact that 'int hash' is a negative number when the MSB is set. In the case of hash threshold nexthop groups, nexthop_select_path_hthr() will disproportionately select the first nexthop group entry. In the case of resilient nexthop groups, nexthop_select_path_res() may do an out of bounds access in nh_buckets[], for example: hash = -912054133 num_nh_buckets = 2 bucket_index = 65535 which leads to the following panic: BUG: unable to handle page fault for address: ffffc900025910c8 PGD 100000067 P4D 100000067 PUD 10026b067 PMD 0 Oops: 0002 [#1] PREEMPT SMP KASAN NOPTI CPU: 4 PID: 856 Comm: kworker/4:3 Not tainted 6.5.0-rc2+ #34 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Workqueue: ipv6_addrconf addrconf_dad_work RIP: 0010:nexthop_select_path+0x197/0xbf0 Code: c1 e4 05 be 08 00 00 00 4c 8b 35 a4 14 7e 01 4e 8d 6c 25 00 4a 8d 7c 25 08 48 01 dd e8 c2 25 15 ff 49 8d 7d 08 e8 39 13 15 ff <4d> 89 75 08 48 89 ef e8 7d 12 15 ff 48 8b 5d 00 e8 14 55 2f 00 85 RSP: 0018:ffff88810c36f260 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000002000c0 RCX: ffffffffaf02dd77 RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffffc900025910c8 RBP: ffffc900025910c0 R08: 0000000000000001 R09: fffff520004b2219 R10: ffffc900025910cf R11: 31392d2068736168 R12: 00000000002000c0 R13: ffffc900025910c0 R14: 00000000fffef608 R15: ffff88811840e900 FS: 0000000000000000(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc900025910c8 CR3: 0000000129d00000 CR4: 0000000000750ee0 PKRU: 55555554 Call Trace: <TASK> ? __die+0x23/0x70 ? page_fault_oops+0x1ee/0x5c0 ? __pfx_is_prefetch.constprop.0+0x10/0x10 ? __pfx_page_fault_oops+0x10/0x10 ? search_bpf_extables+0xfe/0x1c0 ? fixup_exception+0x3b/0x470 ? exc_page_fault+0xf6/0x110 ? asm_exc_page_fault+0x26/0x30 ? nexthop_select_path+0x197/0xbf0 ? nexthop_select_path+0x197/0xbf0 ? lock_is_held_type+0xe7/0x140 vxlan_xmit+0x5b2/0x2340 ? __lock_acquire+0x92b/0x3370 ? __pfx_vxlan_xmit+0x10/0x10 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_register_lock_class+0x10/0x10 ? skb_network_protocol+0xce/0x2d0 ? dev_hard_start_xmit+0xca/0x350 ? __pfx_vxlan_xmit+0x10/0x10 dev_hard_start_xmit+0xca/0x350 __dev_queue_xmit+0x513/0x1e20 ? __pfx___dev_queue_xmit+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x44/0x90 ? skb_push+0x4c/0x80 ? eth_header+0x81/0xe0 ? __pfx_eth_header+0x10/0x10 ? neigh_resolve_output+0x215/0x310 ? ip6_finish_output2+0x2ba/0xc90 ip6_finish_output2+0x2ba/0xc90 ? lock_release+0x236/0x3e0 ? ip6_mtu+0xbb/0x240 ? __pfx_ip6_finish_output2+0x10/0x10 ? find_held_lock+0x83/0xa0 ? lock_is_held_type+0xe7/0x140 ip6_finish_output+0x1ee/0x780 ip6_output+0x138/0x460 ? __pfx_ip6_output+0x10/0x10 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_ip6_finish_output+0x10/0x10 NF_HOOK.constprop.0+0xc0/0x420 ? __pfx_NF_HOOK.constprop.0+0x10/0x10 ? ndisc_send_skb+0x2c0/0x960 ? __pfx_lock_release+0x10/0x10 ? __local_bh_enable_ip+0x93/0x110 ? lock_is_held_type+0xe7/0x140 ndisc_send_skb+0x4be/0x960 ? __pfx_ndisc_send_skb+0x10/0x10 ? mark_held_locks+0x65/0x90 ? find_held_lock+0x83/0xa0 ndisc_send_ns+0xb0/0x110 ? __pfx_ndisc_send_ns+0x10/0x10 addrconf_dad_work+0x631/0x8e0 ? lock_acquire+0x180/0x3f0 ? __pfx_addrconf_dad_work+0x10/0x10 ? mark_held_locks+0x24/0x90 process_one_work+0x582/0x9c0 ? __pfx_process_one_work+0x10/0x10 ? __pfx_do_raw_spin_lock+0x10/0x10 ? mark_held_locks+0x24/0x90 worker_thread+0x93/0x630 ? __kthread_parkme+0xdc/0x100 ? __pfx_worker_thread+0x10/0x10 kthread+0x1a5/0x1e0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x60 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 RIP: 0000:0x0 Code: Unable to access opcode bytes at 0xffffffffffffffd6. RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Modules linked in: CR2: ffffc900025910c8 ---[ end trace 0000000000000000 ]--- RIP: 0010:nexthop_select_path+0x197/0xbf0 Code: c1 e4 05 be 08 00 00 00 4c 8b 35 a4 14 7e 01 4e 8d 6c 25 00 4a 8d 7c 25 08 48 01 dd e8 c2 25 15 ff 49 8d 7d 08 e8 39 13 15 ff <4d> 89 75 08 48 89 ef e8 7d 12 15 ff 48 8b 5d 00 e8 14 55 2f 00 85 RSP: 0018:ffff88810c36f260 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000002000c0 RCX: ffffffffaf02dd77 RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffffc900025910c8 RBP: ffffc900025910c0 R08: 0000000000000001 R09: fffff520004b2219 R10: ffffc900025910cf R11: 31392d2068736168 R12: 00000000002000c0 R13: ffffc900025910c0 R14: 00000000fffef608 R15: ffff88811840e900 FS: 0000000000000000(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 0000000129d00000 CR4: 0000000000750ee0 PKRU: 55555554 Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: 0x2ca00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Fix this problem by ensuring the MSB of hash is 0 using a right shift - the same approach used in fib_multipath_hash() and rt6_multipath_hash(). Fixes: 1274e1cc4226 ("vxlan: ecmp support for mac fdb entries") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11net: annotate data-races around sk->sk_markEric Dumazet3-6/+7
[ Upstream commit 3c5b4d69c358a9275a8de98f87caf6eda644b086 ] sk->sk_mark is often read while another thread could change the value. Fixes: 4a19ec5800fc ("[NET]: Introducing socket mark socket option.") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11net: gro: fix misuse of CB in udp socket lookupRichard Gobert1-0/+43
[ Upstream commit 7938cd15436873f649f31cb867bac2d88ca564d0 ] This patch fixes a misuse of IP{6}CB(skb) in GRO, while calling to `udp6_lib_lookup2` when handling udp tunnels. `udp6_lib_lookup2` fetch the device from CB. The fix changes it to fetch the device from `skb->dev`. l3mdev case requires special attention since it has a master and a slave device. Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket") Reported-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11net: move gso declarations and functions to their own filesEric Dumazet5-96/+112
[ Upstream commit d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08 ] Move declarations into include/net/gso.h and code into net/core/gso.c Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fomichev <sdf@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 7938cd154368 ("net: gro: fix misuse of CB in udp socket lookup") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11word-at-a-time: use the same return type for has_zero regardless of endiannessndesaulniers@google.com1-1/+1
[ Upstream commit 79e8328e5acbe691bbde029a52c89d70dcbc22f3 ] Compiling big-endian targets with Clang produces the diagnostic: fs/namei.c:2173:13: warning: use of bitwise '|' with boolean operands [-Wbitwise-instead-of-logical] } while (!(has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants))); ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ || fs/namei.c:2173:13: note: cast one or both operands to int to silence this warning It appears that when has_zero was introduced, two definitions were produced with different signatures (in particular different return types). Looking at the usage in hash_name() in fs/namei.c, I suspect that has_zero() is meant to be invoked twice per while loop iteration; using logical-or would not update `bdata` when `a` did not have zeros. So I think it's preferred to always return an unsigned long rather than a bool than update the while loop in hash_name() to use a logical-or rather than bitwise-or. [ Also changed powerpc version to do the same - Linus ] Link: https://github.com/ClangBuiltLinux/linux/issues/1832 Link: https://lore.kernel.org/lkml/20230801-bitwise-v1-1-799bec468dc4@google.com/ Fixes: 36126f8f2ed8 ("word-at-a-time: make the interfaces truly generic") Debugged-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-08x86/srso: Add a Speculative RAS Overflow mitigationBorislav Petkov (AMD)1-0/+2
Upstream commit: fb3bd914b3ec28f5fb697ac55c4846ac2d542855 Add a mitigation for the speculative return address stack overflow vulnerability found on AMD processors. The mitigation works by ensuring all RET instructions speculate to a controlled location, similar to how speculation is controlled in the retpoline sequence. To accomplish this, the __x86_return_thunk forces the CPU to mispredict every function return using a 'safe return' sequence. To ensure the safety of this mitigation, the kernel must ensure that the safe return sequence is itself free from attacker interference. In Zen3 and Zen4, this is accomplished by creating a BTB alias between the untraining function srso_untrain_ret_alias() and the safe return function srso_safe_ret_alias() which results in evicting a potentially poisoned BTB entry and using that safe one for all function returns. In older Zen1 and Zen2, this is accomplished using a reinterpretation technique similar to Retbleed one: srso_untrain_ret() and srso_safe_ret(). Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08init: Remove check_bugs() leftoversThomas Gleixner1-11/+0
commit 61235b24b9cb37c13fcad5b9596d59a1afdcec30 upstream Everything is converted over to arch_cpu_finalize_init(). Remove the check_bugs() leftovers including the empty stubs in asm-generic, alpha, parisc, powerpc and xtensa. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Link: https://lore.kernel.org/r/20230613224545.553215951@linutronix.de Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08init: Provide arch_cpu_finalize_init()Thomas Gleixner1-0/+6
commit 7725acaa4f0c04fbefb0e0d342635b967bb7d414 upstream check_bugs() has become a dumping ground for all sorts of activities to finalize the CPU initialization before running the rest of the init code. Most are empty, a few do actual bug checks, some do alternative patching and some cobble a CPU advertisement string together.... Aside of that the current implementation requires duplicated function declaration and mostly empty header files for them. Provide a new function arch_cpu_finalize_init(). Provide a generic declaration if CONFIG_ARCH_HAS_CPU_FINALIZE_INIT is selected and a stub inline otherwise. This requires a temporary #ifdef in start_kernel() which will be removed along with check_bugs() once the architectures are converted over. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230613224544.957805717@linutronix.de Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-03dma-buf: keep the signaling time of merged fences v3Christian König1-1/+1
commit f781f661e8c99b0cb34129f2e374234d61864e77 upstream. Some Android CTS is testing if the signaling time keeps consistent during merges. v2: use the current time if the fence is still in the signaling path and the timestamp not yet available. v3: improve comment, fix one more case to use the correct timestamp Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230630120041.109216-1-christian.koenig@amd.com Cc: Jindong Yue <jindong.yue@nxp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-03mm: fix memory ordering for mm_lock_seq and vm_lock_seqJann Horn3-8/+59
commit b1f02b95758d05b799731d939e76a0bd6da312db upstream. mm->mm_lock_seq effectively functions as a read/write lock; therefore it must be used with acquire/release semantics. A specific example is the interaction between userfaultfd_register() and lock_vma_under_rcu(). userfaultfd_register() does the following from the point where it changes a VMA's flags to the point where concurrent readers are permitted again (in a simple scenario where only a single private VMA is accessed and no merging/splitting is involved): userfaultfd_register userfaultfd_set_vm_flags vm_flags_reset vma_start_write down_write(&vma->vm_lock->lock) vma->vm_lock_seq = mm_lock_seq [marks VMA as busy] up_write(&vma->vm_lock->lock) vm_flags_init [sets VM_UFFD_* in __vm_flags] vma->vm_userfaultfd_ctx.ctx = ctx mmap_write_unlock vma_end_write_all WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1) [unlocks VMA] There are no memory barriers in between the __vm_flags update and the mm->mm_lock_seq update that unlocks the VMA, so the unlock can be reordered to above the `vm_flags_init()` call, which means from the perspective of a concurrent reader, a VMA can be marked as a userfaultfd VMA while it is not VMA-locked. That's bad, we definitely need a store-release for the unlock operation. The non-atomic write to vma->vm_lock_seq in vma_start_write() is mostly fine because all accesses to vma->vm_lock_seq that matter are always protected by the VMA lock. There is a racy read in vma_start_read() though that can tolerate false-positives, so we should be using WRITE_ONCE() to keep things tidy and data-race-free (including for KCSAN). On the other side, lock_vma_under_rcu() works as follows in the relevant region for locking and userfaultfd check: lock_vma_under_rcu vma_start_read vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [early bailout] down_read_trylock(&vma->vm_lock->lock) vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [main check] userfaultfd_armed checks vma->vm_flags & __VM_UFFD_FLAGS Here, the interesting aspect is how far down the mm->mm_lock_seq read can be reordered - if this read is reordered down below the vma->vm_flags access, this could cause lock_vma_under_rcu() to partly operate on information that was read while the VMA was supposed to be locked. To prevent this kind of downwards bleeding of the mm->mm_lock_seq read, we need to read it with a load-acquire. Some of the comment wording is based on suggestions by Suren. BACKPORT WARNING: One of the functions changed by this patch (which I've written against Linus' tree) is vma_try_start_write(), but this function no longer exists in mm/mm-everything. I don't know whether the merged version of this patch will be ordered before or after the patch that removes vma_try_start_write(). If you're backporting this patch to a tree with vma_try_start_write(), make sure this patch changes that function. Link: https://lkml.kernel.org/r/20230721225107.942336-1-jannh@google.com Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-03block: Fix a source code comment in include/uapi/linux/blkzoned.hBart Van Assche1-5/+5
[ Upstream commit e0933b526fbfd937c4a8f4e35fcdd49f0e22d411 ] Fix the symbolic names for zone conditions in the blkzoned.h header file. Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <dlemoal@kernel.org> Fixes: 6a0cb1bc106f ("block: Implement support for zoned block devices") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230706201422.3987341-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03tcp: Reduce chance of collisions in inet6_hashfn().Stewart Smith1-6/+2
[ Upstream commit d11b0df7ddf1831f3e170972f43186dad520bfcc ] For both IPv4 and IPv6 incoming TCP connections are tracked in a hash table with a hash over the source & destination addresses and ports. However, the IPv6 hash is insufficient and can lead to a high rate of collisions. The IPv6 hash used an XOR to fit everything into the 96 bits for the fast jenkins hash, meaning it is possible for an external entity to ensure the hash collides, thus falling back to a linear search in the bucket, which is slow. We take the approach of hash the full length of IPv6 address in __ipv6_addr_jhash() so that all users can benefit from a more secure version. While this may look like it adds overhead, the reality of modern CPUs means that this is unmeasurable in real world scenarios. In simulating with llvm-mca, the increase in cycles for the hashing code was ~16 cycles on Skylake (from a base of ~155), and an extra ~9 on Nehalem (base of ~173). In commit dd6d2910c5e0 ("netfilter: conntrack: switch to siphash") netfilter switched from a jenkins hash to a siphash, but even the faster hsiphash is a more significant overhead (~20-30%) in some preliminary testing. So, in this patch, we keep to the more conservative approach to ensure we don't add much overhead per SYN. In testing, this results in a consistently even spread across the connection buckets. In both testing and real-world scenarios, we have not found any measurable performance impact. Fixes: 08dcdbf6a7b9 ("ipv6: use a stronger hash for tcp") Signed-off-by: Stewart Smith <trawets@amazon.com> Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230721222410.17914-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03vxlan: calculate correct header length for GPEJiri Benc1-4/+9
[ Upstream commit 94d166c5318c6edd1e079df8552233443e909c33 ] VXLAN-GPE does not add an extra inner Ethernet header. Take that into account when calculating header length. This causes problems in skb_tunnel_check_pmtu, where incorrect PMTU is cached. In the collect_md mode (which is the only mode that VXLAN-GPE supports), there's no magic auto-setting of the tunnel interface MTU. It can't be, since the destination and thus the underlying interface may be different for each packet. So, the administrator is responsible for setting the correct tunnel interface MTU. Apparently, the administrators are capable enough to calculate that the maximum MTU for VXLAN-GPE is (their_lower_MTU - 36). They set the tunnel interface MTU to 1464. If you run a TCP stream over such interface, it's then segmented according to the MTU 1464, i.e. producing 1514 bytes frames. Which is okay, this still fits the lower MTU. However, skb_tunnel_check_pmtu (called from vxlan_xmit_one) uses 50 as the header size and thus incorrectly calculates the frame size to be 1528. This leads to ICMP too big message being generated (locally), PMTU of 1450 to be cached and the TCP stream to be resegmented. The fix is to use the correct actual header size, especially for skb_tunnel_check_pmtu calculation. Fixes: e1e5314de08ba ("vxlan: implement GPE") Signed-off-by: Jiri Benc <jbenc@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>