summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-04-03page_pool: check for PP direct cache locality laterAlexander Lobakin5-59/+58
Since we have pool->p.napi (Jakub) and pool->cpuid (Lorenzo) to check whether it's safe to use direct recycling, we can use both globally for each page instead of relying solely on @allow_direct argument. Let's assume that @allow_direct means "I'm sure it's local, don't waste time rechecking this" and when it's false, try the mentioned params to still recycle the page directly. If neither is true, we'll lose some CPU cycles, but then it surely won't be hotpath. On the other hand, paths where it's possible to use direct cache, but not possible to safely set @allow_direct, will benefit from this move. The whole propagation of @napi_safe through a dozen of skb freeing functions can now go away, which saves us some stack space. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20240329165507.3240110-2-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03rhashtable: Improve grammarJonathan Neuschäfer1-5/+5
Change "a" to "an" according to the usual rules, fix an "if" that was mistyped as "in", improve grammar in "considerable slow" -> "considerably slower". Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240329-misc-rhashtable-v1-1-5862383ff798@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03tools: ynl: add ynl_dump_empty() helperJakub Kicinski2-0/+14
Checking if dump is empty requires a couple of casts. Add a convenient wrapper. Add an example use in the netdev sample, loopback is always present so an empty dump is an error. Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20240329181651.319326-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03nfp: Avoid -Wflex-array-member-not-at-end warningsGustavo A. R. Silva1-16/+25
-Wflex-array-member-not-at-end is coming in GCC-14, and we are getting ready to enable it globally. There is currently an object (`tl`), at the beginning of multiple structures, that contains a flexible structure (`struct nfp_dump_tl`), for example: struct nfp_dumpspec_csr { struct nfp_dump_tl tl; ... __be32 register_width; /* in bits */ }; So, in order to avoid ending up with flexible-array members in the middle of multiple other structs, we use the `struct_group_tagged()` helper to separate the flexible array from the rest of the members in the flexible structure: struct nfp_dump_tl { struct_group_tagged(nfp_dump_tl_hdr, hdr, ... the rest of members ); char data[]; }; With the change described above, we now declare objects of the type of the tagged struct, in this case `struct nfp_dump_tl_hdr`, without embedding flexible arrays in the middle of another struct: struct nfp_dumpspec_csr { struct nfp_dump_tl_hdr tl; ... __be32 register_width; /* in bits */ }; Also, use `container_of()` whenever we need to retrieve a pointer to the flexible structure, through which we can access the flexible array if needed. So, with these changes, fix 33 of the following warnings: drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c:58:28: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c:64:28: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c:70:28: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c:78:28: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c:87:28: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c:92:28: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Link: https://github.com/KSPP/linux/issues/202 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/ZgYWlkxdrrieDYIu@neat Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03net: phy: aquantia: add support for AQR114C PHY IDPaweł Owoc1-0/+21
Add support for AQR114C PHY ID. This PHY advertise 10G speed: SPEED(0x04): 0x6031 capabilities: -400g +5g +2.5g -200g -25g -10g-xr -100g -40g -10g/1g -10 +100 +1000 -10-ts -2-tl +10g EXTABLE(0x0B): 0x40fc capabilities: -10g-cx4 -10g-lrm +10g-t +10g-kx4 +10g-kr +1000-t +1000-kx +100-tx -10-t -p2mp -40g/100g -1000/100-t1 -25g -200g/400g +2.5g/5g -1000-h but supports only up to 5G speed (as with AQR111/111B0). AQR111 init config is used to set max speed 5G. Signed-off-by: Paweł Owoc <frut3k7@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240401145114.1699451-1-frut3k7@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02Merge tag 'ath-next-20240402' of ↵Kalle Valo34-218/+660
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath ath.git patches for v6.10 ath drivers now have no remaining sparse warnings, otherwise smaller fixes and some refactoring. ath11k * P2P support for QCA6390, WCN6855 and QCA2066
2024-04-02ipv6: remove RTNL protection from inet6_dump_fib()Eric Dumazet1-25/+26
No longer hold RTNL while calling inet6_dump_fib(). Also change return value for a completed dump, so that NLMSG_DONE can be appended to current skb, saving one recvmsg() system call. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240329183053.644630-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02Merge branch 'genetlink-remove-linux-genetlink-h'Jakub Kicinski11-27/+26
Jakub Kicinski says: ==================== genetlink: remove linux/genetlink.h There are two genetlink headers net/genetlink.h and linux/genetlink.h This is similar to netlink.h, but for netlink.h both contain good amount of code. For genetlink.h the linux/ version is leftover from before uAPI headers were split out, it has 10 lines of code. Move those 10 lines into other appropriate headers and delete linux/genetlink.h. I occasionally open the wrong header in the editor when coding, I guess I'm not the only one. v2: https://lore.kernel.org/all/20240325173716.2390605-1-kuba@kernel.org/ v1: https://lore.kernel.org/all/20240309183458.3014713-1-kuba@kernel.org ==================== Link: https://lore.kernel.org/r/20240329175710.291749-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02genetlink: remove linux/genetlink.hJakub Kicinski7-20/+12
genetlink.h is a shell of what used to be a combined uAPI and kernel header over a decade ago. It has fewer than 10 lines of code. Merge it into net/genetlink.h. In some ways it'd be better to keep the combined header under linux/ but it would make looking through git history harder. Acked-by: Sven Eckelmann <sven@narfation.org> Link: https://lore.kernel.org/r/20240329175710.291749-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02net: openvswitch: remove unnecessary linux/genetlink.h includeJakub Kicinski1-1/+0
The only legit reason I could think of for net/genetlink.h and linux/genetlink.h to be separate would be if one was included by other headers and we wanted to keep it lightweight. That is not the case, net/openvswitch/meter.h includes linux/genetlink.h but for no apparent reason (for struct genl_family perhaps? it's not necessary, types of externs do not need to be known). Link: https://lore.kernel.org/r/20240329175710.291749-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02netlink: create a new header for internal genetlink symbolsJakub Kicinski4-6/+14
There are things in linux/genetlink.h which are only used under net/netlink/. Move them to a new local header. A new header with just 2 externs isn't great, but alternative would be to include af_netlink.h in genetlink.c which feels even worse. Link: https://lore.kernel.org/r/20240329175710.291749-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02Merge branch '1GbE' of ↵Jakub Kicinski13-649/+582
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2024-03-29 (net: intel) This series contains updates to most Intel drivers. Jesse moves declaration of pci_driver struct to remove need for forward declarations in igb and converts Intel drivers to user newer power management ops. Sasha reworks power management flow on igc to avoid using rtnl_lock() during those flows. Maciej reorganizes i40e_nvm file to avoid forward declarations. * '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: i40e: avoid forward declarations in i40e_nvm.c igc: Refactor runtime power management flow net: intel: implement modern PM ops declarations igb: simplify pci ops declaration ==================== Link: https://lore.kernel.org/r/20240329175632.211340-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02tcp/dccp: do not care about families in inet_twsk_purge()Eric Dumazet8-25/+10
We lost ability to unload ipv6 module a long time ago. Instead of calling expensive inet_twsk_purge() twice, we can handle all families in one round. Also remove an extra line added in my prior patch, per Kuniyuki Iwashima feedback. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/20240327192934.6843-1-kuniyu@amazon.com/ Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240329153203.345203-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02inet: preserve const qualifier in inet_csk()Eric Dumazet7-11/+8
We can change inet_csk() to propagate its argument const qualifier, thanks to container_of_const(). We have to fix few places that had mistakes, like tcp_bound_rto(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240329144931.295800-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02Merge branch 'doc-netlink-add-hyperlinks-to-generated-docs'Jakub Kicinski2-19/+94
Donald Hunter says: ==================== doc: netlink: Add hyperlinks to generated docs Extend ynl-gen-rst to generate hyperlinks to definitions, attribute sets and sub-messages from all the places that reference them. ==================== Link: https://lore.kernel.org/r/20240329135021.52534-1-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02doc: netlink: Update tc spec with missing definitionsDonald Hunter1-0/+51
The tc spec referenced tc-u32-mark and tc-act-police-attrs but did not define them. The missing definitions were discovered when building the docs with generated hyperlinks because the hyperlink target labels were missing. Add definitions for tc-u32-mark and tc-act-police-attrs. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240329135021.52534-4-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02doc: netlink: Add hyperlinks to generated Netlink docsDonald Hunter1-18/+42
Update ynl-gen-rst to generate hyperlinks to definitions, attribute sets and sub-messages from all the places that reference them. Note that there is a single label namespace for all of the kernel docs. Hyperlinks within a single netlink doc need to be qualified by the family name to avoid collisions. The label format is 'family-type-name' which gives, for example, 'rt-link-attribute-set-link-attrs' as the link id. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240329135021.52534-3-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02doc: netlink: Change generated docs to limit TOC to depth 3Donald Hunter1-1/+1
The tables of contents in the generated Netlink docs include individual attribute definitions. This can make the contents exceedingly long and repeats a lot of what is on the rest of the pages. See for example: https://docs.kernel.org/networking/netlink_spec/tc.html Add a depth limit to the contents directive in generated .rst files to limit the contents depth to 3 levels. This reduces the contents to: - Family - Summary - Operations - op-one - op-two - ... - Definitions - struct-one - struct-two - enum-one - ... - Attribute sets - attrs-one - attrs-two - ... Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240329135021.52534-2-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-01ice: hold devlink lock for whole init/cleanupMichal Swiatkowski2-20/+19
Simplify devlink lock code in driver by taking it for whole init/cleanup path. Instead of calling devlink functions that taking lock call the lockless versions. Suggested-by: Jiri Pirko <jiri@resnulli.us> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-01ice: move devlink port code to a separate filePiotr Raczynski7-424/+445
Keep devlink related code in a separate file. More devlink port code and handlers will be added here for other port operations. Remove no longer needed include of our devlink.h in ice_lib.c. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Piotr Raczynski <piotr.raczynski@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-01ice: move ice_devlink.[ch] to devlink folderMichal Swiatkowski8-7/+8
Only moving whole files, fixing Makefile and bunch of includes. Some changes to ice_devlink file was done even in representor part (Tx topology), so keep it as final patch to not mess up with rebasing. After moving to devlink folder there is no need to have such long name for these files. Rename them to simple devlink. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-01ice: Remove newlines in NL_SET_ERR_MSG_MODThorsten Blum1-3/+3
Fixes Coccinelle/coccicheck warnings reported by newline_in_nl_msg.cocci. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-01ice: Add switch recipe reusing featureSteven Zou5-17/+177
New E810 firmware supports the corresponding functionality, so the driver allows PFs to subscribe the same switch recipes. Then when the PF is done with a switch recipes, the PF can ask firmware to free that switch recipe. When users configure a rule to PFn into E810 switch component, if there is no existing recipe matching this rule's pattern, the driver will request firmware to allocate and return a new recipe resource for the rule by calling ice_add_sw_recipe() and ice_alloc_recipe(). If there is an existing recipe matching this rule's pattern with different key value, or this is a same second rule to PFm into switch component, the driver checks out this recipe by calling ice_find_recp(), the driver will tell firmware to share using this same recipe resource by calling ice_subscribable_recp_shared() and ice_subscribe_recipe(). When firmware detects that all subscribing PFs have freed the switch recipe, firmware will free the switch recipe so that it can be reused. This feature also fixes a problem where all switch recipes would eventually be exhausted because switch recipes could not be freed, as freeing a shared recipe could potentially break other PFs that were using it. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Andrii Staikov <andrii.staikov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steven Zou <steven.zou@intel.com> Tested-by: Mayank Sharma <mayank.sharma@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-01ice: fold ice_ptp_read_time into ice_ptp_gettimex64Michal Schmidt1-22/+3
This is a cleanup. It is unnecessary to have this function just to call another function. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Sai Krishna <saikrishnag@marvell.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-01ice: avoid the PTP hardware semaphore in gettimex64 pathMichal Schmidt4-7/+12
The PTP hardware semaphore (PFTSYN_SEM) is used to synchronize operations that program the PTP timers. The operations involve issuing commands to the sideband queue. The E810 does not have a hardware sideband queue, so the admin queue is used. The admin queue is slow. I have observed delays in hundreds of milliseconds waiting for ice_sq_done. When phc2sys reads the time from the ice PTP clock and PFTSYN_SEM is held by a task performing one of the slow operations, ice_ptp_lock can easily time out. phc2sys gets -EBUSY and the kernel prints: ice 0000:XX:YY.0: PTP failed to get time These messages appear once every few seconds, causing log spam. The E810 datasheet recommends an algorithm for reading the upper 64 bits of the GLTSYN_TIME register. It matches what's implemented in ice_ptp_read_src_clk_reg. It is robust against wrap-around, but not necessarily against the concurrent setting of the register (with GLTSYN_CMD_{INIT,ADJ}_TIME commands). Perhaps that's why ice_ptp_gettimex64 also takes PFTSYN_SEM. The race with time setters can be prevented without relying on the PTP hardware semaphore. Using the "ice_adapter" from the previous patch, we can have a common spinlock for the PFs that share the clock hardware. It will protect the reading and writing to the GLTSYN_TIME register. The writing is performed indirectly, by the hardware, as a result of the driver writing GLTSYN_CMD_SYNC in ice_ptp_exec_tmr_cmd. I wasn't sure if the ice_flush there is enough to make sure GLTSYN_TIME has been updated, but it works well in my testing. My test code can be seen here: https://gitlab.com/mschmidt2/linux/-/commits/ice-ptp-host-side-lock-10 It consists of: - kernel threads reading the time in a busy loop and looking at the deltas between consecutive values, reporting new maxima. - a shell script that sets the time repeatedly; - a bpftrace probe to produce a histogram of the measured deltas. Without the spinlock ptp_gltsyn_time_lock, it is easy to see tearing. Deltas in the [2G, 4G) range appear in the histograms. With the spinlock added, there is no tearing and the biggest delta I saw was in the range [1M, 2M), that is under 2 ms. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-01ice: add ice_adapter for shared data across PFs on the same NICMichal Schmidt5-1/+148
There is a need for synchronization between ice PFs on the same physical adapter. Add a "struct ice_adapter" for holding data shared between PFs of the same multifunction PCI device. The struct is refcounted - each ice_pf holds a reference to it. Its first use will be for PTP. I expect it will be useful also to improve the ugliness that is ice_prot_id_tbl. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-01Merge branch 'net-rps-misc'David S. Miller5-70/+95
Eric Dumazet says: ==================== net: rps: misc changes Make RPS/RFS a bit more efficient with better cache locality and heuristics. Aso shrink include/linux/netdevice.h a bit. v2: fixed a build issue in patch 6/8 with CONFIG_RPS=n (Jakub and kernel build bots) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: rps: move received_rps field to a better locationEric Dumazet1-1/+1
Commit 14d898f3c1b3 ("dev: Move received_rps counter next to RPS members in softnet data") was unfortunate: received_rps is dirtied by a cpu and never read by other cpus in fast path. Its presence in the hot RPS cache line (shared by many cpus) is hurting RPS/RFS performance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: rps: add rps_input_queue_head_add() helperEric Dumazet2-7/+15
process_backlog() can batch increments of sd->input_queue_head, saving some memory bandwidth. Also add READ_ONCE()/WRITE_ONCE() annotations around sd->input_queue_head accesses. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: rps: change input_queue_tail_incr_save()Eric Dumazet3-23/+35
input_queue_tail_incr_save() is incrementing the sd queue_tail and save it in the flow last_qtail. Two issues here : - no lock protects the write on last_qtail, we should use appropriate annotations. - We can perform this write after releasing the per-cpu backlog lock, to decrease this lock hold duration (move away the cache line miss) Also move input_queue_head_incr() and rps helpers to include/net/rps.h, while adding rps_ prefix to better reflect their role. v2: Fixed a build issue (Jakub and kernel build bots) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: enqueue_to_backlog() cleanupEric Dumazet1-13/+11
We can remove a goto and a label by reversing a condition. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: make softnet_data.dropped an atomic_tEric Dumazet3-6/+13
If under extreme cpu backlog pressure enqueue_to_backlog() has to drop a packet, it could do this without dirtying a cache line and potentially slowing down the target cpu. Move sd->dropped into a separate cache line, and make it atomic. In non pressure mode, this field is not touched, no need to consume valuable space in a hot cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: enqueue_to_backlog() change vs not running deviceEric Dumazet1-4/+5
If the device attached to the packet given to enqueue_to_backlog() is not running, we drop the packet. But we accidentally increase sd->dropped, giving false signals to admins: sd->dropped should be reserved to cpu backlog pressure, not to temporary glitches at device dismantles. While we are at it, perform the netif_running() test before we get the rps lock, and use REASON_DEV_READY drop reason instead of NOT_SPECIFIED. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: move dev_xmit_recursion() helpers to net/core/dev.hEric Dumazet2-17/+17
Move dev_xmit_recursion() and friends to net/core/dev.h They are only used from net/core/dev.c and net/core/filter.c. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: move kick_defer_list_purge() to net/core/dev.hEric Dumazet2-4/+3
kick_defer_list_purge() is defined in net/core/dev.c and used from net/core/skubff.c Because we need softnet_data, include <linux/netdevice.h> from net/core/dev.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01Merge branch 'ice-pfcp-filter'David S. Miller78-624/+1988
Alexander Lobakin says: ==================== ice: add PFCP filter support Add support for creating PFCP filters in switchdev mode. Add pfcp module that allows to create a PFCP-type netdev. The netdev then can be passed to tc when creating a filter to indicate that PFCP filter should be created. To add a PFCP filter, a special netdev must be created and passed to tc command: ip link add pfcp0 type pfcp tc filter add dev eth0 ingress prio 1 flower pfcp_opts \ 1:12ab/ff:fffffffffffffff0 skip_hw action mirred egress redirect \ dev pfcp0 Changes in iproute2 [1] are required to use pfcp_opts in tc. ICE COMMS package is required as it contains PFCP profiles. Part of this patchset modifies IP_TUNNEL_*_OPTs, which were previously stored in a __be16. All possible values have already been used, making it impossible to add new ones. * 1-3: add new bitmap_{read,write}(), which is used later in the IP tunnel flags code (from Alexander's ARM64 MTE series[2]); * 4-14: some bitmap code preparations also used later in IP tunnels; * 15-17: convert IP tunnel flags from __be16 to a bitmap; * 18-21: add PFCP module and support for it in ice. [1] https://lore.kernel.org/netdev/20230614091758.11180-1-marcin.szycik@linux.intel.com [2] https://lore.kernel.org/linux-kernel/20231218124033.551770-1-glider@google.com ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01ice: Add support for PFCP hardware offload in switchdevMarcin Szycik7-7/+169
Add support for creating PFCP filters in switchdev mode. Add support for parsing PFCP-specific tc options: S flag and SEID. To create a PFCP filter, a special netdev must be created and passed to tc command: ip link add pfcp0 type pfcp tc filter add dev eth0 ingress prio 1 flower pfcp_opts \ 1:123/ff:fffffffffffffff0 skip_hw action mirred egress redirect \ dev pfcp0 Changes in iproute2 [1] are required to be able to use pfcp_opts in tc. ICE COMMS package is required to create a filter as it contains PFCP profiles. Link: https://lore.kernel.org/netdev/20230614091758.11180-1-marcin.szycik@linux.intel.com [1] Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01ice: refactor ICE_TC_FLWR_FIELD_ENC_OPTSMarcin Szycik2-6/+6
FLOW_DISSECTOR_KEY_ENC_OPTS can be used for multiple headers, but currently it is treated as GTP-exclusive in ice. Rename ICE_TC_FLWR_FIELD_ENC_OPTS to ICE_TC_FLWR_FIELD_GTP_OPTS and check for tunnel type earlier. After this refactor, it is easier to add new headers using FLOW_DISSECTOR_KEY_ENC_OPTS - instead of checking tunnel type in ice_tc_count_lkups() and ice_tc_fill_tunnel_outer(), it needs to be checked only once, in ice_parse_tunnel_attr(). Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01pfcp: always set pfcp metadataMichal Swiatkowski7-6/+282
In PFCP receive path set metadata needed by flower code to do correct classification based on this metadata. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01pfcp: add PFCP moduleWojciech Drewek4-0/+254
Packet Forwarding Control Protocol (PFCP) is a 3GPP Protocol used between the control plane and the user plane function. It is specified in TS 29.244[1]. Note that this module is not designed to support this Protocol in the kernel space. There is no support for parsing any PFCP messages. There is no API that could be used by any userspace daemon. Basically it does not support PFCP. This protocol is sophisticated and there is no need for implementing it in the kernel. The purpose of this module is to allow users to setup software and hardware offload of PFCP packets using tc tool. When user requests to create a PFCP device, a new socket is created. The socket is set up with port number 8805 which is specific for PFCP [29.244 4.2.2]. This allow to receive PFCP request messages, response messages use other ports. Note that only one PFCP netdev can be created. Only IPv4 is supported at this time. [1] https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=3111 Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: net_test: add tests for IP tunnel flags conversion helpersAlexander Lobakin2-9/+125
Now that there are helpers for converting IP tunnel flags between the old __be16 format and the bitmap format, make sure they work as expected by adding a couple of tests to the networking testing suite. The helpers are all inline, so no dependencies on the related CONFIG_* (or a standalone module) are needed. Cover three possible cases: 1. No bits past BIT(15) are set, VTI/SIT bits are not set. This conversion is almost a direct assignment. 2. No bits past BIT(15) are set, but VTI/SIT bit is set. During the conversion, it must be transformed into BIT(16) in the bitmap, but still compatible with the __be16 format. 3. The bitmap has bits past BIT(15) set (not the VTI/SIT one). The result will be truncated. Note that currently __IP_TUNNEL_FLAG_NUM is 17 (incl. special), which means that the result of this case is currently semi-false-positive. When BIT(17) is finally here, it will be adjusted accordingly. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01ip_tunnel: convert __be16 tunnel flags to bitmapsAlexander Lobakin40-415/+715
Historically, tunnel flags like TUNNEL_CSUM or TUNNEL_ERSPAN_OPT have been defined as __be16. Now all of those 16 bits are occupied and there's no more free space for new flags. It can't be simply switched to a bigger container with no adjustments to the values, since it's an explicit Endian storage, and on LE systems (__be16)0x0001 equals to (__be64)0x0001000000000000. We could probably define new 64-bit flags depending on the Endianness, i.e. (__be64)0x0001 on BE and (__be64)0x00010000... on LE, but that would introduce an Endianness dependency and spawn a ton of Sparse warnings. To mitigate them, all of those places which were adjusted with this change would be touched anyway, so why not define stuff properly if there's no choice. Define IP_TUNNEL_*_BIT counterparts as a bit number instead of the value already coded and a fistful of <16 <-> bitmap> converters and helpers. The two flags which have a different bit position are SIT_ISATAP_BIT and VTI_ISVTI_BIT, as they were defined not as __cpu_to_be16(), but as (__force __be16), i.e. had different positions on LE and BE. Now they both have strongly defined places. Change all __be16 fields which were used to store those flags, to IP_TUNNEL_DECLARE_FLAGS() -> DECLARE_BITMAP(__IP_TUNNEL_FLAG_NUM) -> unsigned long[1] for now, and replace all TUNNEL_* occurrences to their bitmap counterparts. Use the converters in the places which talk to the userspace, hardware (NFP) or other hosts (GRE header). The rest must explicitly use the new flags only. This must be done at once, otherwise there will be too many conversions throughout the code in the intermediate commits. Finally, disable the old __be16 flags for use in the kernel code (except for the two 'irregular' flags mentioned above), to prevent any accidental (mis)use of them. For the userspace, nothing is changed, only additions were made. Most noticeable bloat-o-meter difference (.text): vmlinux: 307/-1 (306) gre.ko: 62/0 (62) ip_gre.ko: 941/-217 (724) [*] ip_tunnel.ko: 390/-900 (-510) [**] ip_vti.ko: 138/0 (138) ip6_gre.ko: 534/-18 (516) [*] ip6_tunnel.ko: 118/-10 (108) [*] gre_flags_to_tnl_flags() grew, but still is inlined [**] ip_tunnel_find() got uninlined, hence such decrease The average code size increase in non-extreme case is 100-200 bytes per module, mostly due to sizeof(long) > sizeof(__be16), as %__IP_TUNNEL_FLAG_NUM is less than %BITS_PER_LONG and the compilers are able to expand the majority of bitmap_*() calls here into direct operations on scalars. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01ip_tunnel: use a separate struct to store tunnel params in the kernelAlexander Lobakin13-75/+137
Unlike IPv6 tunnels which use purely-kernel __ip6_tnl_parm structure to store params inside the kernel, IPv4 tunnel code uses the same ip_tunnel_parm which is being used to talk with the userspace. This makes it difficult to alter or add any fields or use a different format for whatever data. Define struct ip_tunnel_parm_kern, a 1:1 copy of ip_tunnel_parm for now, and use it throughout the code. Define the pieces, where the copy user <-> kernel happens, as standalone functions, and copy the data there field-by-field, so that the kernel-side structure could be easily modified later on and the users wouldn't have to care about this. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01lib/bitmap: add compile-time test for __assign_bit() optimizationAlexander Lobakin1-8/+10
Commit dc34d5036692 ("lib: test_bitmap: add compile-time optimization/evaluations assertions") initially missed __assign_bit(), which led to that quite a time passed before I realized it doesn't get optimized at compilation time. Now that it does, add test for that just to make sure nothing will break one day. To make things more interesting, use bitmap_complement() and bitmap_full(), thus checking their compile-time evaluation as well. And remove the misleading comment mentioning the workaround removed recently in favor of adding the whole file to GCov exceptions. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01bitmap: make bitmap_{get,set}_value8() use bitmap_{read,write}()Alexander Lobakin1-33/+5
Now that we have generic bitmap_read() and bitmap_write(), which are inline and try to take care of non-bound-crossing and aligned cases to keep them optimized, collapse bitmap_{get,set}_value8() into simple wrappers around the former ones. bloat-o-meter shows no difference in vmlinux and -2 bytes for gpio-pca953x.ko, which says the optimization didn't suffer due to that change. The converted helpers have the value width embedded and always compile-time constant and that helps a lot. Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01bitmap: introduce generic optimized bitmap_size()Alexander Lobakin6-15/+11
The number of times yet another open coded `BITS_TO_LONGS(nbits) * sizeof(long)` can be spotted is huge. Some generic helper is long overdue. Add one, bitmap_size(), but with one detail. BITS_TO_LONGS() uses DIV_ROUND_UP(). The latter works well when both divident and divisor are compile-time constants or when the divisor is not a pow-of-2. When it is however, the compilers sometimes tend to generate suboptimal code (GCC 13): 48 83 c0 3f add $0x3f,%rax 48 c1 e8 06 shr $0x6,%rax 48 8d 14 c5 00 00 00 00 lea 0x0(,%rax,8),%rdx %BITS_PER_LONG is always a pow-2 (either 32 or 64), but GCC still does full division of `nbits + 63` by it and then multiplication by 8. Instead of BITS_TO_LONGS(), use ALIGN() and then divide by 8. GCC: 8d 50 3f lea 0x3f(%rax),%edx c1 ea 03 shr $0x3,%edx 81 e2 f8 ff ff 1f and $0x1ffffff8,%edx Now it shifts `nbits + 63` by 3 positions (IOW performs fast division by 8) and then masks bits[2:0]. bloat-o-meter: add/remove: 0/0 grow/shrink: 20/133 up/down: 156/-773 (-617) Clang does it better and generates the same code before/after starting from -O1, except that with the ALIGN() approach it uses %edx and thus still saves some bytes: add/remove: 0/0 grow/shrink: 9/133 up/down: 18/-538 (-520) Note that we can't expand DIV_ROUND_UP() by adding a check and using this approach there, as it's used in array declarations where expressions are not allowed. Add this helper to tools/ as well. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01tools: move alignment-related macros to new <linux/align.h>Alexander Lobakin3-5/+14
Currently, tools have *ALIGN*() macros scattered across the unrelated headers, as there are only 3 of them and they were added separately each time on an as-needed basis. Anyway, let's make it more consistent with the kernel headers and allow using those macros outside of the mentioned headers. Create <linux/align.h> inside the tools/ folder and include it where needed. Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01btrfs: rename bitmap_set_bits() -> btrfs_bitmap_set_bits()Alexander Lobakin1-4/+4
bitmap_set_bits() does not start with the FS' prefix and may collide with a new generic helper one day. It operates with the FS-specific types, so there's no change those two could do the same thing. Just add the prefix to exclude such possible conflict. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01fs/ntfs3: add prefix to bitmap_size() and use BITS_TO_U64()Alexander Lobakin5-11/+12
bitmap_size() is a pretty generic name and one may want to use it for a generic bitmap API function. At the same time, its logic is NTFS-specific, as it aligns to the sizeof(u64), not the sizeof(long) (although it uses ideologically right ALIGN() instead of division). Add the prefix 'ntfs3_' used for that FS (not just 'ntfs_' to not mix it with the legacy module) and use generic BITS_TO_U64() while at it. Suggested-by: Yury Norov <yury.norov@gmail.com> # BITS_TO_U64() Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01s390/cio: rename bitmap_size() -> idset_bitmap_size()Alexander Lobakin1-4/+6
bitmap_size() is a pretty generic name and one may want to use it for a generic bitmap API function. At the same time, its logic is not "generic", i.e. it's not just `nbits -> size of bitmap in bytes` converter as it would be expected from its name. Add the prefix 'idset_' used throughout the file where the function resides. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>