summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2023-05-24splice: Make do_splice_to() generic and export itDavid Howells1-0/+3
Rename do_splice_to() to vfs_splice_read() and export it so that it can be used as a helper when calling down to a lower layer filesystem as it performs all the necessary checks[1]. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Miklos Szeredi <miklos@szeredi.hu> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: John Hubbard <jhubbard@nvidia.com> cc: David Hildenbrand <david@redhat.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-unionfs@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/CAJfpeguGksS3sCigmRi9hJdUec8qtM9f+_9jC1rJhsXT+dV01w@mail.gmail.com/ [1] Link: https://lore.kernel.org/r/20230522135018.2742245-6-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24splice: Rename direct_splice_read() to copy_splice_read()David Howells1-3/+3
Rename direct_splice_read() to copy_splice_read() to better reflect as to what it does. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Steve French <sfrench@samba.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-4-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24ARM/musb: omap2: Remove global GPIO numbers from TUSB6010Linus Walleij1-13/+0
The TUSB6010 (MUSB) device is picking up some GPIO lines hardcoded by number and passing on to the TUSB6010 device when registering it. Instead of nasty workarounds, provide a GPIO descriptor table and then make the TUSB6010 MUSB glue driver pick up the GPIO lines directly, convert it to an IRQ and pass down to the MUSB driver. OMAP2 is the only system using the TUSB6010. Stash the GPIO descriptors in the glue layer and use then to power up and down the TUSB6010 on-demand, instead of using boardfile callbacks. Since the OMAP2 boards are the only boards using the .set_power() and .board_set_power() callbacks, we can just delete them as the power is now handled directly in the TUSB6010 glue code. Cc: Bin Liu <b-liu@ti.com> Cc: linux-usb@vger.kernel.org Fixes: 92bf78b33b0b ("gpio: omap: use dynamic allocation of base") Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-05-24ARM/gpio: Push OMAP2 quirk down into TWL4030 driverLinus Walleij1-3/+0
The TWL4030 GPIO driver has a custom platform data .set_up() callback to call back into the platform and do misc stuff such as hog and export a GPIO for WLAN PWR on a specific OMAP3 board. Avoid all the kludgery in the platform data and the boardfile and just put the quirks right into the driver. Make it conditional on OMAP3. I think the exported GPIO is used by some kind of userspace so ordinary DTS hogs will probably not work. Fixes: 92bf78b33b0b ("gpio: omap: use dynamic allocation of base") Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-05-24ARM/mmc: Convert old mmci-omap to GPIO descriptorsLinus Walleij1-2/+0
A recent change to the OMAP driver making it use a dynamic GPIO base created problems with some old OMAP1 board files, among them Nokia 770, SX1 and also the OMAP2 Nokia n8x0. Fix up all instances of GPIOs being used for the MMC driver by pushing the handling of power, slot selection and MMC "cover" into the driver as optional GPIOs. This is maybe not the most perfect solution as the MMC framework have some central handlers for some of the stuff, but it at least makes the situtation better and solves the immediate issue. Fixes: 92bf78b33b0b ("gpio: omap: use dynamic allocation of base") Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-05-24Input: ads7846 - Convert to use software nodesLinus Walleij2-4/+0
The Nokia 770 is using GPIOs from the global numberspace on the CBUS node to pass down to the LCD controller. This regresses when we let the OMAP GPIO driver use dynamic GPIO base. The Nokia 770 now has dynamic allocation of IRQ numbers, so this needs to be fixed for it to work. As this is the only user of LCD MIPID we can easily augment the driver to use a GPIO descriptor instead and resolve the issue. The platform data .shutdown() callback wasn't even used in the code, but we encode a shutdown asserting RESET in the remove() callback for completeness sake. The CBUS also has the ADS7846 touchscreen attached. Populate the devices on the Nokia 770 CBUS I2C using software nodes instead of platform data quirks. This includes the LCD and the ADS7846 touchscreen so the conversion just brings the LCD along with it as software nodes is an all-or-nothing design pattern. The ADS7846 has some limited support for using GPIO descriptors, let's convert it over completely to using device properties and then fix all remaining boardfile users to provide all platform data using software nodes. Dump the of includes and of_match_ptr() in the ADS7846 driver as part of the job. Since we have to move ADS7846 over to obtaining the GPIOs it is using exclusively from descriptors, we provide descriptor tables for the two remaining in-kernel boardfiles using ADS7846: - PXA Spitz - MIPS Alchemy DB1000 development board It was too hard for me to include software node conversion of these two remaining users at this time: the spitz is using a hscync callback in the platform data that would require further GPIO descriptor conversion of the Spitz, and moving the hsync callback down into the driver: it will just become too big of a job, but it can be done separately. The MIPS Alchemy DB1000 is simply something I cannot test, so take the easier approach of just providing some GPIO descriptors in this case as I don't want the patch to grow too intrusive. As we see that several device trees have incorrect polarity flags and just expect to bypass the gpiolib polarity handling, fix up all device trees too, in a separate patch. Suggested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Fixes: 92bf78b33b0b ("gpio: omap: use dynamic allocation of base") Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-05-24ARM/mfd/gpio: Fixup TPS65010 regression on OMAP1 OSK1Linus Walleij1-7/+4
Aaro reports problems on the OSK1 board after we altered the dynamic base for GPIO allocations. It appears this happens because the OMAP driver now allocates GPIO numbers dynamically, so all that is references by number is a bit up in the air. Let's bite the bullet and try to just move the gpio_chip in the tps65010 MFD driver over to using dynamic allocations. Alter everything in the OSK1 board file to use a GPIO descriptor table and lookups. Utilize the NULL device to define some board-specific GPIO lookups and use these to immediately look up the same GPIOs, convert to IRQ numbers and pass as resources to the devices. This is ugly but should work. The .setup() callback for tps65010 was used for some GPIO hogging, but since the OSK1 is the only user in the entire kernel we can alter the signatures to something that is helpful and make a clean transition. Fixes: 92bf78b33b0b ("gpio: omap: use dynamic allocation of base") Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: andy.shevchenko@gmail.com Cc: Andreas Kemnade <andreas@kemnade.info> Acked-by: Lee Jones <lee@kernel.org> Reviewed-by: Lee Jones <lee@kernel.org> Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-05-24spi: Merge up v6.4-rc3Mark Brown13-18/+43
Merge up v6.4-rc3 in order to get fixes to improve the stability of my CI.
2023-05-24regulator: Merge up v6.4-rc3Mark Brown13-18/+43
Merge up v6.4-rc3 in order to get fixes to improve the stability of my CI.
2023-05-24genirq: Use hlist for managing resend handlersShanker Donthineni1-0/+3
The current implementation utilizes a bitmap for managing interrupt resend handlers, which is allocated based on the SPARSE_IRQ/NR_IRQS macros. However, this method may not efficiently utilize memory during runtime, particularly when IRQ_BITMAP_BITS is large. Address this issue by using an hlist to manage interrupt resend handlers instead of relying on a static bitmap memory allocation. Additionally, a new function, clear_irq_resend(), is introduced and called from irq_shutdown to ensure a graceful teardown of the interrupt. Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230519134902.1495562-2-sdonthineni@nvidia.com
2023-05-24net: phy: avoid kernel warning dump when stopping an errored PHYRussell King (Oracle)1-2/+5
When taking a network interface down (or removing a SFP module) after the PHY has encountered an error, phy_stop() complains incorrectly that it was called from HALTED state. The reason this is incorrect is that the network driver will have called phy_start() when the interface was brought up, and the fact that the PHY has a problem bears no relationship to the administrative state of the interface. Taking the interface administratively down (which calls phy_stop()) is always the right thing to do after a successful phy_start() call, whether or not the PHY has encountered an error. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-24dmaengine: dw-edma: Add support for native HDMACai Huoqing1-1/+2
Add support for HDMA NATIVE, as long the IP design has set the compatible register map parameter-HDMA_NATIVE, which allows compatibility for native HDMA register configuration. The HDMA Hyper-DMA IP is an enhancement of the eDMA embedded-DMA IP. And the native HDMA registers are different from eDMA, so this patch add support for HDMA NATIVE mode. HDMA write and read channels operate independently to maximize the performance of the HDMA read and write data transfer over the link When you configure the HDMA with multiple read channels, then it uses a round robin (RR) arbitration scheme to select the next read channel to be serviced.The same applies when you have multiple write channels. The native HDMA driver also supports a maximum of 16 independent channels (8 write + 8 read), which can run simultaneously. Both SAR (Source Address Register) and DAR (Destination Address Register) are aligned to byte. Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev> Reviewed-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Tested-by: Serge Semin <fancer.lancer@gmail.com> Link: https://lore.kernel.org/r/20230520050854.73160-4-cai.huoqing@linux.dev Signed-off-by: Vinod Koul <vkoul@kernel.org>
2023-05-24dmaengine: dw-edma: Rename dw_edma_core_ops structure to dw_edma_plat_opsCai Huoqing1-2/+2
The dw_edma_core_ops structure contains a set of the operations: device IRQ numbers getter, CPU/PCI address translation. Based on the functions semantics the structure name "dw_edma_plat_ops" looks more descriptive since indeed the operations are platform-specific. The "dw_edma_core_ops" name shall be used for a structure with the IP-core specific set of callbacks in order to abstract out DW eDMA and DW HDMA setups. Such structure will be added in one of the next commit in the framework of the set of changes adding the DW HDMA device support. Anyway the renaming was necessary to distinguish two types of the implementation callbacks: 1. DW eDMA/hDMA IP-core specific operations: device-specific CSR setups in one or another aspect of the DMA-engine initialization. 2. DW eDMA/hDMA platform specific operations: the DMA device environment configs like IRQs, address translation, etc. Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev> Reviewed-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Tested-by: Serge Semin <fancer.lancer@gmail.com> Link: https://lore.kernel.org/r/20230520050854.73160-2-cai.huoqing@linux.dev Signed-off-by: Vinod Koul <vkoul@kernel.org>
2023-05-24sysctl: Refactor base paths registrationsJoel Granados1-23/+0
This is part of the general push to deprecate register_sysctl_paths and register_sysctl_table. The old way of doing this through register_sysctl_base and DECLARE_SYSCTL_BASE macro is replaced with a call to register_sysctl_init. The 5 base paths affected are: "kernel", "vm", "debug", "dev" and "fs". We remove the register_sysctl_base function and the DECLARE_SYSCTL_BASE macro since they are no longer needed. In order to quickly acertain that the paths did not actually change I executed `find /proc/sys/ | sha1sum` and made sure that the sha was the same before and after the commit. We end up saving 563 bytes with this change: ./scripts/bloat-o-meter vmlinux.0.base vmlinux.1.refactor-base-paths add/remove: 0/5 grow/shrink: 2/0 up/down: 77/-640 (-563) Function old new delta sysctl_init_bases 55 111 +56 init_fs_sysctls 12 33 +21 vm_base_table 128 - -128 kernel_base_table 128 - -128 fs_base_table 128 - -128 dev_base_table 128 - -128 debug_base_table 128 - -128 Total: Before=21258215, After=21257652, chg -0.00% [mcgrof: modified to use register_sysctl_init() over register_sysctl() and add bloat-o-meter stats] Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Christian Brauner <brauner@kernel.org>
2023-05-24sysctl: stop exporting register_sysctl_tableJoel Granados1-7/+1
We make register_sysctl_table static because the only function calling it is in fs/proc/proc_sysctl.c (__register_sysctl_base). We remove it from the sysctl.h header and modify the documentation in both the header and proc_sysctl.c files to mention "register_sysctl" instead of "register_sysctl_table". This plus the commits that remove register_sysctl_table from parport save 217 bytes: ./scripts/bloat-o-meter .bsysctl/vmlinux.old .bsysctl/vmlinux.new add/remove: 0/1 grow/shrink: 5/1 up/down: 458/-675 (-217) Function old new delta __register_sysctl_base 8 286 +278 parport_proc_register 268 379 +111 parport_device_proc_register 195 247 +52 kzalloc.constprop 598 608 +10 parport_default_proc_register 62 69 +7 register_sysctl_table 291 - -291 parport_sysctl_template 1288 904 -384 Total: Before=8603076, After=8602859, chg -0.00% Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-24parport: Move magic number "15" to a defineJoel Granados1-0/+2
Put the size of a parport name behind a define so we can use it in other files. This is a preparation patch to be able to use this size in parport/procfs.c. Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-24net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGESDavid Howells1-0/+3
Add a function to handle MSG_SPLICE_PAGES being passed internally to sendmsg(). Pages are spliced into the given socket buffer if possible and copied in if not (e.g. they're slab pages or have a zero refcount). Signed-off-by: David Howells <dhowells@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24net: Pass max frags into skb_append_pagefrags()David Howells1-1/+1
Pass the maximum number of fragments into skb_append_pagefrags() rather than using MAX_SKB_FRAGS so that it can be used from code that wants to specify sysctl_max_skb_frags. Signed-off-by: David Howells <dhowells@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24net: Declare MSG_SPLICE_PAGES internal sendmsg() flagDavid Howells1-0/+3
Declare MSG_SPLICE_PAGES, an internal sendmsg() flag, that hints to a network protocol that it should splice pages from the source iterator rather than copying the data if it can. This flag is added to a list that is cleared by sendmsg syscalls on entry. This is intended as a replacement for the ->sendpage() op, allowing a way to splice in several multipage folios in one go. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24tracing: Rename stacktrace field to common_stacktraceSteven Rostedt (Google)1-0/+1
The histogram and synthetic events can use a pseudo event called "stacktrace" that will create a stacktrace at the time of the event and use it just like it was a normal field. We have other pseudo events such as "common_cpu" and "common_timestamp". To stay consistent with that, convert "stacktrace" to "common_stacktrace". As this was used in older kernels, to keep backward compatibility, this will act just like "common_cpu" did with "cpu". That is, "cpu" will be the same as "common_cpu" unless the event has a "cpu" field. In which case, the event's field is used. The same is true with "stacktrace". Also update the documentation to reflect this change. Link: https://lore.kernel.org/linux-trace-kernel/20230523230913.6860e28d@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-24tracing/user_events: Document user_event_mm one-shot list usageBeau Belgrave1-0/+1
During 6.4 development it became clear that the one-shot list used by the user_event_mm's next field was confusing to others. It is not clear how this list is protected or what the next field usage is for unless you are familiar with the code. Add comments into the user_event_mm struct indicating lock requirement and usage. Also document how and why this approach was used via comments in both user_event_enabler_update() and user_event_mm_get_all() and the rules to properly use it. Link: https://lkml.kernel.org/r/20230519230741.669-5-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-24tracing/user_events: Rename link fields for clarityBeau Belgrave1-1/+1
Currently most list_head fields of various structs within user_events are simply named link. This causes folks to keep additional context in their head when working with the code, which can be confusing. Instead of using link, describe what the actual link is, for example: list_del_rcu(&mm->link); Changes into: list_del_rcu(&mm->mms_link); The reader now is given a hint the link is to the mms global list instead of having to remember or spot check within the code. Link: https://lkml.kernel.org/r/20230519230741.669-4-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-24regmap: Merge up v6.4-rc3Mark Brown13-18/+43
Merge up v6.4-rc3 to get fixes which make my CI more stable.
2023-05-24vfio/pci: Probe and store ability to support dynamic MSI-XReinette Chatre1-0/+1
Not all MSI-X devices support dynamic MSI-X allocation. Whether a device supports dynamic MSI-X should be queried using pci_msix_can_alloc_dyn(). Instead of scattering code with pci_msix_can_alloc_dyn(), probe this ability once and store it as a property of the virtual device. Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/f1ae022c060ecb7e527f4f53c8ccafe80768da47.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-05-24vfio/pci: Use bitfield for struct vfio_pci_core_device flagsReinette Chatre1-11/+11
struct vfio_pci_core_device contains eleven boolean flags. Boolean flags clearly indicate their usage but space usage starts to be a concern when there are many. An upcoming change adds another boolean flag to struct vfio_pci_core_device, thereby increasing the concern that the boolean flags are consuming unnecessary space. Transition the boolean flags to use bitfields. On a system that uses one byte per boolean this reduces the space consumed by existing flags from 11 bytes to 2 bytes with room for a few more flags without increasing the structure's size. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/cf34bf0499c889554a8105eeb18cc0ab673005be.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-05-24vfio/pci: Remove interrupt context counterReinette Chatre1-1/+0
struct vfio_pci_core_device::num_ctx counts how many interrupt contexts have been allocated. When all interrupt contexts are allocated simultaneously num_ctx provides the upper bound of all vectors that can be used as indices into the interrupt context array. With the upcoming support for dynamic MSI-X the number of interrupt contexts does not necessarily span the range of allocated interrupts. Consequently, num_ctx is no longer a trusted upper bound for valid indices. Stop using num_ctx to determine if a provided vector is valid. Use the existence of allocated interrupt. This changes behavior on the error path when user space provides an invalid vector range. Behavior changes from early exit without any modifications to possible modifications to valid vectors within the invalid range. This is acceptable considering that an invalid range is not a valid scenario, see link to discussion. The checks that ensure that user space provides a range of vectors that is valid for the device are untouched. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/lkml/20230316155646.07ae266f.alex.williamson@redhat.com/ Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/e27d350f02a65b8cbacd409b4321f5ce35b3186d.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-05-24vfio/pci: Use xarray for interrupt context storageReinette Chatre1-1/+1
Interrupt context is statically allocated at the time interrupts are allocated. Following allocation, the context is managed by directly accessing the elements of the array using the vector as index. The storage is released when interrupts are disabled. It is possible to dynamically allocate a single MSI-X interrupt after MSI-X is enabled. A dynamic storage for interrupt context is needed to support this. Replace the interrupt context array with an xarray (similar to what the core uses as store for MSI descriptors) that can support the dynamic expansion while maintaining the custom that uses the vector as index. With a dynamic storage it is no longer required to pre-allocate interrupt contexts at the time the interrupts are allocated. MSI and MSI-X interrupt contexts are only used when interrupts are enabled. Their allocation can thus be delayed until interrupt enabling. Only enabled interrupts will have associated interrupt contexts. Whether an interrupt has been allocated (a Linux irq number exists for it) becomes the criteria for whether an interrupt can be enabled. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/lkml/20230404122444.59e36a99.alex.williamson@redhat.com/ Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/40e235f38d427aff79ae35eda0ced42502aa0937.1683740667.git.reinette.chatre@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-05-24bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commandsAndrii Nakryiko1-2/+2
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall forces users to specify pinning location as a string-based absolute or relative (to current working directory) path. This has various implications related to security (e.g., symlink-based attacks), forces BPF FS to be exposed in the file system, which can cause races with other applications. One of the feedbacks we got from folks working with containers heavily was that inability to use purely FD-based location specification was an unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET commands. This patch closes this oversight, adding path_fd field to BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by *at() syscalls for dirfd + pathname combinations. This now allows interesting possibilities like working with detached BPF FS mount (e.g., to perform multiple pinnings without running a risk of someone interfering with them), and generally making pinning/getting more secure and not prone to any races and/or security attacks. This is demonstrated by a selftest added in subsequent patch that takes advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate creating detached BPF FS mount, pinning, and then getting BPF map out of it, all while never exposing this private instance of BPF FS to outside worlds. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-23iio: trigger: Add simple trigger_validation helperMatti Vaittinen1-0/+1
Some triggers can only be attached to the IIO device that corresponds to the same physical device. Implement generic helper which can be used as a validate_trigger callback for such devices. Suggested-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com> Link: https://lore.kernel.org/r/51cd3e3e74a6addf8d333f4a109fb9c5a11086ee.1683541225.git.mazziesaccount@gmail.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2023-05-23rcuwait: Support timeoutsDavidlohr Bueso1-3/+20
The rcuwait utility provides an efficient and safe single wait/wake mechanism. It is used in situations where queued wait is the wrong semantics, and often too bulky. For example, cases where the wait is already done under a lock. In the past, rcuwait has been extended to support beyond only uninterruptible sleep, and similarly, there are users that can benefit for the addition of timeouts. As such, tntroduce rcuwait_wait_event_timeout(), with semantics equivalent to calls for queued wait counterparts. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/r/20230523170927.20685-2-dave@stgolabs.net Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-05-23regulator: expose regulator_find_closest_biggerSebastian Reichel1-0/+2
Expose and document the table lookup logic used by regulator_set_ramp_delay_regmap, so that it can be reused for devices that cannot be configured via regulator_set_ramp_delay_regmap. Tested-by: Diederik de Haas <didi.debian@cknow.org> # Rock64, Quartz64 Model A + B Tested-by: Vincent Legoll <vincent.legoll@gmail.com> # Pine64 QuartzPro64 Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com> Link: https://lore.kernel.org/r/20230504173618.142075-11-sebastian.reichel@collabora.com Signed-off-by: Mark Brown <broonie@kernel.org>
2023-05-23block/rq_qos: protect rq_qos apis with a new lockYu Kuai1-0/+1
commit 50e34d78815e ("block: disable the elevator int del_gendisk") move rq_qos_exit() from disk_release() to del_gendisk(), this will introduce some problems: 1) If rq_qos_add() is triggered by enabling iocost/iolatency through cgroupfs, then it can concurrent with del_gendisk(), it's not safe to write 'q->rq_qos' concurrently. 2) Activate cgroup policy that is relied on rq_qos will call rq_qos_add() and blkcg_activate_policy(), and if rq_qos_exit() is called in the middle, null-ptr-dereference will be triggered in blkcg_activate_policy(). 3) blkg_conf_open_bdev() can call blkdev_get_no_open() first to find the disk, then if rq_qos_exit() from del_gendisk() is done before rq_qos_add(), then memory will be leaked. This patch add a new disk level mutex 'rq_qos_mutex': 1) The lock will protect rq_qos_exit() directly. 2) For wbt that doesn't relied on blk-cgroup, rq_qos_add() can only be called from disk initialization for now because wbt can't be destructed until rq_qos_exit(), so it's safe not to protect wbt for now. Hoever, in case that rq_qos dynamically destruction is supported in the furture, this patch also protect rq_qos_add() from wbt_init() directly, this is enough because blk-sysfs already synchronize writers with disk removal. 3) For iocost and iolatency, in order to synchronize disk removal and cgroup configuration, the lock is held after blkdev_get_no_open() from blkg_conf_open_bdev(), and is released in blkg_conf_exit(). In order to fix the above memory leak, disk_live() is checked after holding the new lock. Fixes: 50e34d78815e ("block: disable the elevator int del_gendisk") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230414084008.2085155-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-23block: remove redundant req_op in blk_rq_is_passthroughLi Nan1-1/+1
op &= REQ_OP_MASK in blk_op_is_passthrough() is exactly what req_op() do. Therefore, it is redundant to call req_op() for blk_op_is_passthrough(). Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230522085355.1740772-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-23bpf, sockmap: Improved check for empty queueJohn Fastabend1-1/+0
We noticed some rare sk_buffs were stepping past the queue when system was under memory pressure. The general theory is to skip enqueueing sk_buffs when its not necessary which is the normal case with a system that is properly provisioned for the task, no memory pressure and enough cpu assigned. But, if we can't allocate memory due to an ENOMEM error when enqueueing the sk_buff into the sockmap receive queue we push it onto a delayed workqueue to retry later. When a new sk_buff is received we then check if that queue is empty. However, there is a problem with simply checking the queue length. When a sk_buff is being processed from the ingress queue but not yet on the sockmap msg receive queue its possible to also recv a sk_buff through normal path. It will check the ingress queue which is zero and then skip ahead of the pkt being processed. Previously we used sock lock from both contexts which made the problem harder to hit, but not impossible. To fix instead of popping the skb from the queue entirely we peek the skb from the queue and do the copy there. This ensures checks to the queue length are non-zero while skb is being processed. Then finally when the entire skb has been copied to user space queue or another socket we pop it off the queue. This way the queue length check allows bypassing the queue only after the list has been completely processed. To reproduce issue we run NGINX compliance test with sockmap running and observe some flakes in our testing that we attributed to this issue. Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
2023-05-23bpf, sockmap: Convert schedule_work into delayed_workJohn Fastabend1-1/+1
Sk_buffs are fed into sockmap verdict programs either from a strparser (when the user might want to decide how framing of skb is done by attaching another parser program) or directly through tcp_read_sock. The tcp_read_sock is the preferred method for performance when the BPF logic is a stream parser. The flow for Cilium's common use case with a stream parser is, tcp_read_sock() sk_psock_verdict_recv ret = bpf_prog_run_pin_on_cpu() sk_psock_verdict_apply(sock, skb, ret) // if system is under memory pressure or app is slow we may // need to queue skb. Do this queuing through ingress_skb and // then kick timer to wake up handler skb_queue_tail(ingress_skb, skb) schedule_work(work); The work queue is wired up to sk_psock_backlog(). This will then walk the ingress_skb skb list that holds our sk_buffs that could not be handled, but should be OK to run at some later point. However, its possible that the workqueue doing this work still hits an error when sending the skb. When this happens the skbuff is requeued on a temporary 'state' struct kept with the workqueue. This is necessary because its possible to partially send an skbuff before hitting an error and we need to know how and where to restart when the workqueue runs next. Now for the trouble, we don't rekick the workqueue. This can cause a stall where the skbuff we just cached on the state variable might never be sent. This happens when its the last packet in a flow and no further packets come along that would cause the system to kick the workqueue from that side. To fix we could do simple schedule_work(), but while under memory pressure it makes sense to back off some instead of continue to retry repeatedly. So instead to fix convert schedule_work to schedule_delayed_work and add backoff logic to reschedule from backlog queue on errors. Its not obvious though what a good backoff is so use '1'. To test we observed some flakes whil running NGINX compliance test with sockmap we attributed these failed test to this bug and subsequent issue. >From on list discussion. This commit bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock") was intended to address similar race, but had a couple cases it missed. Most obvious it only accounted for receiving traffic on the local socket so if redirecting into another socket we could still get an sk_buff stuck here. Next it missed the case where copied=0 in the recv() handler and then we wouldn't kick the scheduler. Also its sub-optimal to require userspace to kick the internal mechanisms of sockmap to wake it up and copy data to user. It results in an extra syscall and requires the app to actual handle the EAGAIN correctly. Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
2023-05-23ALSA: usb-audio: Define USB MIDI 2.0 specsTakashi Iwai1-0/+94
Define new structs and constants from USB MIDI 2.0 specification, to be used in the upcoming MIDI 2.0 support in USB-audio driver. A new class-specific endpoint descriptor and group terminal block descriptors are defined. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Jaroslav Kysela <perex@perex.cz> Link: https://lore.kernel.org/r/20230523075358.9672-9-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2023-05-23net/mlx5: DR, Check force-loopback RC QP capability independently from RoCEYevgeny Kliteynik1-1/+3
SW Steering uses RC QP for writing STEs to ICM. This writingis done in LB (loopback), and FL (force-loopback) QP is preferred for performance. FL is available when RoCE is enabled or disabled based on RoCE caps. This patch adds reading of FL capability from HCA caps in addition to the existing reading from RoCE caps, thus fixing the case where we didn't have loopback enabled when RoCE was disabled. Fixes: 7304d603a57a ("net/mlx5: DR, Add support for force-loopback QP") Signed-off-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-05-23Merge patch series "Add Command Duration Limits support"Martin K. Petersen3-14/+45
Niklas Cassel <nks@flawful.org> says: This series adds support for Command Duration Limits. The series is based on linux tag: v6.4-rc1 The series can also be found in git: https://github.com/floatious/linux/commits/cdl-v7 ================= CDL in ATA / SCSI ================= Command Duration Limits is defined in: T13 ATA Command Set - 5 (ACS-5) and T10 SCSI Primary Commands - 6 (SPC-6) respectively (a simpler version of CDL is defined in T10 SPC-5). CDL defines Duration Limits Descriptors (DLD). 7 DLDs for read commands and 7 DLDs for write commands. Simply put, a DLD contains a limit and a policy. A command can specify that a certain limit should be applied by setting the DLD index field (3 bits, so 0-7) in the command itself. The DLD index points to one of the 7 DLDs. DLD index 0 means no descriptor, so no limit. DLD index 1-7 means DLD 1-7. A DLD can have a few different policies, but the two major ones are: -Policy 0xF (abort), command will be completed with command aborted error (ATA) or status CHECK CONDITION (SCSI), with sense data indicating that the command timed out. -Policy 0xD (complete-unavailable), command will be completed without error (ATA) or status GOOD (SCSI), with sense data indicating that the command timed out. Note that the command will not have transferred any data to/from the device when the command timed out, even though the command returned success. Regardless of the CDL policy, in case of a CDL timeout, the I/O will result in a -ETIME error to user-space. The DLDs are defined in the CDL log page(s) and are readable and writable. Reading and writing the CDL DLDs are outside the scope of the kernel. If a user wants to read or write the descriptors, they can do so using a user-space application that sends passthrough commands, such as cdl-tools: https://github.com/westerndigitalcorporation/cdl-tools ================================ The introduction of ioprio hints ================================ What the kernel does provide, is a method to let I/O use one of the CDL DLDs defined in the device. Note that the kernel will simply forward the DLD index to the device, so the kernel currently does not know, nor does it need to know, how the DLDs are defined inside the device. The way that the CDL DLD index is supplied to the kernel is by introducing a new 10 bit "ioprio hint" field within the existing 16 bit ioprio definition. Currently, only 6 out of the 16 ioprio bits are in use, the remaining 10 bits are unused, and are currently explicitly disallowed to be set by the kernel. For now, we only add ioprio hints representing CDL DLD index 1-7. Additional ioprio hints for other QoS features could be defined in the future. A theoretical future work could be to make an I/O scheduler aware of these hints. E.g. for CDL, an I/O scheduler could make use of the duration limit in each descriptor, and take that information into account while scheduling commands. Right now, the ioprio hints will be ignored by the I/O schedulers. ============================== How to use CDL from user-space ============================== Since CDL is mutually exclusive with NCQ priority (see ncq_prio_enable and sas_ncq_prio_enable in Documentation/ABI/testing/sysfs-block-device), CDL has to be explicitly enabled using: echo 1 > /sys/block/$bdev/device/cdl_enable Since the ioprio hints are supplied through the existing I/O priority API, it should be simple for an application to make use of the ioprio hints. It simply has to reuse one of the new macros defined in include/uapi/linux/ioprio.h: IOPRIO_PRIO_HINT() or IOPRIO_PRIO_VALUE_HINT(), and supply one of the new hints defined in include/uapi/linux/ioprio.h: IOPRIO_HINT_DEV_DURATION_LIMIT_[1-7], which indicates that the I/O should use the corresponding CDL DLD index 1-7. By reusing the I/O priority API, the user can both define a DLD to use per AIO (io_uring sqe->ioprio or libaio iocb->aio_reqprio) or per-thread (ioprio_set()). ======= Testing ======= With the following fio patches: https://github.com/floatious/fio/commits/cdl fio adds support for ioprio hints, such that CDL can be tested using e.g.: fio --ioengine=io_uring --cmdprio_percentage=10 --cmdprio_hint=DLD_index A simple way to test is to use a DLD with a very short duration limit, and send large reads. Regardless of the CDL policy, in case of a CDL timeout, the I/O will result in a -ETIME error to user-space. We also provide a CDL test suite located in the cdl-tools repo, see: https://github.com/westerndigitalcorporation/cdl-tools#testing-a-system-command-duration-limits-support We have tested this patch series using: -real hardware -the following QEMU implementation: https://github.com/floatious/qemu/tree/cdl (NOTE: the QEMU implementation requires you to define the CDL policy at compile time, so you currently need to recompile QEMU when switching between policies.) =================== Further information =================== For further information about CDL, see Damien's slides: Presented at SDC 2021: https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf Presented at Lund Linux Con 2022: https://drive.google.com/file/d/1I6ChFc0h4JY9qZdO1bY5oCAdYCSZVqWw/view?usp=sharing ================ Changes since V6 ================ -Rebased series on v6.4-rc1. -Picked up Reviewed-by tags from Hannes (Thank you Hannes!) -Picked up Reviewed-by tag from Christoph (Thank you Christoph!) -Changed KernelVersion from 6.4 to 6.5 for new sysfs attributes. For older change logs, see previous patch series versions: https://lore.kernel.org/linux-scsi/20230406113252.41211-1-nks@flawful.org/ https://lore.kernel.org/linux-scsi/20230404182428.715140-1-nks@flawful.org/ https://lore.kernel.org/linux-scsi/20230309215516.3800571-1-niklas.cassel@wdc.com/ https://lore.kernel.org/linux-scsi/20230124190308.127318-1-niklas.cassel@wdc.com/ https://lore.kernel.org/linux-scsi/20230112140412.667308-1-niklas.cassel@wdc.com/ https://lore.kernel.org/linux-scsi/20221208105947.2399894-1-niklas.cassel@wdc.com/ Link: https://lore.kernel.org/r/20230511011356.227789-1-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-23scsi: ata: libata: Handle completion of CDL commands using policy 0xDNiklas Cassel2-1/+13
A CDL timeout for policy 0xF is defined as a NCQ error, just with a CDL specific sk/asc/ascq in the sense data. Therefore, the existing code in libata does not need to be modified to handle a policy 0xF CDL timeout. For Command Duration Limits policy 0xD: The device shall complete the command without error with the additional sense code set to DATA CURRENTLY UNAVAILABLE. Since a CDL timeout for policy 0xD is not an error, we cannot use the NCQ Command Error log (10h). Instead, we need to read the Sense Data for Successful NCQ Commands log (0Fh). In the success case, just like in the error case, we cannot simply read a log page from the interrupt handler itself, since reading a log page involves sending a READ LOG DMA EXT or READ LOG EXT command. Therefore, we add a new EH action ATA_EH_GET_SUCCESS_SENSE. When a command completes without error, and when the ATA_SENSE bit is set, this new action is set as pending, and EH is scheduled. This way, similar to the NCQ error case, the log page will be read from EH context. An alternative would have been to add a new kthread or workqueue to handle this. However, extending EH can be done with minimal changes and avoids the need to synchronize a new kthread/workqueue with EH. Co-developed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230511011356.227789-20-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-23scsi: ata: libata: Set read/write commands CDL indexDamien Le Moal1-0/+1
For devices supporting the command duration limits feature, translate the dld field of read and write operation to set the command duration limit index field of the command task file when the duration limit feature is enabled. The function ata_set_tf_cdl() is introduced to do this. For unqueued (non NCQ) read and write operations, this function sets the command duration limit index set as the lower 3 bits of the feature field. For queued NCQ read/write commands, the index is set as the lower 3 bits of the auxiliary field. The flag ATA_QCFLAG_HAS_CDL is introduced to indicate that a command taskfile has a non zero cdl field. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Igor Pylypiv <ipylypiv@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230511011356.227789-19-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-23scsi: ata: libata: Add ATA feature control sub-page translationDamien Le Moal2-0/+4
Add support for the ATA feature control sub-page of the control mode page to enable/disable the command duration limits feature using the cdl_ctrl field of the ATA feature control sub-page. Both mode sense and mode select translation are supported. For mode sense, the ata device flag ATA_DFLAG_CDL_ENABLED is used to cache the status of the command duration limits feature. Enabling this feature is done using a SET FEATURES command with a cdl action set to 1 when the page cdl_ctrl field value is 0x2 (T2A and T2B pages supported). If this field is 0, CDL is disabled using the SET FEATURES command with a cdl action set to 0. Since a device CDL and NCQ priority features should not be used simultaneously, ata_mselect_control_ata_feature() returns an error when attempting to enable CDL with the device priority feature enabled. Conversely, the function ata_ncq_prio_enable_store() used to enable the use of the device NCQ priority feature through sysfs is modified to return an error if the device CDL feature is enabled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230511011356.227789-18-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-23scsi: ata: libata: Detect support for command duration limitsDamien Le Moal2-13/+21
Use the supported capabilities identify device data log page to detect if a device supports the command duration limits feature. For devices supporting this feature, set the device flag ATA_DFLAG_CDL. To support SCSI-ATA translation, retrieve the command duration limits log page 18h and cache this page content using the cdl array added to the ata_device data structure. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230511011356.227789-15-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-23scsi: block: Introduce BLK_STS_DURATION_LIMITDamien Le Moal1-0/+6
Introduce the new block I/O status BLK_STS_DURATION_LIMIT for LLDDs to report command that failed due to a command duration limit being exceeded. This new status is mapped to the ETIME error code to allow users to differentiate "soft" duration limit failures from other more serious hardware related errors. If we compare BLK_STS_DURATION_LIMIT with BLK_STS_TIMEOUT: -BLK_STS_DURATION_LIMIT means that the drive gave a reply indicating that the command duration limit was exceeded before the command could be completed. This I/O status is mapped to ETIME for user space. -BLK_STS_TIMEOUT means that the drive never gave a reply at all. This I/O status is mapped to ETIMEDOUT for user space. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230511011356.227789-4-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-22Merge patch series "Use block pr_ops in LIO"Martin K. Petersen3-10/+70
Mike Christie <michael.christie@oracle.com> says: The patches in this thread allow us to use the block pr_ops with LIO's target_core_iblock module to support cluster applications in VMs. They were built over Linus's tree. They also apply over linux-next and Martin's tree and Jens's trees. Currently, to use windows clustering or linux clustering (pacemaker + cluster labs scsi fence agents) in VMs with LIO and vhost-scsi, you have to use tcmu or pscsi or use a cluster aware FS/framework for the LIO pr file. Setting up a cluster FS/framework is pain and waste when your real backend device is already a distributed device, and pscsi and tcmu are nice for specific use cases, but iblock gives you the best performance and allows you to use stacked devices like dm-multipath. So these patches allow iblock to work like pscsi/tcmu where they can pass a PR command to the backend module. And then iblock will use the pr_ops to pass the PR command to the real devices similar to what we do for unmap today. The patches are separated in the following groups: Patch 1 - 2: - Add block layer callouts for reading reservations and rename reservation error code. Patch 3 - 5: - SCSI support for new callouts. Patch 6: - DM support for new callouts. Patch 7 - 13: - NVMe support for new callouts. Patch 14 - 18: - LIO support for new callouts. This patchset has been tested with the libiscsi PGR ops and with window's failover cluster verification test. Note that for scsi backend devices we need this patchset: https://lore.kernel.org/linux-scsi/20230123221046.125483-1-michael.christie@oracle.com/T/#m4834a643ffb5bac2529d65d40906d3cfbdd9b1b7 to handle UAs. To reduce the size of this patchset that's being done separately to make reviewing easier. And to make merging easier this patchset and the one above do not have any conflicts so can be merged in different trees. Link: https://lore.kernel.org/r/20230407200551.12660-1-michael.christie@oracle.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-22exportfs: add explicit flag to request non-decodeable file handlesAmir Goldstein1-1/+11
So far, all callers of exportfs_encode_inode_fh(), except for fsnotify's show_mark_fhandle(), check that filesystem can decode file handles, but we would like to add more callers that do not require a file handle that can be decoded. Introduce a flag to explicitly request a file handle that may not to be decoded later and a wrapper exportfs_encode_fid() that sets this flag and convert show_mark_fhandle() to use the new wrapper. This will be used to allow adding fanotify support to filesystems that do not support NFS export. Acked-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230502124817.3070545-3-amir73il@gmail.com>
2023-05-22exportfs: change connectable argument to bit flagsAmir Goldstein1-2/+4
Convert the bool connectable arguemnt into a bit flags argument and define the EXPORT_FS_CONNECTABLE flag as a requested property of the file handle. We are going to add a flag for requesting non-decodeable file handles. Acked-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230502124817.3070545-2-amir73il@gmail.com>
2023-05-22iommu: Use flush queue capabilityRobin Murphy1-0/+1
It remains really handy to have distinct DMA domain types within core code for the sake of default domain policy selection, but we can now hide that detail from drivers by using the new capability instead. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Jerry Snitselaar <jsnitsel@redhat.com> # amd, intel, smmu-v3 Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1c552d99e8ba452bdac48209fa74c0bdd52fd9d9.1683233867.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-05-22iommu: Add a capability for flush queue supportRobin Murphy1-0/+5
Passing a special type to domain_alloc to indirectly query whether flush queues are a worthwhile optimisation with the given driver is a bit clunky, and looking increasingly anachronistic. Let's put that into an explicit capability instead. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Jerry Snitselaar <jsnitsel@redhat.com> # amd, intel, smmu-v3 Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/f0086a93dbccb92622e1ace775846d81c1c4b174.1683233867.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-05-22nubus: Don't list slot resources by defaultFinn Thain1-0/+1
Some Nubus card ROMs contain many slot resources. A single Radius video card produced well over a thousand entries under /proc/bus/nubus/. Populating /proc/bus/nubus/ on a slow machine with several such cards installed takes long enough that the user may think that the system is wedged. All those procfs entries also consume significant RAM though they are not normally needed (except by developers). Omit these resources from /proc/bus/nubus/ by default and add a kernel parameter to enable them when needed. On the test machine, this saved 300 kB and 10 seconds. Cc: Brad Boyer <flar@allandria.com> Reviewed-by: Brad Boyer <flar@allandria.com> Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@linux-m68k.org> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/71ed7fb234a5f7381a50253b0d841a656d53e64c.1684200125.git.fthain@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2023-05-22net: phy: add helpers for comparing phy IDsRussell King1-0/+28
There are several places which open code comparing PHY IDs. Provide a couple of helpers to assist with this, using a slightly simpler test than the original: - phy_id_compare() compares two arbitary PHY IDs and a mask of the significant bits in the ID. - phydev_id_compare() compares the bound phydev with the specified PHY ID, using the bound driver's mask. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>