summaryrefslogtreecommitdiff
path: root/drivers/xen
AgeCommit message (Collapse)AuthorFilesLines
2021-03-17xen/events: avoid handling the same event on two cpus at the same timeJuergen Gross2-8/+18
commit b6622798bc50b625a1e62f82c7190df40c1f5b21 upstream. When changing the cpu affinity of an event it can happen today that (with some unlucky timing) the same event will be handled on the old and the new cpu at the same time. Avoid that by adding an "event active" flag to the per-event data and call the handler only if this flag isn't set. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Link: https://lore.kernel.org/r/20210306161833.4552-4-jgross@suse.com Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17xen/events: don't unmask an event channel when an eoi is pendingJuergen Gross4-49/+88
commit 25da4618af240fbec6112401498301a6f2bc9702 upstream. An event channel should be kept masked when an eoi is pending for it. When being migrated to another cpu it might be unmasked, though. In order to avoid this keep three different flags for each event channel to be able to distinguish "normal" masking/unmasking from eoi related masking/unmasking and temporary masking. The event channel should only be able to generate an interrupt if all flags are cleared. Cc: stable@vger.kernel.org Fixes: 54c9de89895e ("xen/events: add a new "late EOI" evtchn framework") Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Ross Lagerwall <ross.lagerwall@citrix.com> Link: https://lore.kernel.org/r/20210306161833.4552-3-jgross@suse.com [boris -- corrected Fixed tag format] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17xen/events: reset affinity of 2-level event when tearing it downJuergen Gross3-0/+24
commit 9e77d96b8e2724ed00380189f7b0ded61113b39f upstream. When creating a new event channel with 2-level events the affinity needs to be reset initially in order to avoid using an old affinity from earlier usage of the event channel port. So when tearing an event channel down reset all affinity bits. The same applies to the affinity when onlining a vcpu: all old affinity settings for this vcpu must be reset. As percpu events get initialized before the percpu event channel hook is called, resetting of the affinities happens after offlining a vcpu (this is working, as initial percpu memory is zeroed out). Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Link: https://lore.kernel.org/r/20210306161833.4552-2-jgross@suse.com Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23xen-scsiback: don't "handle" error by BUG()Jan Beulich1-2/+2
commit 7c77474b2d22176d2bfb592ec74e0f2cb71352c9 upstream. In particular -ENOMEM may come back here, from set_foreign_p2m_mapping(). Don't make problems worse, the more that handling elsewhere (together with map's status fields now indicating whether a mapping wasn't even attempted, and hence has to be considered failed) doesn't require this odd way of dealing with errors. This is part of XSA-362. Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: stable@vger.kernel.org Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23Xen/gntdev: correct error checking in gntdev_map_grant_pages()Jan Beulich1-8/+9
commit ebee0eab08594b2bd5db716288a4f1ae5936e9bc upstream. Failure of the kernel part of the mapping operation should also be indicated as an error to the caller, or else it may assume the respective kernel VA is okay to access. Furthermore gnttab_map_refs() failing still requires recording successfully mapped handles, so they can be unmapped subsequently. This in turn requires there to be a way to tell full hypercall failure from partial success - preset map_op status fields such that they won't "happen" to look as if the operation succeeded. Also again use GNTST_okay instead of implying its value (zero). This is part of XSA-361. Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: stable@vger.kernel.org Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()Jan Beulich1-3/+13
commit dbe5283605b3bc12ca45def09cc721a0a5c853a2 upstream. We may not skip setting the field in the unmap structure when GNTMAP_device_map is in use - such an unmap would fail to release the respective resources (a page ref in the hypervisor). Otoh the field doesn't need setting at all when GNTMAP_device_map is not in use. To record the value for unmapping, we also better don't use our local p2m: In particular after a subsequent change it may not have got updated for all the batch elements. Instead it can simply be taken from the respective map's results. We can additionally avoid playing this game altogether for the kernel part of the mappings in (x86) PV mode. This is part of XSA-361. Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: stable@vger.kernel.org Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23arm/xen: Don't probe xenbus as part of an early initcallJulien Grall2-2/+1
commit c4295ab0b485b8bc50d2264bcae2acd06f25caaf upstream. After Commit 3499ba8198cad ("xen: Fix event channel callback via INTX/GSI"), xenbus_probe() will be called too early on Arm. This will recent to a guest hang during boot. If the hang wasn't there, we would have ended up to call xenbus_probe() twice (the second time is in xenbus_probe_initcall()). We don't need to initialize xenbus_probe() early for Arm guest. Therefore, the call in xen_guest_init() is now removed. After this change, there is no more external caller for xenbus_probe(). So the function is turned to a static one. Interestingly there were two prototypes for it. Cc: stable@vger.kernel.org Fixes: 3499ba8198cad ("xen: Fix event channel callback via INTX/GSI") Reported-by: Ian Jackson <iwj@xenproject.org> Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/20210210170654.5377-1-julien@xen.org Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-04xen: Fix XenStore initialisation for XS_LOCALDavid Woodhouse1-0/+31
commit 5f46400f7a6a4fad635d5a79e2aa5a04a30ffea1 upstream. In commit 3499ba8198ca ("xen: Fix event channel callback via INTX/GSI") I reworked the triggering of xenbus_probe(). I tried to simplify things by taking out the workqueue based startup triggered from wake_waiting(); the somewhat poorly named xenbus IRQ handler. I missed the fact that in the XS_LOCAL case (Dom0 starting its own xenstored or xenstore-stubdom, which happens after the kernel is booted completely), that IRQ-based trigger is still actually needed. So... put it back, except more cleanly. By just spawning a xenbus_probe thread which waits on xb_waitq and runs the probe the first time it gets woken, just as the workqueue-based hack did. This is actually a nicer approach for *all* the back ends with different interrupt methods, and we can switch them all over to that without the complex conditions for when to trigger it. But not in -rc6. This is the minimal fix for the regression, although it's a step in the right direction instead of doing a partial revert and actually putting the workqueue back. It's also simpler than the workqueue. Fixes: 3499ba8198ca ("xen: Fix event channel callback via INTX/GSI") Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/4c9af052a6e0f6485d1de43f2c38b1461996db99.camel@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com> Cc: Salvatore Bonaccorso <carnil@debian.org> Cc: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30xen: Fix event channel callback via INTX/GSIDavid Woodhouse5-33/+68
[ Upstream commit 3499ba8198cad47b731792e5e56b9ec2a78a83a2 ] For a while, event channel notification via the PCI platform device has been broken, because we attempt to communicate with xenstore before we even have notifications working, with the xs_reset_watches() call in xs_init(). We tend to get away with this on Xen versions below 4.0 because we avoid calling xs_reset_watches() anyway, because xenstore might not cope with reading a non-existent key. And newer Xen *does* have the vector callback support, so we rarely fall back to INTX/GSI delivery. To fix it, clean up a bit of the mess of xs_init() and xenbus_probe() startup. Call xs_init() directly from xenbus_init() only in the !XS_HVM case, deferring it to be called from xenbus_probe() in the XS_HVM case instead. Then fix up the invocation of xenbus_probe() to happen either from its device_initcall if the callback is available early enough, or when the callback is finally set up. This means that the hack of calling xenbus_probe() from a workqueue after the first interrupt, or directly from the PCI platform device setup, is no longer needed. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20210113132606.422794-2-dwmw2@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-29xenbus/xenbus_backend: Disallow pending watch messagesSeongJae Park1-0/+7
commit 9996bd494794a2fe393e97e7a982388c6249aa76 upstream. 'xenbus_backend' watches 'state' of devices, which is writable by guests. Hence, if guests intensively updates it, dom0 will have lots of pending events that exhausting memory of dom0. In other words, guests can trigger dom0 memory pressure. This is known as XSA-349. However, the watch callback of it, 'frontend_changed()', reads only 'state', so doesn't need to have the pending events. To avoid the problem, this commit disallows pending watch messages for 'xenbus_backend' using the 'will_handle()' watch callback. This is part of XSA-349 Cc: stable@vger.kernel.org Signed-off-by: SeongJae Park <sjpark@amazon.de> Reported-by: Michael Kurth <mku@amazon.de> Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-29xen/xenbus: Count pending messages for each watchSeongJae Park1-11/+18
commit 3dc86ca6b4c8cfcba9da7996189d1b5a358a94fc upstream. This commit adds a counter of pending messages for each watch in the struct. It is used to skip unnecessary pending messages lookup in 'unregister_xenbus_watch()'. It could also be used in 'will_handle' callback. This is part of XSA-349 Cc: stable@vger.kernel.org Signed-off-by: SeongJae Park <sjpark@amazon.de> Reported-by: Michael Kurth <mku@amazon.de> Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-29xen/xenbus/xen_bus_type: Support will_handle watch callbackSeongJae Park2-1/+4
commit be987200fbaceaef340872841d4f7af2c5ee8dc3 upstream. This commit adds support of the 'will_handle' watch callback for 'xen_bus_type' users. This is part of XSA-349 Cc: stable@vger.kernel.org Signed-off-by: SeongJae Park <sjpark@amazon.de> Reported-by: Michael Kurth <mku@amazon.de> Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-29xen/xenbus: Add 'will_handle' callback support in xenbus_watch_path()SeongJae Park3-4/+9
commit 2e85d32b1c865bec703ce0c962221a5e955c52c2 upstream. Some code does not directly make 'xenbus_watch' object and call 'register_xenbus_watch()' but use 'xenbus_watch_path()' instead. This commit adds support of 'will_handle' callback in the 'xenbus_watch_path()' and it's wrapper, 'xenbus_watch_pathfmt()'. This is part of XSA-349 Cc: stable@vger.kernel.org Signed-off-by: SeongJae Park <sjpark@amazon.de> Reported-by: Michael Kurth <mku@amazon.de> Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-29xen/xenbus: Allow watches discard events before queueingSeongJae Park2-1/+5
commit fed1755b118147721f2c87b37b9d66e62c39b668 upstream. If handling logics of watch events are slower than the events enqueue logic and the events can be created from the guests, the guests could trigger memory pressure by intensively inducing the events, because it will create a huge number of pending events that exhausting the memory. Fortunately, some watch events could be ignored, depending on its handler callback. For example, if the callback has interest in only one single path, the watch wouldn't want multiple pending events. Or, some watches could ignore events to same path. To let such watches to volutarily help avoiding the memory pressure situation, this commit introduces new watch callback, 'will_handle'. If it is not NULL, it will be called for each new event just before enqueuing it. Then, if the callback returns false, the event will be discarded. No watch is using the callback for now, though. This is part of XSA-349 Cc: stable@vger.kernel.org Signed-off-by: SeongJae Park <sjpark@amazon.de> Reported-by: Michael Kurth <mku@amazon.de> Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/events: block rogue events for some timeJuergen Gross2-6/+24
commit 5f7f77400ab5b357b5fdb7122c3442239672186c upstream. In order to avoid high dom0 load due to rogue guests sending events at high frequency, block those events in case there was no action needed in dom0 to handle the events. This is done by adding a per-event counter, which set to zero in case an EOI without the XEN_EOI_FLAG_SPURIOUS is received from a backend driver, and incremented when this flag has been set. In case the counter is 2 or higher delay the EOI by 1 << (cnt - 2) jiffies, but not more than 1 second. In order not to waste memory shorten the per-event refcnt to two bytes (it should normally never exceed a value of 2). Add an overflow check to evtchn_get() to make sure the 2 bytes really won't overflow. This is part of XSA-332. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/events: defer eoi in case of excessive number of eventsJuergen Gross4-32/+208
commit e99502f76271d6bc4e374fe368c50c67a1fd3070 upstream. In case rogue guests are sending events at high frequency it might happen that xen_evtchn_do_upcall() won't stop processing events in dom0. As this is done in irq handling a crash might be the result. In order to avoid that, delay further inter-domain events after some time in xen_evtchn_do_upcall() by forcing eoi processing into a worker on the same cpu, thus inhibiting new events coming in. The time after which eoi processing is to be delayed is configurable via a new module parameter "event_loop_timeout" which specifies the maximum event loop time in jiffies (default: 2, the value was chosen after some tests showing that a value of 2 was the lowest with an only slight drop of dom0 network throughput while multiple guests performed an event storm). How long eoi processing will be delayed can be specified via another parameter "event_eoi_delay" (again in jiffies, default 10, again the value was chosen after testing with different delay values). This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/events: use a common cpu hotplug hook for event channelsJuergen Gross3-21/+47
commit 7beb290caa2adb0a399e735a1e175db9aae0523a upstream. Today only fifo event channels have a cpu hotplug callback. In order to prepare for more percpu (de)init work move that callback into events_base.c and add percpu_init() and percpu_deinit() hooks to struct evtchn_ops. This is part of XSA-332. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/events: switch user event channels to lateeoi modelJuergen Gross1-4/+3
commit c44b849cee8c3ac587da3b0980e01f77500d158c upstream. Instead of disabling the irq when an event is received and enabling it again when handled by the user process use the lateeoi model. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/pciback: use lateeoi irq bindingJuergen Gross4-20/+56
commit c2711441bc961b37bba0615dd7135857d189035f upstream. In order to reduce the chance for the system becoming unresponsive due to event storms triggered by a misbehaving pcifront use the lateeoi irq binding for pciback and unmask the event channel only just before leaving the event handling function. Restructure the handling to support that scheme. Basically an event can come in for two reasons: either a normal request for a pciback action, which is handled in a worker, or in case the guest has finished an AER request which was requested by pciback. When an AER request is issued to the guest and a normal pciback action is currently active issue an EOI early in order to be able to receive another event when the AER request has been finished by the guest. Let the worker processing the normal requests run until no further request is pending, instead of starting a new worker ion that case. Issue the EOI only just before leaving the worker. This scheme allows to drop calling the generic function xen_pcibk_test_and_schedule_op() after processing of any request as the handling of both request types is now separated more cleanly. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/pvcallsback: use lateeoi irq bindingJuergen Gross1-30/+46
commit c8d647a326f06a39a8e5f0f1af946eacfa1835f8 upstream. In order to reduce the chance for the system becoming unresponsive due to event storms triggered by a misbehaving pvcallsfront use the lateeoi irq binding for pvcallsback and unmask the event channel only after handling all write requests, which are the ones coming in via an irq. This requires modifying the logic a little bit to not require an event for each write request, but to keep the ioworker running until no further data is found on the ring page to be processed. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/scsiback: use lateeoi irq bindingJuergen Gross1-10/+13
commit 86991b6e7ea6c613b7692f65106076943449b6b7 upstream. In order to reduce the chance for the system becoming unresponsive due to event storms triggered by a misbehaving scsifront use the lateeoi irq binding for scsiback and unmask the event channel only just before leaving the event handling function. In case of a ring protocol error don't issue an EOI in order to avoid the possibility to use that for producing an event storm. This at once will result in no further call of scsiback_irq_fn(), so the ring_error struct member can be dropped and scsiback_do_cmd_fn() can signal the protocol error via a negative return value. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/events: add a new "late EOI" evtchn frameworkJuergen Gross1-17/+134
commit 54c9de89895e0a36047fcc4ae754ea5b8655fb9d upstream. In order to avoid tight event channel related IRQ loops add a new framework of "late EOI" handling: the IRQ the event channel is bound to will be masked until the event has been handled and the related driver is capable to handle another event. The driver is responsible for unmasking the event channel via the new function xen_irq_lateeoi(). This is similar to binding an event channel to a threaded IRQ, but without having to structure the driver accordingly. In order to support a future special handling in case a rogue guest is sending lots of unsolicited events, add a flag to xen_irq_lateeoi() which can be set by the caller to indicate the event was a spurious one. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/events: fix race in evtchn_fifo_unmask()Juergen Gross1-4/+9
commit f01337197419b7e8a492e83089552b77d3b5fb90 upstream. Unmasking a fifo event channel can result in unmasking it twice, once directly in the kernel and once via a hypercall in case the event was pending. Fix that by doing the local unmask only if the event is not pending. This is part of XSA-332. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/events: add a proper barrier to 2-level uevent unmaskingJuergen Gross1-0/+2
commit 4d3fe31bd993ef504350989786858aefdb877daa upstream. A follow-up patch will require certain write to happen before an event channel is unmasked. While the memory barrier is not strictly necessary for all the callers, the main one will need it. In order to avoid an extra memory barrier when using fifo event channels, mandate evtchn_unmask() to provide write ordering. The 2-level event handling unmask operation is missing an appropriate barrier, so add it. Fifo event channels are fine in this regard due to using sync_cmpxchg(). This is part of XSA-332. Cc: stable@vger.kernel.org Suggested-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Wei Liu <wl@xen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18xen/events: avoid removing an event channel while handling itJuergen Gross1-5/+35
commit 073d0552ead5bfc7a3a9c01de590e924f11b5dd2 upstream. Today it can happen that an event channel is being removed from the system while the event handling loop is active. This can lead to a race resulting in crashes or WARN() splats when trying to access the irq_info structure related to the event channel. Fix this problem by using a rwlock taken as reader in the event handling loop and as writer when deallocating the irq_info structure. As the observed problem was a NULL dereference in evtchn_from_irq() make this function more robust against races by testing the irq_info pointer to be not NULL before dereferencing it. And finally make all accesses to evtchn_to_irq[row][col] atomic ones in order to avoid seeing partial updates of an array element in irq handling. Note that irq handling can be entered only for event channels which have been valid before, so any not populated row isn't a problem in this regard, as rows are only ever added and never removed. This is XSA-331. Cc: stable@vger.kernel.org Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reported-by: Jinoh Kang <luke1337@theori.io> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10xen/events: don't use chip_data for legacy IRQsJuergen Gross1-8/+21
commit 0891fb39ba67bd7ae023ea0d367297ffff010781 upstream. Since commit c330fb1ddc0a ("XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.") Xen is using the chip_data pointer for storing IRQ specific data. When running as a HVM domain this can result in problems for legacy IRQs, as those might use chip_data for their own purposes. Use a local array for this purpose in case of legacy IRQs, avoiding the double use. Cc: stable@vger.kernel.org Fixes: c330fb1ddc0a ("XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.") Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Stefan Bader <stefan.bader@canonical.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20200930091614.13660-1-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com> [bwh: Backported to 4.9: adjust context] Signed-off-by: Ben Hutchings <benh@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09xen/xenbus: Fix granting of vmalloc'd memorySimon Leiner1-2/+8
[ Upstream commit d742db70033c745e410523e00522ee0cfe2aa416 ] On some architectures (like ARM), virt_to_gfn cannot be used for vmalloc'd memory because of its reliance on virt_to_phys. This patch introduces a check for vmalloc'd addresses and obtains the PFN using vmalloc_to_pfn in that case. Signed-off-by: Simon Leiner <simon@leiner.me> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/20200825093153.35500-1-simon@leiner.me Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN ↵Thomas Gleixner1-8/+8
data pointer which contains XEN specific information. commit c330fb1ddc0a922f044989492b7fcca77ee1db46 upstream. handler data is meant for interrupt handlers and not for storing irq chip specific information as some devices require handler data to store internal per interrupt information, e.g. pinctrl/GPIO chained interrupt handlers. This obviously creates a conflict of interests and crashes the machine because the XEN pointer is overwritten by the driver pointer. As the XEN data is not handler specific it should be stored in irqdesc::irq_data::chip_data instead. A simple sed s/irq_[sg]et_handler_data/irq_[sg]et_chip_data/ cures that. Cc: stable@vger.kernel.org Reported-by: Roman Shaposhnik <roman@zededa.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Roman Shaposhnik <roman@zededa.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/87lfi2yckt.fsf@nanos.tec.linutronix.de Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-26xen: don't reschedule in preemption off sectionsJuergen Gross1-1/+1
For support of long running hypercalls xen_maybe_preempt_hcall() is calling cond_resched() in case a hypercall marked as preemptible has been interrupted. Normally this is no problem, as only hypercalls done via some ioctl()s are marked to be preemptible. In rare cases when during such a preemptible hypercall an interrupt occurs and any softirq action is started from irq_exit(), a further hypercall issued by the softirq handler will be regarded to be preemptible, too. This might lead to rescheduling in spite of the softirq handler potentially having set preempt_disable(), leading to splats like: BUG: sleeping function called from invalid context at drivers/xen/preempt.c:37 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 20775, name: xl INFO: lockdep is turned off. CPU: 1 PID: 20775 Comm: xl Tainted: G D W 5.4.46-1_prgmr_debug.el7.x86_64 #1 Call Trace: <IRQ> dump_stack+0x8f/0xd0 ___might_sleep.cold.76+0xb2/0x103 xen_maybe_preempt_hcall+0x48/0x70 xen_do_hypervisor_callback+0x37/0x40 RIP: e030:xen_hypercall_xen_version+0xa/0x20 Code: ... RSP: e02b:ffffc900400dcc30 EFLAGS: 00000246 RAX: 000000000004000d RBX: 0000000000000200 RCX: ffffffff8100122a RDX: ffff88812e788000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffffff83ee3ad0 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000246 R12: ffff8881824aa0b0 R13: 0000000865496000 R14: 0000000865496000 R15: ffff88815d040000 ? xen_hypercall_xen_version+0xa/0x20 ? xen_force_evtchn_callback+0x9/0x10 ? check_events+0x12/0x20 ? xen_restore_fl_direct+0x1f/0x20 ? _raw_spin_unlock_irqrestore+0x53/0x60 ? debug_dma_sync_single_for_cpu+0x91/0xc0 ? _raw_spin_unlock_irqrestore+0x53/0x60 ? xen_swiotlb_sync_single_for_cpu+0x3d/0x140 ? mlx4_en_process_rx_cq+0x6b6/0x1110 [mlx4_en] ? mlx4_en_poll_rx_cq+0x64/0x100 [mlx4_en] ? net_rx_action+0x151/0x4a0 ? __do_softirq+0xed/0x55b ? irq_exit+0xea/0x100 ? xen_evtchn_do_upcall+0x2c/0x40 ? xen_do_hypervisor_callback+0x29/0x40 </IRQ> ? xen_hypercall_domctl+0xa/0x20 ? xen_hypercall_domctl+0x8/0x20 ? privcmd_ioctl+0x221/0x990 [xen_privcmd] ? do_vfs_ioctl+0xa5/0x6f0 ? ksys_ioctl+0x60/0x90 ? trace_hardirqs_off_thunk+0x1a/0x20 ? __x64_sys_ioctl+0x16/0x20 ? do_syscall_64+0x62/0x250 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe Fix that by testing preempt_count() before calling cond_resched(). In kernel 5.8 this can't happen any more due to the entry code rework (more than 100 patches, so not a candidate for backporting). The issue was introduced in kernel 4.3, so this patch should go into all stable kernels in [4.3 ... 5.7]. Reported-by: Sarah Newman <srn@prgmr.com> Fixes: 0fa2f5cb2b0ecd8 ("sched/preempt, xen: Use need_resched() instead of should_resched()") Cc: Sarah Newman <srn@prgmr.com> Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Chris Brannon <cmb@prgmr.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21xen/balloon: make the balloon wait interruptibleRoger Pau Monne1-2/+4
commit 88a479ff6ef8af7f07e11593d58befc644244ff7 upstream. So it can be killed, or else processes can get hung indefinitely waiting for balloon pages. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200727091342.52325-3-roger.pau@citrix.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21xen/balloon: fix accounting in alloc_xenballooned_pages error pathRoger Pau Monne1-0/+6
commit 1951fa33ec259abdf3497bfee7b63e7ddbb1a394 upstream. target_unpopulated is incremented with nr_pages at the start of the function, but the call to free_xenballooned_pages will only subtract pgno number of pages, and thus the rest need to be subtracted before returning or else accounting will be skewed. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200727091342.52325-2-roger.pau@citrix.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-20xen/pvcalls-back: test for errors when calling backend_connect()Juergen Gross1-1/+2
commit c8d70a29d6bbc956013f3401f92a4431a9385a3c upstream. backend_connect() can fail, so switch the device to connected only if no error occurred. Fixes: 0a9c75c2c7258f2 ("xen/pvcalls: xenbus state handling") Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20200511074231.19794-1-jgross@suse.com Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-02xen/xenbus: ensure xenbus_map_ring_valloc() returns proper grant statusJuergen Gross1-1/+8
[ Upstream commit 6b51fd3f65a22e3d1471b18a1d56247e246edd46 ] xenbus_map_ring_valloc() maps a ring page and returns the status of the used grant (0 meaning success). There are Xen hypervisors which might return the value 1 for the status of a failed grant mapping due to a bug. Some callers of xenbus_map_ring_valloc() test for errors by testing the returned status to be less than zero, resulting in no error detected and crashing later due to a not available ring page. Set the return value of xenbus_map_ring_valloc() to GNTST_general_error in case the grant status reported by Xen is greater than zero. This is part of XSA-316. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wl@xen.org> Link: https://lore.kernel.org/r/20200326080358.1018-1-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-02xenbus: req->err should be updated before req->stateDongli Zhang1-0/+2
[ Upstream commit 8130b9d5b5abf26f9927b487c15319a187775f34 ] This patch adds the barrier to guarantee that req->err is always updated before req->state. Otherwise, read_reply() would not return ERR_PTR(req->err) but req->body, when process_writes()->xb_write() is failed. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20200303221423.21962-2-dongli.zhang@oracle.com Reviewed-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-02xenbus: req->body should be updated before req->stateDongli Zhang2-3/+8
[ Upstream commit 1b6a51e86cce38cf4d48ce9c242120283ae2f603 ] The req->body should be updated before req->state is updated and the order should be guaranteed by a barrier. Otherwise, read_reply() might return req->body = NULL. Below is sample callstack when the issue is reproduced on purpose by reordering the updates of req->body and req->state and adding delay in code between updates of req->state and req->body. [ 22.356105] general protection fault: 0000 [#1] SMP PTI [ 22.361185] CPU: 2 PID: 52 Comm: xenwatch Not tainted 5.5.0xen+ #6 [ 22.366727] Hardware name: Xen HVM domU, BIOS ... [ 22.372245] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60 ... ... [ 22.392163] RSP: 0018:ffffb2d64023fdf0 EFLAGS: 00010246 [ 22.395933] RAX: 0000000000000000 RBX: 75746e7562755f6d RCX: 0000000000000000 [ 22.400871] RDX: 0000000000000000 RSI: ffffb2d64023fdfc RDI: 75746e7562755f6d [ 22.405874] RBP: 0000000000000000 R08: 00000000000001e8 R09: 0000000000cdcdcd [ 22.410945] R10: ffffb2d6402ffe00 R11: ffff9d95395eaeb0 R12: ffff9d9535935000 [ 22.417613] R13: ffff9d9526d4a000 R14: ffff9d9526f4f340 R15: ffff9d9537654000 [ 22.423726] FS: 0000000000000000(0000) GS:ffff9d953bc80000(0000) knlGS:0000000000000000 [ 22.429898] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 22.434342] CR2: 000000c4206a9000 CR3: 00000001ea3fc002 CR4: 00000000001606e0 [ 22.439645] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 22.444941] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 22.450342] Call Trace: [ 22.452509] simple_strtoull+0x27/0x70 [ 22.455572] xenbus_transaction_start+0x31/0x50 [ 22.459104] netback_changed+0x76c/0xcc1 [xen_netfront] [ 22.463279] ? find_watch+0x40/0x40 [ 22.466156] xenwatch_thread+0xb4/0x150 [ 22.469309] ? wait_woken+0x80/0x80 [ 22.472198] kthread+0x10e/0x130 [ 22.474925] ? kthread_park+0x80/0x80 [ 22.477946] ret_from_fork+0x35/0x40 [ 22.480968] Modules linked in: xen_kbdfront xen_fbfront(+) xen_netfront xen_blkfront [ 22.486783] ---[ end trace a9222030a747c3f7 ]--- [ 22.490424] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60 The virt_rmb() is added in the 'true' path of test_reply(). The "while" is changed to "do while" so that test_reply() is used as a read memory barrier. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20200303221423.21962-1-dongli.zhang@oracle.com Reviewed-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28xen: Enable interrupts when calling _cond_resched()Thomas Gleixner1-1/+3
commit 8645e56a4ad6dcbf504872db7f14a2f67db88ef2 upstream. xen_maybe_preempt_hcall() is called from the exception entry point xen_do_hypervisor_callback with interrupts disabled. _cond_resched() evades the might_sleep() check in cond_resched() which would have caught that and schedule_debug() unfortunately lacks a check for irqs_disabled(). Enable interrupts around the call and use cond_resched() to catch future issues. Fixes: fdfd811ddde3 ("x86/xen: allow privcmd hypercalls to be preempted") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/878skypjrh.fsf@nanos.tec.linutronix.de Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15xen/balloon: Support xend-based toolstack take twoJuergen Gross1-1/+1
commit eda4eabf86fd6806eaabc23fb90dd056fdac037b upstream. Commit 3aa6c19d2f38be ("xen/balloon: Support xend-based toolstack") tried to fix a regression with running on rather ancient Xen versions. Unfortunately the fix was based on the assumption that xend would just use another Xenstore node, but in reality only some downstream versions of xend are doing that. The upstream xend does not write that Xenstore node at all, so the problem must be fixed in another way. The easiest way to achieve that is to fall back to the behavior before commit 96edd61dcf4436 ("xen/balloon: don't online new memory initially") in case the static memory maximum can't be read. This is achieved by setting static_max to the current number of memory pages known by the system resulting in target_diff becoming zero. Fixes: 3aa6c19d2f38be ("xen/balloon: Support xend-based toolstack") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: <stable@vger.kernel.org> # 4.13 Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-27net: add {READ|WRITE}_ONCE() annotations on ->rskq_accept_headEric Dumazet1-1/+1
[ Upstream commit 60b173ca3d1cd1782bd0096dc17298ec242f6fb1 ] reqsk_queue_empty() is called from inet_csk_listen_poll() while other cpus might write ->rskq_accept_head value. Use {READ|WRITE}_ONCE() to avoid compiler tricks and potential KCSAN splats. Fixes: fff1f3001cc5 ("tcp: add a spinlock to protect struct request_sock_queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27xen, cpu_hotplug: Prevent an out of bounds accessDan Carpenter1-1/+1
[ Upstream commit 201676095dda7e5b31a5e1d116d10fc22985075e ] The "cpu" variable comes from the sscanf() so Smatch marks it as untrusted data. We can't pass a higher value than "nr_cpu_ids" to cpu_possible() or it results in an out of bounds access. Fixes: d68d82afd4c8 ("xen: implement CPU hotplugging") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09xen/balloon: fix ballooned page accounting without hotplug enabledJuergen Gross1-1/+2
[ Upstream commit c673ec61ade89bf2f417960f986bc25671762efb ] When CONFIG_XEN_BALLOON_MEMORY_HOTPLUG is not defined reserve_additional_memory() will set balloon_stats.target_pages to a wrong value in case there are still some ballooned pages allocated via alloc_xenballooned_pages(). This will result in balloon_process() no longer be triggered when ballooned pages are freed in batches. Reported-by: Nicholas Tsirakis <niko.tsirakis@gmail.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05xen/pciback: Check dev_data before using itRoss Lagerwall1-1/+2
[ Upstream commit 1669907e3d1abfa3f7586e2d55dbbc117b5adba2 ] If pcistub_init_device fails, the release function will be called with dev_data set to NULL. Check it before using it to avoid a NULL pointer dereference. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01mm/memory_hotplug: make add_memory() take the device_hotplug_lockDavid Hildenbrand1-0/+3
[ Upstream commit 8df1d0e4a265f25dc1e7e7624ccdbcb4a6630c89 ] add_memory() currently does not take the device_hotplug_lock, however is aleady called under the lock from arch/powerpc/platforms/pseries/hotplug-memory.c drivers/acpi/acpi_memhotplug.c to synchronize against CPU hot-remove and similar. In general, we should hold the device_hotplug_lock when adding memory to synchronize against online/offline request (e.g. from user space) - which already resulted in lock inversions due to device_lock() and mem_hotplug_lock - see 30467e0b3be ("mm, hotplug: fix concurrent memory hot-add deadlock"). add_memory()/add_memory_resource() will create memory block devices, so this really feels like the right thing to do. Holding the device_hotplug_lock makes sure that a memory block device can really only be accessed (e.g. via .online/.state) from user space, once the memory has been fully added to the system. The lock is not held yet in drivers/xen/balloon.c arch/powerpc/platforms/powernv/memtrace.c drivers/s390/char/sclp_cmd.c drivers/hv/hv_balloon.c So, let's either use the locked variants or take the lock. Don't export add_memory_resource(), as it once was exported to be used by XEN, which is never built as a module. If somebody requires it, we also have to export a locked variant (as device_hotplug_lock is never exported). Link: http://lkml.kernel.org/r/20180925091457.28651-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: John Allen <jallen@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mathieu Malaterre <malat@debian.org> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-11xen/pci: reserve MCFG areas earlierIgor Druzhinin1-6/+15
[ Upstream commit a4098bc6eed5e31e0391bcc068e61804c98138df ] If MCFG area is not reserved in E820, Xen by default will defer its usage until Dom0 registers it explicitly after ACPI parser recognizes it as a reserved resource in DSDT. Having it reserved in E820 is not mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2) and firmware is free to keep a hole in E820 in that place. Xen doesn't know what exactly is inside this hole since it lacks full ACPI view of the platform therefore it's potentially harmful to access MCFG region without additional checks as some machines are known to provide inconsistent information on the size of the region. Now xen_mcfg_late() runs after acpi_init() which is too late as some basic PCI enumeration starts exactly there as well. Trying to register a device prior to MCFG reservation causes multiple problems with PCIe extended capability initializations in Xen (e.g. SR-IOV VF BAR sizing). There are no convenient hooks for us to subscribe to so register MCFG areas earlier upon the first invocation of xen_add_device(). It should be safe to do once since all the boot time buses must have their MCFG areas in MCFG table already and we don't support PCI bus hot-plug. Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-11xen/xenbus: fix self-deadlock after killing user processJuergen Gross1-2/+18
commit a8fabb38525c51a094607768bac3ba46b3f4a9d5 upstream. In case a user process using xenbus has open transactions and is killed e.g. via ctrl-C the following cleanup of the allocated resources might result in a deadlock due to trying to end a transaction in the xenbus worker thread: [ 2551.474706] INFO: task xenbus:37 blocked for more than 120 seconds. [ 2551.492215] Tainted: P OE 5.0.0-29-generic #5 [ 2551.510263] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2551.528585] xenbus D 0 37 2 0x80000080 [ 2551.528590] Call Trace: [ 2551.528603] __schedule+0x2c0/0x870 [ 2551.528606] ? _cond_resched+0x19/0x40 [ 2551.528632] schedule+0x2c/0x70 [ 2551.528637] xs_talkv+0x1ec/0x2b0 [ 2551.528642] ? wait_woken+0x80/0x80 [ 2551.528645] xs_single+0x53/0x80 [ 2551.528648] xenbus_transaction_end+0x3b/0x70 [ 2551.528651] xenbus_file_free+0x5a/0x160 [ 2551.528654] xenbus_dev_queue_reply+0xc4/0x220 [ 2551.528657] xenbus_thread+0x7de/0x880 [ 2551.528660] ? wait_woken+0x80/0x80 [ 2551.528665] kthread+0x121/0x140 [ 2551.528667] ? xb_read+0x1d0/0x1d0 [ 2551.528670] ? kthread_park+0x90/0x90 [ 2551.528673] ret_from_fork+0x35/0x40 Fix this by doing the cleanup via a workqueue instead. Reported-by: James Dingwall <james@dingwall.me.uk> Fixes: fd8aa9095a95c ("xen: optimize xenbus driver for multiple concurrent xenstore accesses") Cc: <stable@vger.kernel.org> # 4.11 Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-25xen/pciback: remove set but not used variable 'old_state'YueHaibing1-2/+1
[ Upstream commit 09e088a4903bd0dd911b4f1732b250130cdaffed ] Fixes gcc '-Wunused-but-set-variable' warning: drivers/xen/xen-pciback/conf_space_capability.c: In function pm_ctrl_write: drivers/xen/xen-pciback/conf_space_capability.c:119:25: warning: variable old_state set but not used [-Wunused-but-set-variable] It is never used so can be removed. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06xen/swiotlb: fix condition for calling xen_destroy_contiguous_region()Juergen Gross1-2/+2
commit 50f6393f9654c561df4cdcf8e6cfba7260143601 upstream. The condition in xen_swiotlb_free_coherent() for deciding whether to call xen_destroy_contiguous_region() is wrong: in case the region to be freed is not contiguous calling xen_destroy_contiguous_region() is the wrong thing to do: it would result in inconsistent mappings of multiple PFNs to the same MFN. This will lead to various strange crashes or data corruption. Instead of calling xen_destroy_contiguous_region() in that case a warning should be issued as that situation should never occur. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31xen/events: fix binding user event channels to cpusJuergen Gross2-3/+11
commit bce5963bcb4f9934faa52be323994511d59fd13c upstream. When binding an interdomain event channel to a vcpu via IOCTL_EVTCHN_BIND_INTERDOMAIN not only the event channel needs to be bound, but the affinity of the associated IRQi must be changed, too. Otherwise the IRQ and the event channel won't be moved to another vcpu in case the original vcpu they were bound to is going offline. Cc: <stable@vger.kernel.org> # 4.13 Fixes: c48f64ab472389df ("xen-evtchn: Bind dyn evtchn:qemu-dm interrupt to next online VCPU") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31xen: let alloc_xenballooned_pages() fail if not enough memory freeJuergen Gross1-3/+13
commit a1078e821b605813b63bf6bca414a85f804d5c66 upstream. Instead of trying to allocate pages with GFP_USER in add_ballooned_pages() check the available free memory via si_mem_available(). GFP_USER is far less limiting memory exhaustion than the test via si_mem_available(). This will avoid dom0 running out of memory due to excessive foreign page mappings especially on ARM and on x86 in PVH mode, as those don't have a pre-ballooned area which can be used for foreign mappings. As the normal ballooning suffers from the same problem don't balloon down more than si_mem_available() pages in one iteration. At the same time limit the default maximum number of retries. This is part of XSA-300. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11fs: stream_open - opener for stream-like files so that read and write can ↵Kirill Smelkov1-3/+1
run simultaneously without deadlock commit 10dce8af34226d90fa56746a934f8da5dcdba3df upstream. Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added locking for file.f_pos access and in particular made concurrent read and write not possible - now both those functions take f_pos lock for the whole run, and so if e.g. a read is blocked waiting for data, write will deadlock waiting for that read to complete. This caused regression for stream-like files where previously read and write could run simultaneously, but after that patch could not do so anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes to /proc/xen/xenbus") which fixes such regression for particular case of /proc/xen/xenbus. The patch that added f_pos lock in 2014 did so to guarantee POSIX thread safety for read/write/lseek and added the locking to file descriptors of all regular files. In 2014 that thread-safety problem was not new as it was already discussed earlier in 2006. However even though 2006'th version of Linus's patch was adding f_pos locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding the stream-like objects like pipes and sockets)", the 2014 version - the one that actually made it into the tree as 9c225f2655e3 - is doing so irregardless of whether a file is seekable or not. See https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/ https://lwn.net/Articles/180387 https://lwn.net/Articles/180396 for historic context. The reason that it did so is, probably, that there are many files that are marked non-seekable, but e.g. their read implementation actually depends on knowing current position to correctly handle the read. Some examples: kernel/power/user.c snapshot_read fs/debugfs/file.c u32_array_read fs/fuse/control.c fuse_conn_waiting_read + ... drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read arch/s390/hypfs/inode.c hypfs_read_iter ... Despite that, many nonseekable_open users implement read and write with pure stream semantics - they don't depend on passed ppos at all. And for those cases where read could wait for something inside, it creates a situation similar to xenbus - the write could be never made to go until read is done, and read is waiting for some, potentially external, event, for potentially unbounded time -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've found with semantic patch (see below): drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write() drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write() drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write() drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write() net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write() drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write() drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write() drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write() net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write() drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write() drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write() drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write() drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write() drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write() In addition to the cases above another regression caused by f_pos locking is that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can no longer implement bidirectional stream-like files - for the same reason as above e.g. read can deadlock write locking on file.f_pos in the kernel. FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse: implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and write routines not depending on current position at all, and with both read and write being potentially blocking operations: See https://github.com/libfuse/osspd https://lwn.net/Articles/308445 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510 Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as "somewhat pipe-like files ..." with read handler not using offset. However that test implements only read without write and cannot exercise the deadlock scenario: https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216 I've actually hit the read vs write deadlock for real while implementing my FUSE filesystem where there is /head/watch file, for which open creates separate bidirectional socket-like stream in between filesystem and its user with both read and write being later performed simultaneously. And there it is semantically not easy to split the stream into two separate read-only and write-only channels: https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169 Let's fix this regression. The plan is: 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would break many in-kernel nonseekable_open users which actually use ppos in read/write handlers. 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors. Read and write on such file descriptors would never use nor change ppos. And with that property on stream-like files read and write will be running without taking f_pos lock - i.e. read and write could be running simultaneously. 3. With semantic patch search and convert to stream_open all in-kernel nonseekable_open users for which read and write actually do not depend on ppos and where there is no other methods in file_operations which assume @offset access. 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open if that bit is present in filesystem open reply. It was tempting to change fs/fuse/ open handler to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in their open handler and this way avoid the deadlock on all kernel versions. This should work because fs/fuse/ ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. This patch adds stream_open, converts /proc/xen/xenbus to it and adds semantic patch to automatically locate in-kernel places that are either required to be converted due to read vs write deadlock, or that are just safe to be converted because read and write do not use ppos and there are no other funky methods in file_operations. Regarding semantic patch I've verified each generated change manually - that it is correct to convert - and each other nonseekable_open instance left - that it is either not correct to convert there, or that it is not converted due to current stream_open.cocci limitations. The script also does not convert files that should be valid to convert, but that currently have .llseek = noop_llseek or generic_file_llseek for unknown reason despite file being opened with nonseekable_open (e.g. drivers/input/mousedev.c) Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yongzhi Pan <panyongzhi@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Juergen Gross <jgross@suse.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Nikolaus Rath <Nikolaus@rath.org> Cc: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09xen/pciback: Don't disable PCI_COMMAND on PCI device reset.Konrad Rzeszutek Wilk1-2/+0
commit 7681f31ec9cdacab4fd10570be924f2cef6669ba upstream. There is no need for this at all. Worst it means that if the guest tries to write to BARs it could lead (on certain platforms) to PCI SERR errors. Please note that with af6fc858a35b90e89ea7a7ee58e66628c55c776b "xen-pciback: limit guest control of command register" a guest is still allowed to enable those control bits (safely), but is not allowed to disable them and that therefore a well behaved frontend which enables things before using them will still function correctly. This is done via an write to the configuration register 0x4 which triggers on the backend side: command_write \- pci_enable_device \- pci_enable_device_flags \- do_pci_enable_device \- pcibios_enable_device \-pci_enable_resourcess [which enables the PCI_COMMAND_MEMORY|PCI_COMMAND_IO] However guests (and drivers) which don't do this could cause problems, including the security issues which XSA-120 sought to address. Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>