summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-07-07Linux 5.12.15v5.12.15Sasha Levin1-1/+1
Tested-by: Fox Chen <foxhlchen@gmail.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Tested-by: Justin M. Forbes <jforbes@fedoraproject.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-07Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack"Sean Christopherson2-0/+2
commit f71a53d1180d5ecc346f0c6a23191d837fe2871b upstream. Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested virtualization. When KVM (L0) is using TDP, CR4.LA57 is not reflected in mmu_role.base.level because that tracks the shadow root level, i.e. TDP level. Normally, this is not an issue because LA57 can't be toggled while long mode is active, i.e. the guest has to first disable paging, then toggle LA57, then re-enable paging, thus ensuring an MMU reinitialization. But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57 without having to bounce through an unpaged section. L1 can also load a new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a single entry PML5 is sufficient. Such shenanigans are only problematic if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode. Note, in the L2 case with nested TDP, even though L1 can switch between L2s with different LA57 settings, thus bypassing the paging requirement, in that case KVM's nested_mmu will track LA57 in base.level. This reverts commit 8053f924cad30bf9f9a24e02b6c8ddfabf5202ea. Fixes: 8053f924cad3 ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-07gpio: AMD8111 and TQMX86 require HAS_IOPORT_MAPJohannes Berg1-0/+2
[ Upstream commit c6414e1a2bd26b0071e2b9d6034621f705dfd4c0 ] Both of these drivers use ioport_map(), so they need to depend on HAS_IOPORT_MAP. Otherwise, they cannot be built even with COMPILE_TEST on architectures without an ioport implementation, such as ARCH=um. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-07drm/nouveau: fix dma_address check for CPU/GPU syncChristian König1-2/+2
[ Upstream commit d330099115597bbc238d6758a4930e72b49ea9ba ] AGP for example doesn't have a dma_address array. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210614110517.1624-1-christian.koenig@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-07gpio: mxc: Fix disabled interrupt wake-up supportLoic Poulain1-1/+1
[ Upstream commit 3093e6cca3ba7d47848068cb256c489675125181 ] A disabled/masked interrupt marked as wakeup source must be re-enable and unmasked in order to be able to wake-up the host. That can be done by flaging the irqchip with IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND. Note: It 'sometimes' works without that change, but only thanks to the lazy generic interrupt disabling (keeping interrupt unmasked). Reported-by: Michal Koziel <michal.koziel@emlogic.no> Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-07s390/vfio-ap: clean up mdev resources when remove callback invokedTony Krowiak1-10/+0
[ Upstream commit 8c0795d2a0f50e2b131f5b2a8c2795939a94058e ] The mdev remove callback for the vfio_ap device driver bails out with -EBUSY if the mdev is in use by a KVM guest (i.e., the KVM pointer in the struct ap_matrix_mdev is not NULL). The intended purpose was to prevent the mdev from being removed while in use. There are two problems with this scenario: 1. Returning a non-zero return code from the remove callback does not prevent the removal of the mdev. 2. The KVM pointer in the struct ap_matrix_mdev will always be NULL because the remove callback will not get invoked until the mdev fd is closed. When the mdev fd is closed, the mdev release callback is invoked and clears the KVM pointer from the struct ap_matrix_mdev. Let's go ahead and remove the check for KVM in the remove callback and allow the cleanup of mdev resources to proceed. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20210609224634.575156-2-akrowiak@linux.ibm.com Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-07scsi: sr: Return appropriate error code when disk is ejectedManYi Li1-0/+2
[ Upstream commit 7dd753ca59d6c8cc09aa1ed24f7657524803c7f3 ] Handle a reported media event code of 3. This indicates that the media has been removed from the drive and user intervention is required to proceed. Return DISK_EVENT_EJECT_REQUEST in that case. Link: https://lore.kernel.org/r/20210611094402.23884-1-limanyi@uniontech.com Signed-off-by: ManYi Li <limanyi@uniontech.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30Linux 5.12.14v5.12.14Sasha Levin1-1/+1
Tested-by: Fox Chen <foxhlchen@gmail.com> Tested-by: Justin M. Forbes <jforbes@fedoraproject.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30integrity: Load mokx variables into the blacklist keyringEric Snowberg1-2/+18
[ Upstream commit ebd9c2ae369a45bdd9f8615484db09be58fc242b ] During boot the Secure Boot Forbidden Signature Database, dbx, is loaded into the blacklist keyring. Systems booted with shim have an equivalent Forbidden Signature Database called mokx. Currently mokx is only used by shim and grub, the contents are ignored by the kernel. Add the ability to load mokx into the blacklist keyring during boot. Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com> Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> cc: keyrings@vger.kernel.org Link: https://lore.kernel.org/r/c33c8e3839a41e9654f41cc92c7231104931b1d7.camel@HansenPartnership.com/ Link: https://lore.kernel.org/r/20210122181054.32635-5-eric.snowberg@oracle.com/ # v5 Link: https://lore.kernel.org/r/161428674320.677100.12637282414018170743.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/161433313205.902181.2502803393898221637.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161529607422.163428.13530426573612578854.stgit@warthog.procyon.org.uk/ # v3 Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30certs: Add ability to preload revocation certsEric Snowberg5-2/+68
[ Upstream commit d1f044103dad70c1cec0a8f3abdf00834fec8b98 ] Add a new Kconfig option called SYSTEM_REVOCATION_KEYS. If set, this option should be the filename of a PEM-formated file containing X.509 certificates to be included in the default blacklist keyring. DH Changes: - Make the new Kconfig option depend on SYSTEM_REVOCATION_LIST. - Fix SYSTEM_REVOCATION_KEYS=n, but CONFIG_SYSTEM_REVOCATION_LIST=y[1][2]. - Use CONFIG_SYSTEM_REVOCATION_LIST for extract-cert[3]. - Use CONFIG_SYSTEM_REVOCATION_LIST for revocation_certificates.o[3]. Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> cc: Randy Dunlap <rdunlap@infradead.org> cc: keyrings@vger.kernel.org Link: https://lore.kernel.org/r/e1c15c74-82ce-3a69-44de-a33af9b320ea@infradead.org/ [1] Link: https://lore.kernel.org/r/20210303034418.106762-1-eric.snowberg@oracle.com/ [2] Link: https://lore.kernel.org/r/20210304175030.184131-1-eric.snowberg@oracle.com/ [3] Link: https://lore.kernel.org/r/20200930201508.35113-3-eric.snowberg@oracle.com/ Link: https://lore.kernel.org/r/20210122181054.32635-4-eric.snowberg@oracle.com/ # v5 Link: https://lore.kernel.org/r/161428673564.677100.4112098280028451629.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/161433312452.902181.4146169951896577982.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161529606657.163428.3340689182456495390.stgit@warthog.procyon.org.uk/ # v3 Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30certs: Move load_system_certificate_list to a common functionEric Snowberg4-47/+70
[ Upstream commit 2565ca7f5ec1a98d51eea8860c4ab923f1ca2c85 ] Move functionality within load_system_certificate_list to a common function, so it can be reused in the future. DH Changes: - Added inclusion of common.h to common.c (Eric [1]). Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> cc: keyrings@vger.kernel.org Link: https://lore.kernel.org/r/EDA280F9-F72D-4181-93C7-CDBE95976FF7@oracle.com/ [1] Link: https://lore.kernel.org/r/20200930201508.35113-2-eric.snowberg@oracle.com/ Link: https://lore.kernel.org/r/20210122181054.32635-3-eric.snowberg@oracle.com/ # v5 Link: https://lore.kernel.org/r/161428672825.677100.7545516389752262918.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/161433311696.902181.3599366124784670368.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161529605850.163428.7786675680201528556.stgit@warthog.procyon.org.uk/ # v3 Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30certs: Add EFI_CERT_X509_GUID support for dbx entriesEric Snowberg6-0/+86
[ Upstream commit 56c5812623f95313f6a46fbf0beee7fa17c68bbf ] This fixes CVE-2020-26541. The Secure Boot Forbidden Signature Database, dbx, contains a list of now revoked signatures and keys previously approved to boot with UEFI Secure Boot enabled. The dbx is capable of containing any number of EFI_CERT_X509_SHA256_GUID, EFI_CERT_SHA256_GUID, and EFI_CERT_X509_GUID entries. Currently when EFI_CERT_X509_GUID are contained in the dbx, the entries are skipped. Add support for EFI_CERT_X509_GUID dbx entries. When a EFI_CERT_X509_GUID is found, it is added as an asymmetrical key to the .blacklist keyring. Anytime the .platform keyring is used, the keys in the .blacklist keyring are referenced, if a matching key is found, the key will be rejected. [DH: Made the following changes: - Added to have a config option to enable the facility. This allows a Kconfig solution to make sure that pkcs7_validate_trust() is enabled.[1][2] - Moved the functions out from the middle of the blacklist functions. - Added kerneldoc comments.] Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> cc: Randy Dunlap <rdunlap@infradead.org> cc: Mickaël Salaün <mic@digikod.net> cc: Arnd Bergmann <arnd@kernel.org> cc: keyrings@vger.kernel.org Link: https://lore.kernel.org/r/20200901165143.10295-1-eric.snowberg@oracle.com/ # rfc Link: https://lore.kernel.org/r/20200909172736.73003-1-eric.snowberg@oracle.com/ # v2 Link: https://lore.kernel.org/r/20200911182230.62266-1-eric.snowberg@oracle.com/ # v3 Link: https://lore.kernel.org/r/20200916004927.64276-1-eric.snowberg@oracle.com/ # v4 Link: https://lore.kernel.org/r/20210122181054.32635-2-eric.snowberg@oracle.com/ # v5 Link: https://lore.kernel.org/r/161428672051.677100.11064981943343605138.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/161433310942.902181.4901864302675874242.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161529605075.163428.14625520893961300757.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/bc2c24e3-ed68-2521-0bf4-a1f6be4a895d@infradead.org/ [1] Link: https://lore.kernel.org/r/20210225125638.1841436-1-arnd@kernel.org/ [2] Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30Revert "drm: add a locked version of drm_is_current_master"Daniel Vetter1-32/+19
commit f54b3ca7ea1e5e02f481cf4ca54568e57bd66086 upstream. This reverts commit 1815d9c86e3090477fbde066ff314a7e9721ee0f. Unfortunately this inverts the locking hierarchy, so back to the drawing board. Full lockdep splat below: ====================================================== WARNING: possible circular locking dependency detected 5.13.0-rc7-CI-CI_DRM_10254+ #1 Not tainted ------------------------------------------------------ kms_frontbuffer/1087 is trying to acquire lock: ffff88810dcd01a8 (&dev->master_mutex){+.+.}-{3:3}, at: drm_is_current_master+0x1b/0x40 but task is already holding lock: ffff88810dcd0488 (&dev->mode_config.mutex){+.+.}-{3:3}, at: drm_mode_getconnector+0x1c6/0x4a0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&dev->mode_config.mutex){+.+.}-{3:3}: __mutex_lock+0xab/0x970 drm_client_modeset_probe+0x22e/0xca0 __drm_fb_helper_initial_config_and_unlock+0x42/0x540 intel_fbdev_initial_config+0xf/0x20 [i915] async_run_entry_fn+0x28/0x130 process_one_work+0x26d/0x5c0 worker_thread+0x37/0x380 kthread+0x144/0x170 ret_from_fork+0x1f/0x30 -> #1 (&client->modeset_mutex){+.+.}-{3:3}: __mutex_lock+0xab/0x970 drm_client_modeset_commit_locked+0x1c/0x180 drm_client_modeset_commit+0x1c/0x40 __drm_fb_helper_restore_fbdev_mode_unlocked+0x88/0xb0 drm_fb_helper_set_par+0x34/0x40 intel_fbdev_set_par+0x11/0x40 [i915] fbcon_init+0x270/0x4f0 visual_init+0xc6/0x130 do_bind_con_driver+0x1e5/0x2d0 do_take_over_console+0x10e/0x180 do_fbcon_takeover+0x53/0xb0 register_framebuffer+0x22d/0x310 __drm_fb_helper_initial_config_and_unlock+0x36c/0x540 intel_fbdev_initial_config+0xf/0x20 [i915] async_run_entry_fn+0x28/0x130 process_one_work+0x26d/0x5c0 worker_thread+0x37/0x380 kthread+0x144/0x170 ret_from_fork+0x1f/0x30 -> #0 (&dev->master_mutex){+.+.}-{3:3}: __lock_acquire+0x151e/0x2590 lock_acquire+0xd1/0x3d0 __mutex_lock+0xab/0x970 drm_is_current_master+0x1b/0x40 drm_mode_getconnector+0x37e/0x4a0 drm_ioctl_kernel+0xa8/0xf0 drm_ioctl+0x1e8/0x390 __x64_sys_ioctl+0x6a/0xa0 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: &dev->master_mutex --> &client->modeset_mutex --> &dev->mode_config.mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&dev->mode_config.mutex); lock(&client->modeset_mutex); lock(&dev->mode_config.mutex); lock(&dev->master_mutex);
2021-06-30mm/hwpoison: do not lock page again when me_huge_page() successfully recoversNaoya Horiguchi1-14/+30
commit ea6d0630100b285f059d0a8d8e86f38a46407536 upstream. Currently me_huge_page() temporary unlocks page to perform some actions then locks it again later. My testcase (which calls hard-offline on some tail page in a hugetlb, then accesses the address of the hugetlb range) showed that page allocation code detects this page lock on buddy page and printed out "BUG: Bad page state" message. check_new_page_bad() does not consider a page with __PG_HWPOISON as bad page, so this flag works as kind of filter, but this filtering doesn't work in this case because the "bad page" is not the actual hwpoisoned page. So stop locking page again. Actions to be taken depend on the page type of the error, so page unlocking should be done in ->action() callbacks. So let's make it assumed and change all existing callbacks that way. Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30netfs: fix test for whether we can skip read when writing beyond EOFJeff Layton1-13/+41
commit 827a746f405d25f79560c7868474aec5aee174e1 upstream. It's not sufficient to skip reading when the pos is beyond the EOF. There may be data at the head of the page that we need to fill in before the write. Add a new helper function that corrects and clarifies the logic of when we can skip reads, and have it only zero out the part of the page that won't have data copied in for the write. Finally, don't set the page Uptodate after zeroing. It's not up to date since the write data won't have been copied in yet. [DH made the following changes: - Prefixed the new function with "netfs_". - Don't call zero_user_segments() for a full-page write. - Altered the beyond-last-page check to avoid a DIV instruction and got rid of then-redundant zero-length file check. ] [ Note: this fix is commit 827a746f405d in mainline kernels. The original bug was in ceph, but got lifted into the fs/netfs library for v5.13. This backport should apply to stable kernels v5.10 though v5.12. ] Fixes: e1b1240c1ff5f ("netfs: Add write_begin helper") Reported-by: Andrew W Elble <aweits@rit.edu> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> cc: ceph-devel@vger.kernel.org Link: https://lore.kernel.org/r/20210613233345.113565-1-jlayton@kernel.org/ Link: https://lore.kernel.org/r/162367683365.460125.4467036947364047314.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/162391826758.1173366.11794946719301590013.stgit@warthog.procyon.org.uk/ # v2 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30swiotlb: manipulate orig_addr when tlb_addr has offsetBumyong Lee1-1/+7
commit 5f89468e2f060031cd89fd4287298e0eaf246bf6 upstream. in case of driver wants to sync part of ranges with offset, swiotlb_tbl_sync_single() copies from orig_addr base to tlb_addr with offset and ends up with data mismatch. It was removed from "swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single", but said logic has to be added back in. From Linus's email: "That commit which the removed the offset calculation entirely, because the old (unsigned long)tlb_addr & (IO_TLB_SIZE - 1) was wrong, but instead of removing it, I think it should have just fixed it to be (tlb_addr - mem->start) & (IO_TLB_SIZE - 1); instead. That way the slot offset always matches the slot index calculation." (Unfortunatly that broke NVMe). The use-case that drivers are hitting is as follow: 1. Get dma_addr_t from dma_map_single() dma_addr_t tlb_addr = dma_map_single(dev, vaddr, vsize, DMA_TO_DEVICE); |<---------------vsize------------->| +-----------------------------------+ | | original buffer +-----------------------------------+ vaddr swiotlb_align_offset |<----->|<---------------vsize------------->| +-------+-----------------------------------+ | | | swiotlb buffer +-------+-----------------------------------+ tlb_addr 2. Do something 3. Sync dma_addr_t through dma_sync_single_for_device(..) dma_sync_single_for_device(dev, tlb_addr + offset, size, DMA_TO_DEVICE); Error case. Copy data to original buffer but it is from base addr (instead of base addr + offset) in original buffer: swiotlb_align_offset |<----->|<- offset ->|<- size ->| +-------+-----------------------------------+ | | |##########| | swiotlb buffer +-------+-----------------------------------+ tlb_addr |<- size ->| +-----------------------------------+ |##########| | original buffer +-----------------------------------+ vaddr The fix is to copy the data to the original buffer and take into account the offset, like so: swiotlb_align_offset |<----->|<- offset ->|<- size ->| +-------+-----------------------------------+ | | |##########| | swiotlb buffer +-------+-----------------------------------+ tlb_addr |<- offset ->|<- size ->| +-----------------------------------+ | |##########| | original buffer +-----------------------------------+ vaddr [One fix which was Linus's that made more sense to as it created a symmetry would break NVMe. The reason for that is the: unsigned int offset = (tlb_addr - mem->start) & (IO_TLB_SIZE - 1); would come up with the proper offset, but it would lose the alignment (which this patch contains).] Fixes: 16fc3cef33a0 ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single") Signed-off-by: Bumyong Lee <bumyong.lee@samsung.com> Signed-off-by: Chanho Park <chanho61.park@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Reported-by: Horia Geantă <horia.geanta@nxp.com> Tested-by: Horia Geantă <horia.geanta@nxp.com> CC: stable@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30KVM: SVM: Call SEV Guest Decommission if ASID binding failsAlper Gun1-11/+21
commit 934002cd660b035b926438244b4294e647507e13 upstream. Send SEV_CMD_DECOMMISSION command to PSP firmware if ASID binding fails. If a failure happens after a successful LAUNCH_START command, a decommission command should be executed. Otherwise, guest context will be unfreed inside the AMD SP. After the firmware will not have memory to allocate more SEV guest context, LAUNCH_START command will begin to fail with SEV_RET_RESOURCE_LIMIT error. The existing code calls decommission inside sev_unbind_asid, but it is not called if a failure happens before guest activation succeeds. If sev_bind_asid fails, decommission is never called. PSP firmware has a limit for the number of guests. If sev_asid_binding fails many times, PSP firmware will not have resources to create another guest context. Cc: stable@vger.kernel.org Fixes: 59414c989220 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_START command") Reported-by: Peter Gonda <pgonda@google.com> Signed-off-by: Alper Gun <alpergun@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210610174604.2554090-1-alpergun@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm, futex: fix shared futex pgoff on shmem huge pageHugh Dickins4-28/+9
commit fe19bd3dae3d15d2fbfdb3de8839a6ea0fe94264 upstream. If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong. When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when generating futex_key"), the only shared huge pages came from hugetlbfs, and the code added to deal with its exceptional page->index was put into hugetlb source. Then that was missed when 4.8 added shmem huge pages. page_to_pgoff() is what others use for this nowadays: except that, as currently written, it gives the right answer on hugetlbfs head, but nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well as on head. Yes, it's unconventional to declare hugetlb_basepage_index() there in pagemap.h, rather than in hugetlb.h; but I do not expect anything but page_to_pgoff() ever to need it. [akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope] Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Reported-by: Neel Natu <neelnatu@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Zhang Yi <wetpzy@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()Hugh Dickins1-0/+4
commit a7a69d8ba88d8dcee7ef00e91d413a4bd003a814 upstream. Aha! Shouldn't that quick scan over pte_none()s make sure that it holds ptlock in the PVMW_SYNC case? That too might have been responsible for BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though I've never seen any. Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/ Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Wang Yugui <wangyugui@e16-tech.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm/thp: fix page_vma_mapped_walk() if THP mapped by ptesHugh Dickins1-9/+25
commit a9a7504d9beaf395481faa91e70e2fd08f7a3dde upstream. Running certain tests with a DEBUG_VM kernel would crash within hours, on the total_mapcount BUG() in split_huge_page_to_list(), while trying to free up some memory by punching a hole in a shmem huge page: split's try_to_unmap() was unable to find all the mappings of the page (which, on a !DEBUG_VM kernel, would then keep the huge page pinned in memory). Crash dumps showed two tail pages of a shmem huge page remained mapped by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end of a long unmapped range; and no page table had yet been allocated for the head of the huge page to be mapped into. Although designed to handle these odd misaligned huge-page-mapped-by-pte cases, page_vma_mapped_walk() falls short by returning false prematurely when !pmd_present or !pud_present or !p4d_present or !pgd_present: there are cases when a huge page may span the boundary, with ptes present in the next. Restructure page_vma_mapped_walk() as a loop to continue in these cases, while keeping its layout much as before. Add a step_forward() helper to advance pvmw->address across those boundaries: originally I tried to use mm's standard p?d_addr_end() macros, but hit the same crash 512 times less often: because of the way redundant levels are folded together, but folded differently in different configurations, it was just too difficult to use them correctly; and step_forward() is simpler anyway. Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm: page_vma_mapped_walk(): get vma_address_end() earlierHugh Dickins1-4/+9
commit a765c417d876cc635f628365ec9aa6f09470069a upstream. page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the start, rather than later at next_pte. It's a little unnecessary overhead on the first call, but makes for a simpler loop in the following commit. Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm: page_vma_mapped_walk(): use goto instead of while (1)Hugh Dickins1-4/+3
commit 474466301dfd8b39a10c01db740645f3f7ae9a28 upstream. page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte, and use "goto this_pte", in place of the "while (1)" loop at the end. Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm: page_vma_mapped_walk(): add a level of indentationHugh Dickins1-50/+55
commit b3807a91aca7d21c05d5790612e49969117a72b9 upstream. page_vma_mapped_walk() cleanup: add a level of indentation to much of the body, making no functional change in this commit, but reducing the later diff when this is all converted to a loop. [hughd@google.com: : page_vma_mapped_walk(): add a level of indentation fix] Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm: page_vma_mapped_walk(): crossing page table boundaryHugh Dickins1-4/+4
commit 448282487483d6fa5b2eeeafaa0acc681e544a9c upstream. page_vma_mapped_walk() cleanup: adjust the test for crossing page table boundary - I believe pvmw->address is always page-aligned, but nothing else here assumed that; and remember to reset pvmw->pte to NULL after unmapping the page table, though I never saw any bug from that. Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION blockHugh Dickins1-16/+14
commit e2e1d4076c77b3671cf8ce702535ae7dee3acf89 upstream. page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to follow the same "return not_found, return not_found, return true" pattern as the block above it (note: returning not_found there is never premature, since existence or prior existence of huge pmd guarantees good alignment). Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm: page_vma_mapped_walk(): use pmde for *pvmw->pmdHugh Dickins1-5/+6
commit 3306d3119ceacc43ea8b141a73e21fea68eec30c upstream. page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then use it in subsequent tests, instead of repeatedly dereferencing pointer. Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm: page_vma_mapped_walk(): settle PageHuge on entryHugh Dickins1-4/+8
commit 6d0fd5987657cb0c9756ce684e3a74c0f6351728 upstream. page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of the way at the start, so no need to worry about it later. Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm: page_vma_mapped_walk(): use page for pvmw->pageHugh Dickins1-5/+4
commit f003c03bd29e6f46fef1b9a8e8d636ac732286d5 upstream. Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes". I've marked all of these for stable: many are merely cleanups, but I think they are much better before the main fix than after. This patch (of 11): page_vma_mapped_walk() cleanup: sometimes the local copy of pvwm->page was used, sometimes pvmw->page itself: use the local copy "page" throughout. Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for splitYang Shi1-17/+7
commit 504e070dc08f757bccaed6d05c0f53ecbfac8a23 upstream. When debugging the bug reported by Wang Yugui [1], try_to_unmap() may fail, but the first VM_BUG_ON_PAGE() just checks page_mapcount() however it may miss the failure when head page is unmapped but other subpage is mapped. Then the second DEBUG_VM BUG() that check total mapcount would catch it. This may incur some confusion. As this is not a fatal issue, so consolidate the two DEBUG_VM checks into one VM_WARN_ON_ONCE_PAGE(). [1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/ Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()Hugh Dickins3-24/+63
commit 22061a1ffabdb9c3385de159c5db7aac3a4df1cc upstream. There is a race between THP unmapping and truncation, when truncate sees pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it, but before its page_remove_rmap() gets to decrement compound_mapcount: generating false "BUG: Bad page cache" reports that the page is still mapped when deleted. This commit fixes that, but not in the way I hoped. The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK) instead of unmap_mapping_range() in truncate_cleanup_page(): it has often been an annoyance that we usually call unmap_mapping_range() with no pages locked, but there apply it to a single locked page. try_to_unmap() looks more suitable for a single locked page. However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page): it is used to insert THP migration entries, but not used to unmap THPs. Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB needs are different, I'm too ignorant of the DAX cases, and couldn't decide how far to go for anon+swap. Set that aside. The second attempt took a different tack: make no change in truncate.c, but modify zap_huge_pmd() to insert an invalidated huge pmd instead of clearing it initially, then pmd_clear() between page_remove_rmap() and unlocking at the end. Nice. But powerpc blows that approach out of the water, with its serialize_against_pte_lookup(), and interesting pgtable usage. It would need serious help to get working on powerpc (with a minor optimization issue on s390 too). Set that aside. Just add an "if (page_mapped(page)) synchronize_rcu();" or other such delay, after unmapping in truncate_cleanup_page()? Perhaps, but though that's likely to reduce or eliminate the number of incidents, it would give less assurance of whether we had identified the problem correctly. This successful iteration introduces "unmap_mapping_page(page)" instead of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route, with an addition to details. Then zap_pmd_range() watches for this case, and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but safe. Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to assert its interface; but currently that's only used to make sure that page->mapping is stable, and zap_pmd_range() doesn't care if the page is locked or not. Along these lines, in invalidate_inode_pages2_range() move the initial unmap_mapping_range() out from under page lock, before then calling unmap_mapping_page() under page lock if still mapped. Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com Fixes: fc127da085c2 ("truncate: handle file thp") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm/thp: fix page_address_in_vma() on file THP tailsJue Wang1-4/+4
commit 31657170deaf1d8d2f6a1955fbc6fa9d228be036 upstream. Anon THP tails were already supported, but memory-failure may need to use page_address_in_vma() on file THP tails, which its page->mapping check did not permit: fix it. hughd adds: no current usage is known to hit the issue, but this does fix a subtle trap in a general helper: best fixed in stable sooner than later. Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Signed-off-by: Jue Wang <juew@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm/thp: fix vma_address() if virtual address below file offsetHugh Dickins3-32/+53
commit 494334e43c16d63b878536a26505397fce6ff3a2 upstream. Running certain tests with a DEBUG_VM kernel would crash within hours, on the total_mapcount BUG() in split_huge_page_to_list(), while trying to free up some memory by punching a hole in a shmem huge page: split's try_to_unmap() was unable to find all the mappings of the page (which, on a !DEBUG_VM kernel, would then keep the huge page pinned in memory). When that BUG() was changed to a WARN(), it would later crash on the VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in mm/internal.h:vma_address(), used by rmap_walk_file() for try_to_unmap(). vma_address() is usually correct, but there's a wraparound case when the vm_start address is unusually low, but vm_pgoff not so low: vma_address() chooses max(start, vma->vm_start), but that decides on the wrong address, because start has become almost ULONG_MAX. Rewrite vma_address() to be more careful about vm_pgoff; move the VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can be safely used from page_mapped_in_vma() and page_address_in_vma() too. Add vma_address_end() to apply similar care to end address calculation, in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one(); though it raises a question of whether callers would do better to supply pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch. An irritation is that their apparent generality breaks down on KSM pages, which cannot be located by the page->index that page_to_pgoff() uses: as commit 4b0ece6fa016 ("mm: migrate: fix remove_migration_pte() for ksm pages") once discovered. I dithered over the best thing to do about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both vma_address() and vma_address_end(); though the only place in danger of using it on them was try_to_unmap_one(). Sidenote: vma_address() and vma_address_end() now use compound_nr() on a head page, instead of thp_size(): to make the right calculation on a hugetlbfs page, whether or not THPs are configured. try_to_unmap() is used on hugetlbfs pages, but perhaps the wrong calculation never mattered. Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm/thp: try_to_unmap() use TTU_SYNC for safe splittingHugh Dickins4-2/+29
commit 732ed55823fc3ad998d43b86bf771887bcc5ec67 upstream. Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE (!unmap_success): with dump_page() showing mapcount:1, but then its raw struct page output showing _mapcount ffffffff i.e. mapcount 0. And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed, it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)), and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG(): all indicative of some mapcount difficulty in development here perhaps. But the !CONFIG_DEBUG_VM path handles the failures correctly and silently. I believe the problem is that once a racing unmap has cleared pte or pmd, try_to_unmap_one() may skip taking the page table lock, and emerge from try_to_unmap() before the racing task has reached decrementing mapcount. Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC to the options, and passing that from unmap_page(). When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same for both: the slight overhead added should rarely matter, except perhaps if splitting sparsely-populated multiply-mapped shmem. Once confident that bugs are fixed, TTU_SYNC here can be removed, and the race tolerated. Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm/thp: make is_huge_zero_pmd() safe and quickerHugh Dickins2-2/+11
commit 3b77e8c8cde581dadab9a0f1543a347e24315f11 upstream. Most callers of is_huge_zero_pmd() supply a pmd already verified present; but a few (notably zap_huge_pmd()) do not - it might be a pmd migration entry, in which the pfn is encoded differently from a present pmd: which might pass the is_huge_zero_pmd() test (though not on x86, since L1TF forced us to protect against that); or perhaps even crash in pmd_page() applied to a swap-like entry. Make it safe by adding pmd_present() check into is_huge_zero_pmd() itself; and make it quicker by saving huge_zero_pfn, so that is_huge_zero_pmd() will not need to do that pmd_page() lookup each time. __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked, but is unnecessary now that is_huge_zero_pmd() checks present. Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm/thp: fix __split_huge_pmd_locked() on shmem migration entryHugh Dickins2-12/+20
commit 99fa8a48203d62b3743d866fc48ef6abaee682be upstream. Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10. Here is v2 batch of long-standing THP bug fixes that I had not got around to sending before, but prompted now by Wang Yugui's report https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/ Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and they have done no harm, but have *not* fixed that issue: something more is needed and I have no idea of what. This patch (of 7): Stressing huge tmpfs page migration racing hole punch often crashed on the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y kernel; or shortly afterwards, on a bad dereference in __split_huge_pmd_locked() when DEBUG_VM=n. They forgot to allow for pmd migration entries in the non-anonymous case. Full disclosure: those particular experiments were on a kernel with more relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the vanilla kernel: it is conceivable that stricter locking happens to avoid those cases, or makes them less likely; but __split_huge_pmd_locked() already allowed for pmd migration entries when handling anonymous THPs, so this commit brings the shmem and file THP handling into line. And while there: use old_pmd rather than _pmd, as in the following blocks; and make it clearer to the eye that the !vma_is_anonymous() block is self-contained, making an early return after accounting for unmapping. Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jue Wang <juew@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Shakeel Butt <shakeelb@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm, thp: use head page in __migration_entry_wait()Xu Yu1-0/+1
commit ffc90cbb2970ab88b66ea51dd580469eede57b67 upstream. We notice that hung task happens in a corner but practical scenario when CONFIG_PREEMPT_NONE is enabled, as follows. Process 0 Process 1 Process 2..Inf split_huge_page_to_list unmap_page split_huge_pmd_address __migration_entry_wait(head) __migration_entry_wait(tail) remap_page (roll back) remove_migration_ptes rmap_walk_anon cond_resched Where __migration_entry_wait(tail) is occurred in kernel space, e.g., copy_to_user in fstat, which will immediately fault again without rescheduling, and thus occupy the cpu fully. When there are too many processes performing __migration_entry_wait on tail page, remap_page will never be done after cond_resched. This makes __migration_entry_wait operate on the compound head page, thus waits for remap_page to complete, whether the THP is split successfully or roll back. Note that put_and_wait_on_page_locked helps to drop the page reference acquired with get_page_unless_zero, as soon as the page is on the wait queue, before actually waiting. So splitting the THP is only prevented for a brief interval. Link: https://lkml.kernel.org/r/b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split") Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30mm/memory-failure: use a mutex to avoid memory_failure() racesTony Luck1-13/+23
commit 171936ddaf97e6f4e1264f4128bb5cf15691339c upstream. Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5. I wrote this patchset to materialize what I think is the current allowable solution mentioned by the previous discussion [1]. I simply borrowed Tony's mutex patch and Aili's return code patch, then I queued another one to find error virtual address in the best effort manner. I know that this is not a perfect solution, but should work for some typical case. [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/ This patch (of 2): There can be races when multiple CPUs consume poison from the same page. The first into memory_failure() atomically sets the HWPoison page flag and begins hunting for tasks that map this page. Eventually it invalidates those mappings and may send a SIGBUS to the affected tasks. But while all that work is going on, other CPUs see a "success" return code from memory_failure() and so they believe the error has been handled and continue executing. Fix by wrapping most of the internal parts of memory_failure() in a mutex. [akpm@linux-foundation.org: make mf_mutex local to memory_failure()] Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30x86/fpu: Make init_fpstate correct with optimized XSAVEThomas Gleixner2-25/+46
commit f9dfb5e390fab2df9f7944bb91e7705aba14cd26 upstream. The XSAVE init code initializes all enabled and supported components with XRSTOR(S) to init state. Then it XSAVEs the state of the components back into init_fpstate which is used in several places to fill in the init state of components. This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because those use the init optimization and skip writing state of components which are in init state. So init_fpstate.xsave still contains all zeroes after this operation. There are two ways to solve that: 1) Use XSAVE unconditionally, but that requires to reshuffle the buffer when XSAVES is enabled because XSAVES uses compacted format. 2) Save the components which are known to have a non-zero init state by other means. Looking deeper, #2 is the right thing to do because all components the kernel supports have all-zeroes init state except the legacy features (FP, SSE). Those cannot be hard coded because the states are not identical on all CPUs, but they can be saved with FXSAVE which avoids all conditionals. Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with a BUILD_BUG_ON() which reminds developers to validate that a newly added component has all zeroes init state. As a bonus remove the now unused copy_xregs_to_kernel_booting() crutch. The XSAVE and reshuffle method can still be implemented in the unlikely case that components are added which have a non-zero init state and no other means to save them. For now, FXSAVE is just simple and good enough. [ bp: Fix a typo or two in the text. ] Fixes: 6bad06b76892 ("x86, xsave: Use xsaveopt in context-switch path when supported") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate()Thomas Gleixner1-18/+8
commit 9301982c424a003c0095bf157154a85bf5322bd0 upstream. sanitize_restored_user_xstate() preserves the supervisor states only when the fx_only argument is zero, which allows unprivileged user space to put supervisor states back into init state. Preserve them unconditionally. [ bp: Fix a typo or two in the text. ] Fixes: 5d6b6a6f9b5c ("x86/fpu/xstate: Update sanitize_restored_xstate() for supervisor xstates") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210618143444.438635017@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30kthread: prevent deadlock when kthread_mod_delayed_work() races with ↵Petr Mladek1-11/+24
kthread_cancel_delayed_work_sync() commit 5fa54346caf67b4b1b10b1f390316ae466da4d53 upstream. The system might hang with the following backtrace: schedule+0x80/0x100 schedule_timeout+0x48/0x138 wait_for_common+0xa4/0x134 wait_for_completion+0x1c/0x2c kthread_flush_work+0x114/0x1cc kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144 kthread_cancel_delayed_work_sync+0x18/0x2c xxxx_pm_notify+0xb0/0xd8 blocking_notifier_call_chain_robust+0x80/0x194 pm_notifier_call_chain_robust+0x28/0x4c suspend_prepare+0x40/0x260 enter_state+0x80/0x3f4 pm_suspend+0x60/0xdc state_store+0x108/0x144 kobj_attr_store+0x38/0x88 sysfs_kf_write+0x64/0xc0 kernfs_fop_write_iter+0x108/0x1d0 vfs_write+0x2f4/0x368 ksys_write+0x7c/0xec It is caused by the following race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(): CPU0 CPU1 Context: Thread A Context: Thread B kthread_mod_delayed_work() spin_lock() __kthread_cancel_work() spin_unlock() del_timer_sync() kthread_cancel_delayed_work_sync() spin_lock() __kthread_cancel_work() spin_unlock() del_timer_sync() spin_lock() work->canceling++ spin_unlock spin_lock() queue_delayed_work() // dwork is put into the worker->delayed_work_list spin_unlock() kthread_flush_work() // flush_work is put at the tail of the dwork wait_for_completion() Context: IRQ kthread_delayed_work_timer_fn() spin_lock() list_del_init(&work->node); spin_unlock() BANG: flush_work is not longer linked and will never get proceed. The problem is that kthread_mod_delayed_work() checks work->canceling flag before canceling the timer. A simple solution is to (re)check work->canceling after __kthread_cancel_work(). But then it is not clear what should be returned when __kthread_cancel_work() removed the work from the queue (list) and it can't queue it again with the new @delay. The return value might be used for reference counting. The caller has to know whether a new work has been queued or an existing one was replaced. The proper solution is that kthread_mod_delayed_work() will remove the work from the queue (list) _only_ when work->canceling is not set. The flag must be checked after the timer is stopped and the remaining operations can be done under worker->lock. Note that kthread_mod_delayed_work() could remove the timer and then bail out. It is fine. The other canceling caller needs to cancel the timer as well. The important thing is that the queue (list) manipulation is done atomically under worker->lock. Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com Fixes: 9a6b06c8d9a220860468a ("kthread: allow to modify delayed kthread work") Signed-off-by: Petr Mladek <pmladek@suse.com> Reported-by: Martin Liu <liumartin@google.com> Cc: <jenhaochen@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30kthread_worker: split code for canceling the delayed work timerPetr Mladek1-17/+29
commit 34b3d5344719d14fd2185b2d9459b3abcb8cf9d8 upstream. Patch series "kthread_worker: Fix race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync()". This patchset fixes the race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync() including proper return value handling. This patch (of 2): Simple code refactoring as a preparation step for fixing a race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(). It does not modify the existing behavior. Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: <jenhaochen@google.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30xen/events: reset active flag for lateeoi events laterJuergen Gross1-1/+10
commit 3de218ff39b9e3f0d453fe3154f12a174de44b25 upstream. In order to avoid a race condition for user events when changing cpu affinity reset the active flag only when EOI-ing the event. This is working fine as all user events are lateeoi events. Note that lateeoi_ack_mask_dynirq() is not modified as there is no explicit call to xen_irq_lateeoi() expected later. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Fixes: b6622798bc50b62 ("xen/events: avoid handling the same event on two cpus at the same time") Tested-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrvsky@oracle.com> Link: https://lore.kernel.org/r/20210623130913.9405-1-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30ceph: must hold snap_rwsem when filling inode for async createJeff Layton2-0/+5
commit 27171ae6a0fdc75571e5bf3d0961631a1e4fb765 upstream. ...and add a lockdep assertion for it to ceph_fill_inode(). Cc: stable@vger.kernel.org # v5.7+ Fixes: 9a8d03ca2e2c3 ("ceph: attempt to do async create when possible") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30i2c: robotfuzz-osif: fix control-request directionsJohan Hovold1-2/+2
commit 4ca070ef0dd885616ef294d269a9bf8e3b258e1a upstream. The direction of the pipe argument must match the request-type direction bit or control requests may fail depending on the host-controller-driver implementation. Control transfers without a data stage are treated as OUT requests by the USB stack and should be using usb_sndctrlpipe(). Failing to do so will now trigger a warning. Fix the OSIFI2C_SET_BIT_RATE and OSIFI2C_STOP requests which erroneously used the osif_usb_read() helper and set the IN direction bit. Reported-by: syzbot+9d7dadd15b8819d73f41@syzkaller.appspotmail.com Fixes: 83e53a8f120f ("i2c: Add bus driver for for OSIF USB i2c device.") Cc: stable@vger.kernel.org # 3.14 Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Wolfram Sang <wsa@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30KVM: do not allow mapping valid but non-reference-counted pagesNicholas Piggin1-2/+17
commit f8be156be163a052a067306417cd0ff679068c97 upstream. It's possible to create a region which maps valid but non-refcounted pages (e.g., tail pages of non-compound higher order allocations). These host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family of APIs, which take a reference to the page, which takes it from 0 to 1. When the reference is dropped, this will free the page incorrectly. Fix this by only taking a reference on valid pages if it was non-zero, which indicates it is participating in normal refcounting (and can be released with put_page). This addresses CVE-2021-22543. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30s390: clear pt_regs::flags on irq entrySven Schnelle1-0/+1
commit ca1f4d702d534387aa1f16379edb3b03cdb6ceda upstream. The current irq entry code doesn't initialize pt_regs::flags. On exit to user mode arch_do_signal_or_restart() tests whether PIF_SYSCALL is set, which might yield wrong results. Fix this by clearing pt_regs::flags in the entry.S irq handler code. Reported-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Fixes: 56e62a737028 ("s390: convert to generic entry") Cc: <stable@vger.kernel.org> # 5.12 Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30s390: fix system call restart with multiple signalsSven Schnelle1-1/+0
commit fc66127dc3396338f287c3b494dfbf102547e770 upstream. glibc complained with "The futex facility returned an unexpected error code.". It turned out that the futex syscall returned -ERESTARTSYS because a signal is pending. arch_do_signal_or_restart() restored the syscall parameters (nameley regs->gprs[2]) and set PIF_SYSCALL_RESTART. When another signal is made pending later in the exit loop arch_do_signal_or_restart() is called again. This function clears PIF_SYSCALL_RESTART and checks the return code which is set in regs->gprs[2]. However, regs->gprs[2] was restored in the previous run and no longer contains -ERESTARTSYS, so PIF_SYSCALL_RESTART isn't set again and the syscall is skipped. Fix this by not clearing PIF_SYSCALL_RESTART - it is already cleared in __do_syscall() when the syscall is restarted. Reported-by: Bjoern Walk <bwalk@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Fixes: 56e62a737028 ("s390: convert to generic entry") Cc: <stable@vger.kernel.org> # 5.12 Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30s390/stack: fix possible register corruption with stack switch helperHeiko Carstens1-7/+11
commit 67147e96a332b56c7206238162771d82467f86c0 upstream. The CALL_ON_STACK macro is used to call a C function from inline assembly, and therefore must consider the C ABI, which says that only registers 6-13, and 15 are non-volatile (restored by the called function). The inline assembly incorrectly marks all registers used to pass parameters to the called function as read-only input operands, instead of operands that are read and written to. This might result in register corruption depending on usage, compiler, and compile options. Fix this by marking all operands used to pass parameters as read/write operands. To keep the code simple even register 6, if used, is marked as read-write operand. Fixes: ff340d2472ec ("s390: add stack switch helper") Cc: <stable@kernel.org> # 4.20 Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30s390/topology: clear thread/group maps for offline cpusSven Schnelle1-3/+9
commit 9e3d62d55bf455d4f9fdf2ede5c8756410c64102 upstream. The current code doesn't clear the thread/group maps for offline CPUs. This may cause kernel crashes like the one bewlow in common code that assumes if a CPU has sibblings it is online. Unable to handle kernel pointer dereference in virtual kernel address space Call Trace: [<000000013a4b8c3c>] blk_mq_map_swqueue+0x10c/0x388 ([<000000013a4b8bcc>] blk_mq_map_swqueue+0x9c/0x388) [<000000013a4b9300>] blk_mq_init_allocated_queue+0x448/0x478 [<000000013a4b9416>] blk_mq_init_queue+0x4e/0x90 [<000003ff8019d3e6>] loop_add+0x106/0x278 [loop] [<000003ff801b8148>] loop_init+0x148/0x1000 [loop] [<0000000139de4924>] do_one_initcall+0x3c/0x1e0 [<0000000139ef449a>] do_init_module+0x6a/0x2a0 [<0000000139ef61bc>] __do_sys_finit_module+0xa4/0xc0 [<0000000139de9e6e>] do_syscall+0x7e/0xd0 [<000000013a8e0aec>] __do_syscall+0xbc/0x110 [<000000013a8ee2e8>] system_call+0x78/0xa0 Fixes: 52aeda7accb6 ("s390/topology: remove offline CPUs from CPU topology masks") Cc: <stable@kernel.org> # 5.7+ Reported-by: Marius Hillenbrand <mhillen@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30nilfs2: fix memory leak in nilfs_sysfs_delete_device_groupPavel Skripkin1-0/+1
[ Upstream commit 8fd0c1b0647a6bda4067ee0cd61e8395954b6f28 ] My local syzbot instance hit memory leak in nilfs2. The problem was in missing kobject_put() in nilfs_sysfs_delete_device_group(). kobject_del() does not call kobject_cleanup() for passed kobject and it leads to leaking duped kobject name if kobject_put() was not called. Fail log: BUG: memory leak unreferenced object 0xffff8880596171e0 (size 8): comm "syz-executor379", pid 8381, jiffies 4294980258 (age 21.100s) hex dump (first 8 bytes): 6c 6f 6f 70 30 00 00 00 loop0... backtrace: kstrdup+0x36/0x70 mm/util.c:60 kstrdup_const+0x53/0x80 mm/util.c:83 kvasprintf_const+0x108/0x190 lib/kasprintf.c:48 kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289 kobject_add_varg lib/kobject.c:384 [inline] kobject_init_and_add+0xc9/0x160 lib/kobject.c:473 nilfs_sysfs_create_device_group+0x150/0x800 fs/nilfs2/sysfs.c:999 init_nilfs+0xe26/0x12b0 fs/nilfs2/the_nilfs.c:637 Link: https://lkml.kernel.org/r/20210612140559.20022-1-paskripkin@gmail.com Fixes: da7141fb78db ("nilfs2: add /sys/fs/nilfs2/<device> group") Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>