From 2f0584f3f4bd60bcc8735172981fb0bff86e74e0 Mon Sep 17 00:00:00 2001 From: Rick Edgecombe Date: Mon, 12 Jun 2023 17:10:27 -0700 Subject: mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma() The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). The goal is to make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable or shadow stack mappings. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Start the process by renaming pte_mkwrite() to pte_mkwrite_novma() and adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same pattern for pmd_mkwrite(). Since not all archs have a pmd_mkwrite_novma(), create a new arch config HAS_HUGE_PAGE that can be used to tell if pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE cases the compiler would not be able to find pmd_mkwrite_novma(). No functional change. Suggested-by: Linus Torvalds Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Mike Rapoport (IBM) Acked-by: Geert Uytterhoeven Acked-by: David Hildenbrand Link: https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/ Link: https://lore.kernel.org/all/20230613001108.3040476-2-rick.p.edgecombe%40intel.com --- include/linux/pgtable.h | 14 ++++++++++++++ 1 file changed, 14 insertions(+) (limited to 'include/linux') diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 5063b482e34f..e6ea6e0d7d8d 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -515,6 +515,20 @@ extern pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, pud_t *pudp); #endif +#ifndef pte_mkwrite +static inline pte_t pte_mkwrite(pte_t pte) +{ + return pte_mkwrite_novma(pte); +} +#endif + +#if defined(CONFIG_ARCH_WANT_PMD_MKWRITE) && !defined(pmd_mkwrite) +static inline pmd_t pmd_mkwrite(pmd_t pmd) +{ + return pmd_mkwrite_novma(pmd); +} +#endif + #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT struct mm_struct; static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep) -- cgit v1.2.3 From 161e393c0f63592a3b95bdd8b55752653763fc6d Mon Sep 17 00:00:00 2001 From: Rick Edgecombe Date: Mon, 12 Jun 2023 17:10:29 -0700 Subject: mm: Make pte_mkwrite() take a VMA The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. 
These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). The goal is to make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable or shadow stack mappings. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Previous patches renamed pte_mkwrite() to pte_mkwrite_novma() and converted the callers that don't have a VMA to use pte_mkwrite_novma(). So now change pte_mkwrite() to take a VMA and change the remaining callers to pass a VMA. Apply the same changes for pmd_mkwrite(). No functional change. Suggested-by: David Hildenbrand Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Mike Rapoport (IBM) Acked-by: David Hildenbrand Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com --- Documentation/mm/arch_pgtable_helpers.rst | 6 ++++-- include/linux/mm.h | 2 +- include/linux/pgtable.h | 4 ++-- mm/debug_vm_pgtable.c | 12 ++++++------ mm/huge_memory.c | 10 +++++----- mm/memory.c | 4 ++-- mm/migrate.c | 2 +- mm/migrate_device.c | 2 +- mm/mprotect.c | 2 +- mm/userfaultfd.c | 2 +- 10 files changed, 24 insertions(+), 22 deletions(-) (limited to 'include/linux') diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst index 69ce1f2aa4d1..c82e3ee20e51 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -46,7 +46,8 @@ PTE Page Table Helpers +---------------------------+--------------------------------------------------+ | pte_mkclean | Creates a clean PTE | +---------------------------+--------------------------------------------------+ -| pte_mkwrite | Creates a writable PTE | +| pte_mkwrite | Creates a writable PTE of the type specified by | +| | the VMA. | +---------------------------+--------------------------------------------------+ | pte_mkwrite_novma | Creates a writable PTE, of the conventional type | | | of writable. | @@ -121,7 +122,8 @@ PMD Page Table Helpers +---------------------------+--------------------------------------------------+ | pmd_mkclean | Creates a clean PMD | +---------------------------+--------------------------------------------------+ -| pmd_mkwrite | Creates a writable PMD | +| pmd_mkwrite | Creates a writable PMD of the type specified by | +| | the VMA. | +---------------------------+--------------------------------------------------+ | pmd_mkwrite_novma | Creates a writable PMD, of the conventional type | | | of writable. 
| diff --git a/include/linux/mm.h b/include/linux/mm.h index 2dd73e4f3d8e..d40fa0feb9dc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1277,7 +1277,7 @@ void free_compound_page(struct page *page); static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) { if (likely(vma->vm_flags & VM_WRITE)) - pte = pte_mkwrite(pte); + pte = pte_mkwrite(pte, vma); return pte; } diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e6ea6e0d7d8d..9462f4a87d42 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -516,14 +516,14 @@ extern pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, #endif #ifndef pte_mkwrite -static inline pte_t pte_mkwrite(pte_t pte) +static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma) { return pte_mkwrite_novma(pte); } #endif #if defined(CONFIG_ARCH_WANT_PMD_MKWRITE) && !defined(pmd_mkwrite) -static inline pmd_t pmd_mkwrite(pmd_t pmd) +static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) { return pmd_mkwrite_novma(pmd); } diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index ee119e33fef1..b457ca17cef7 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -109,10 +109,10 @@ static void __init pte_basic_tests(struct pgtable_debug_args *args, int idx) WARN_ON(!pte_same(pte, pte)); WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte)))); WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte)))); - WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte)))); + WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte), args->vma))); WARN_ON(pte_young(pte_mkold(pte_mkyoung(pte)))); WARN_ON(pte_dirty(pte_mkclean(pte_mkdirty(pte)))); - WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte)))); + WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte, args->vma)))); WARN_ON(pte_dirty(pte_wrprotect(pte_mkclean(pte)))); WARN_ON(!pte_dirty(pte_wrprotect(pte_mkdirty(pte)))); } @@ -156,7 +156,7 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args) pte = pte_mkclean(pte); set_pte_at(args->mm, args->vaddr, args->ptep, pte); flush_dcache_page(page); - pte = pte_mkwrite(pte); + pte = pte_mkwrite(pte, args->vma); pte = pte_mkdirty(pte); ptep_set_access_flags(args->vma, args->vaddr, args->ptep, pte, 1); pte = ptep_get(args->ptep); @@ -202,10 +202,10 @@ static void __init pmd_basic_tests(struct pgtable_debug_args *args, int idx) WARN_ON(!pmd_same(pmd, pmd)); WARN_ON(!pmd_young(pmd_mkyoung(pmd_mkold(pmd)))); WARN_ON(!pmd_dirty(pmd_mkdirty(pmd_mkclean(pmd)))); - WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd)))); + WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd), args->vma))); WARN_ON(pmd_young(pmd_mkold(pmd_mkyoung(pmd)))); WARN_ON(pmd_dirty(pmd_mkclean(pmd_mkdirty(pmd)))); - WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd)))); + WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd, args->vma)))); WARN_ON(pmd_dirty(pmd_wrprotect(pmd_mkclean(pmd)))); WARN_ON(!pmd_dirty(pmd_wrprotect(pmd_mkdirty(pmd)))); /* @@ -256,7 +256,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args) pmd = pmd_mkclean(pmd); set_pmd_at(args->mm, vaddr, args->pmdp, pmd); flush_dcache_page(page); - pmd = pmd_mkwrite(pmd); + pmd = pmd_mkwrite(pmd, args->vma); pmd = pmd_mkdirty(pmd); pmdp_set_access_flags(args->vma, vaddr, args->pmdp, pmd, 1); pmd = READ_ONCE(*args->pmdp); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index eb3678360b97..23c2aa612926 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -551,7 +551,7 @@ __setup("transparent_hugepage=", setup_transparent_hugepage); pmd_t maybe_pmd_mkwrite(pmd_t 
pmd, struct vm_area_struct *vma) { if (likely(vma->vm_flags & VM_WRITE)) - pmd = pmd_mkwrite(pmd); + pmd = pmd_mkwrite(pmd, vma); return pmd; } @@ -1572,7 +1572,7 @@ out_map: pmd = pmd_modify(oldpmd, vma->vm_page_prot); pmd = pmd_mkyoung(pmd); if (writable) - pmd = pmd_mkwrite(pmd); + pmd = pmd_mkwrite(pmd, vma); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); spin_unlock(vmf->ptl); @@ -1925,7 +1925,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, /* See change_pte_range(). */ if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) && can_change_pmd_writable(vma, addr, entry)) - entry = pmd_mkwrite(entry); + entry = pmd_mkwrite(entry, vma); ret = HPAGE_PMD_NR; set_pmd_at(mm, addr, pmd, entry); @@ -2243,7 +2243,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } else { entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); if (write) - entry = pte_mkwrite(entry); + entry = pte_mkwrite(entry, vma); if (anon_exclusive) SetPageAnonExclusive(page + i); if (!young) @@ -3287,7 +3287,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) if (pmd_swp_soft_dirty(*pvmw->pmd)) pmde = pmd_mksoft_dirty(pmde); if (is_writable_migration_entry(entry)) - pmde = pmd_mkwrite(pmde); + pmde = pmd_mkwrite(pmde, vma); if (pmd_swp_uffd_wp(*pvmw->pmd)) pmde = pmd_mkuffd_wp(pmde); if (!is_migration_entry_young(entry)) diff --git a/mm/memory.c b/mm/memory.c index 01f39e8144ef..f093c73512c5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4119,7 +4119,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) entry = mk_pte(&folio->page, vma->vm_page_prot); entry = pte_sw_mkyoung(entry); if (vma->vm_flags & VM_WRITE) - entry = pte_mkwrite(pte_mkdirty(entry)); + entry = pte_mkwrite(pte_mkdirty(entry), vma); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); @@ -4808,7 +4808,7 @@ out_map: pte = pte_modify(old_pte, vma->vm_page_prot); pte = pte_mkyoung(pte); if (writable) - pte = pte_mkwrite(pte); + pte = pte_mkwrite(pte, vma); ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); update_mmu_cache(vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); diff --git a/mm/migrate.c b/mm/migrate.c index 24baad2571e3..18f58b7e0aff 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -220,7 +220,7 @@ static bool remove_migration_pte(struct folio *folio, if (folio_test_dirty(folio) && is_migration_entry_dirty(entry)) pte = pte_mkdirty(pte); if (is_writable_migration_entry(entry)) - pte = pte_mkwrite(pte); + pte = pte_mkwrite(pte, vma); else if (pte_swp_uffd_wp(old_pte)) pte = pte_mkuffd_wp(pte); diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 8365158460ed..df280aa461e2 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -623,7 +623,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, } entry = mk_pte(page, vma->vm_page_prot); if (vma->vm_flags & VM_WRITE) - entry = pte_mkwrite(pte_mkdirty(entry)); + entry = pte_mkwrite(pte_mkdirty(entry), vma); } ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); diff --git a/mm/mprotect.c b/mm/mprotect.c index 6f658d483704..b342e0196e01 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -185,7 +185,7 @@ static long change_pte_range(struct mmu_gather *tlb, if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pte_write(ptent) && can_change_pte_writable(vma, addr, ptent)) - ptent = pte_mkwrite(ptent); + ptent = pte_mkwrite(ptent, vma); ptep_modify_prot_commit(vma, addr, 
pte, oldpte, ptent); if (pte_needs_flush(oldpte, ptent)) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index a2bf37ee276d..b322ac54ea20 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -72,7 +72,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, if (page_in_cache && !vm_shared) writable = false; if (writable) - _dst_pte = pte_mkwrite(_dst_pte); + _dst_pte = pte_mkwrite(_dst_pte, dst_vma); if (flags & MFILL_ATOMIC_WP) _dst_pte = pte_mkuffd_wp(_dst_pte); -- cgit v1.2.3 From 592b5fad1677aa98a578ae50eb81d7383752c9c8 Mon Sep 17 00:00:00 2001 From: Yu-cheng Yu Date: Mon, 12 Jun 2023 17:10:30 -0700 Subject: mm: Re-introduce vm_flags to do_mmap() There was no more caller passing vm_flags to do_mmap(), and vm_flags was removed from the function's input by: commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()"). There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to do_mmap(). Thus, re-introduce vm_flags to do_mmap(). Co-developed-by: Rick Edgecombe Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Peter Collingbourne Reviewed-by: Kees Cook Reviewed-by: Kirill A. Shutemov Reviewed-by: Mark Brown Acked-by: Mike Rapoport (IBM) Acked-by: David Hildenbrand Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Tested-by: Mark Brown Link: https://lore.kernel.org/all/20230613001108.3040476-5-rick.p.edgecombe%40intel.com --- fs/aio.c | 2 +- include/linux/mm.h | 3 ++- ipc/shm.c | 2 +- mm/mmap.c | 10 +++++----- mm/nommu.c | 4 ++-- mm/util.c | 2 +- 6 files changed, 12 insertions(+), 11 deletions(-) (limited to 'include/linux') diff --git a/fs/aio.c b/fs/aio.c index 77e33619de40..c7c89181cf9f 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -558,7 +558,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size, PROT_READ | PROT_WRITE, - MAP_SHARED, 0, &unused, NULL); + MAP_SHARED, 0, 0, &unused, NULL); mmap_write_unlock(mm); if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index d40fa0feb9dc..f9a627c492f2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3176,7 +3176,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr, struct list_head *uf); extern unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, - unsigned long pgoff, unsigned long *populate, struct list_head *uf); + vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, + struct list_head *uf); extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf, bool unlock); diff --git a/ipc/shm.c b/ipc/shm.c index 60e45e7045d4..576a543b7cff 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -1662,7 +1662,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, goto invalid; } - addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL); + addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL); *raddr = addr; err = 0; if (IS_ERR_VALUE(addr)) diff --git a/mm/mmap.c b/mm/mmap.c index 3eda23c9ebe7..4900f7471820 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1189,11 +1189,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode, */ unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, - unsigned long flags, unsigned long pgoff, - 
unsigned long *populate, struct list_head *uf) + unsigned long flags, vm_flags_t vm_flags, + unsigned long pgoff, unsigned long *populate, + struct list_head *uf) { struct mm_struct *mm = current->mm; - vm_flags_t vm_flags; int pkey = 0; validate_mm(mm); @@ -1254,7 +1254,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, * to. we assume access permissions have been handled by the open * of the memory object, so we don't do any here. */ - vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) | + vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) | mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC; if (flags & MAP_LOCKED) @@ -2995,7 +2995,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, file = get_file(vma->vm_file); ret = do_mmap(vma->vm_file, start, size, - prot, flags, pgoff, &populate, NULL); + prot, flags, 0, pgoff, &populate, NULL); fput(file); out: mmap_write_unlock(mm); diff --git a/mm/nommu.c b/mm/nommu.c index c072a660ec2c..fe19614b9c19 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1015,6 +1015,7 @@ unsigned long do_mmap(struct file *file, unsigned long len, unsigned long prot, unsigned long flags, + vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, struct list_head *uf) @@ -1022,7 +1023,6 @@ unsigned long do_mmap(struct file *file, struct vm_area_struct *vma; struct vm_region *region; struct rb_node *rb; - vm_flags_t vm_flags; unsigned long capabilities, result; int ret; VMA_ITERATOR(vmi, current->mm, 0); @@ -1042,7 +1042,7 @@ unsigned long do_mmap(struct file *file, /* we've determined that we can make the mapping, now translate what we * now know into VMA flags */ - vm_flags = determine_vm_flags(file, prot, flags, capabilities); + vm_flags |= determine_vm_flags(file, prot, flags, capabilities); /* we're going to need to record the mapping */ diff --git a/mm/util.c b/mm/util.c index dd12b9531ac4..8e7fc6cacab4 100644 --- a/mm/util.c +++ b/mm/util.c @@ -540,7 +540,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, if (!ret) { if (mmap_write_lock_killable(mm)) return -EINTR; - ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate, + ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate, &uf); mmap_write_unlock(mm); userfaultfd_unmap_complete(mm, &uf); -- cgit v1.2.3 From fb47a799cc5ccc469c63e9174f2ad555a21ba2a1 Mon Sep 17 00:00:00 2001 From: Yu-cheng Yu Date: Mon, 12 Jun 2023 17:10:31 -0700 Subject: mm: Move VM_UFFD_MINOR_BIT from 37 to 38 The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. Future patches will introduce a new VM flag VM_SHADOW_STACK that will be VM_HIGH_ARCH_BIT_5. VM_HIGH_ARCH_BIT_0 through VM_HIGH_ARCH_BIT_4 are bits 32-36, and bit 37 is the unrelated VM_UFFD_MINOR_BIT. For the sake of order, make all VM_HIGH_ARCH_BITs stay together by moving VM_UFFD_MINOR_BIT from 37 to 38. This will allow VM_SHADOW_STACK to be introduced as 37. 
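For reference, a sketch of the resulting high VM flag bit layout; the bit numbers below are taken from the hunks in this patch and from the later patch in this series that adds VM_HIGH_ARCH_BIT_5 (VM_SHADOW_STACK itself is only introduced later):

	#define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
	#define VM_HIGH_ARCH_BIT_1	33
	#define VM_HIGH_ARCH_BIT_2	34
	#define VM_HIGH_ARCH_BIT_3	35
	#define VM_HIGH_ARCH_BIT_4	36
	#define VM_HIGH_ARCH_BIT_5	37	/* reserved here for VM_SHADOW_STACK */
	#define VM_UFFD_MINOR_BIT	38	/* moved from 37 by this patch */
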
Co-developed-by: Rick Edgecombe Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Reviewed-by: Axel Rasmussen Acked-by: Mike Rapoport (IBM) Acked-by: Peter Xu Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Link: https://lore.kernel.org/all/20230613001108.3040476-6-rick.p.edgecombe%40intel.com --- include/linux/mm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index f9a627c492f2..82990f3390f6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -370,7 +370,7 @@ extern unsigned int kobjsize(const void *objp); #endif #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR -# define VM_UFFD_MINOR_BIT 37 +# define VM_UFFD_MINOR_BIT 38 # define VM_UFFD_MINOR BIT(VM_UFFD_MINOR_BIT) /* UFFD minor faults */ #else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */ # define VM_UFFD_MINOR VM_NONE -- cgit v1.2.3 From 54007f818206dc27309ca423df4c87dd160a7208 Mon Sep 17 00:00:00 2001 From: Yu-cheng Yu Date: Mon, 12 Jun 2023 17:10:40 -0700 Subject: mm: Introduce VM_SHADOW_STACK for shadow stack memory New hardware extensions implement support for shadow stack memory, such as x86 Control-flow Enforcement Technology (CET). Add a new VM flag to identify these areas, for example, to be used to properly indicate shadow stack PTEs to the hardware. Shadow stack VMA creation will be tightly controlled and limited to anonymous memory to make the implementation simpler and since that is all that is required. The solution will rely on pte_mkwrite() to create the shadow stack PTEs, so it will not be required for vm_get_page_prot() to learn how to create shadow stack memory. For this reason document that VM_SHADOW_STACK should not be mixed with VM_SHARED. Co-developed-by: Rick Edgecombe Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Reviewed-by: Kirill A. Shutemov Reviewed-by: Mark Brown Acked-by: Mike Rapoport (IBM) Acked-by: David Hildenbrand Tested-by: Mark Brown Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Link: https://lore.kernel.org/all/20230613001108.3040476-15-rick.p.edgecombe%40intel.com --- Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 3 +++ include/linux/mm.h | 8 ++++++++ 3 files changed, 12 insertions(+) (limited to 'include/linux') diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 7897a7dafcbc..6ccb57089a06 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -566,6 +566,7 @@ encoded manner. 
The codes are the following: mt arm64 MTE allocation tags are enabled um userfaultfd missing tracking uw userfaultfd wr-protect tracking + ss shadow stack page == ======================================= Note that there is no guarantee that every flag and associated mnemonic will diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 507cd4e59d07..cfab855fe7e9 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -709,6 +709,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR [ilog2(VM_UFFD_MINOR)] = "ui", #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */ +#ifdef CONFIG_X86_USER_SHADOW_STACK + [ilog2(VM_SHADOW_STACK)] = "ss", +#endif }; size_t i; diff --git a/include/linux/mm.h b/include/linux/mm.h index 82990f3390f6..f6c2ebde62b3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -319,11 +319,13 @@ extern unsigned int kobjsize(const void *objp); #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5) #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ #ifdef CONFIG_ARCH_HAS_PKEYS @@ -339,6 +341,12 @@ extern unsigned int kobjsize(const void *objp); #endif #endif /* CONFIG_ARCH_HAS_PKEYS */ +#ifdef CONFIG_X86_USER_SHADOW_STACK +# define VM_SHADOW_STACK VM_HIGH_ARCH_5 /* Should not be set with VM_SHARED */ +#else +# define VM_SHADOW_STACK VM_NONE +#endif + #if defined(CONFIG_X86) # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */ #elif defined(CONFIG_PPC) -- cgit v1.2.3 From 0266e7c53647fbc18be2d0da98d5c9e92922d866 Mon Sep 17 00:00:00 2001 From: Rick Edgecombe Date: Mon, 12 Jun 2023 17:10:42 -0700 Subject: mm: Add guard pages around a shadow stack. The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. The architecture of shadow stack constrains the ability of userspace to move the shadow stack pointer (SSP) in order to prevent corrupting or switching to other shadow stacks. The RSTORSSP instruction can move the SSP to different shadow stacks, but it requires a specially placed token in order to do this. However, the architecture does not prevent incrementing the stack pointer to wander onto an adjacent shadow stack. To prevent this in software, enforce guard pages at the beginning of shadow stack VMAs, such that there will always be a gap between adjacent shadow stacks. Make the gap big enough so that no userspace SSP changing operations (besides RSTORSSP), can move the SSP from one stack to the next. The SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and RET can move the SSP by a maximum of 8 bytes, at which point the shadow stack would be accessed. The INCSSP instruction can also increment the shadow stack pointer. It is the shadow stack analog of an instruction like: addq $0x80, %rsp However, there is one important difference between an ADD on %rsp and INCSSP. 
In addition to modifying SSP, INCSSP also reads from the memory of the first and last elements that were "popped". It can be thought of as acting like this: READ_ONCE(ssp); // read+discard top element on stack ssp += nr_to_pop * 8; // move the shadow stack READ_ONCE(ssp-8); // read+discard last popped stack element The maximum distance INCSSP can move the SSP is 2040 bytes, before it would read the memory. Therefore, a single page gap will be enough to prevent any operation from shifting the SSP to an adjacent stack, since it would have to land in the gap at least once, causing a fault. This could be accomplished by using VM_GROWSDOWN, but this has a downside. The behavior would allow shadow stacks to grow, which is unneeded and adds a strange difference to how most regular stacks work. In the maple tree code, there is some logic for retrying the unmapped area search if a guard gap is violated. This retry should happen for shadow stack guard gap violations as well. This logic currently only checks for VM_GROWSDOWN for start gaps. Since shadow stacks also have a start gap, create a new define VM_STARTGAP_FLAGS to hold all the VM flag bits that have start gaps, and make mmap use it. Co-developed-by: Yu-cheng Yu Signed-off-by: Yu-cheng Yu Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Reviewed-by: Mark Brown Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Link: https://lore.kernel.org/all/20230613001108.3040476-17-rick.p.edgecombe%40intel.com --- include/linux/mm.h | 54 ++++++++++++++++++++++++++++++++++++++++++++++++------ mm/mmap.c | 4 ++-- 2 files changed, 50 insertions(+), 8 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index f6c2ebde62b3..97eddc83d19c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -342,7 +342,36 @@ extern unsigned int kobjsize(const void *objp); #endif /* CONFIG_ARCH_HAS_PKEYS */ #ifdef CONFIG_X86_USER_SHADOW_STACK -# define VM_SHADOW_STACK VM_HIGH_ARCH_5 /* Should not be set with VM_SHARED */ +/* + * This flag should not be set with VM_SHARED because of lack of support + * core mm. It will also get a guard page. This helps userspace protect + * itself from attacks. The reasoning is as follows: + * + * The shadow stack pointer(SSP) is moved by CALL, RET, and INCSSPQ. The + * INCSSP instruction can increment the shadow stack pointer. It is the + * shadow stack analog of an instruction like: + * + * addq $0x80, %rsp + * + * However, there is one important difference between an ADD on %rsp + * and INCSSP. In addition to modifying SSP, INCSSP also reads from the + * memory of the first and last elements that were "popped". It can be + * thought of as acting like this: + * + * READ_ONCE(ssp); // read+discard top element on stack + * ssp += nr_to_pop * 8; // move the shadow stack + * READ_ONCE(ssp-8); // read+discard last popped stack element + * + * The maximum distance INCSSP can move the SSP is 2040 bytes, before + * it would read the memory. Therefore a single page gap will be enough + * to prevent any operation from shifting the SSP to an adjacent stack, + * since it would have to land in the gap at least once, causing a + * fault. + * + * Prevent using INCSSP to move the SSP between shadow stacks by + * having a PAGE_SIZE guard gap. 
+ */ +# define VM_SHADOW_STACK VM_HIGH_ARCH_5 #else # define VM_SHADOW_STACK VM_NONE #endif @@ -405,6 +434,8 @@ extern unsigned int kobjsize(const void *objp); #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS #endif +#define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK) + #ifdef CONFIG_STACK_GROWSUP #define VM_STACK VM_GROWSUP #define VM_STACK_EARLY VM_GROWSDOWN @@ -3273,15 +3304,26 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr) return mtree_load(&mm->mm_mt, addr); } +static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma) +{ + if (vma->vm_flags & VM_GROWSDOWN) + return stack_guard_gap; + + /* See reasoning around the VM_SHADOW_STACK definition */ + if (vma->vm_flags & VM_SHADOW_STACK) + return PAGE_SIZE; + + return 0; +} + static inline unsigned long vm_start_gap(struct vm_area_struct *vma) { + unsigned long gap = stack_guard_start_gap(vma); unsigned long vm_start = vma->vm_start; - if (vma->vm_flags & VM_GROWSDOWN) { - vm_start -= stack_guard_gap; - if (vm_start > vma->vm_start) - vm_start = 0; - } + vm_start -= gap; + if (vm_start > vma->vm_start) + vm_start = 0; return vm_start; } diff --git a/mm/mmap.c b/mm/mmap.c index 4900f7471820..11dcf50cb933 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1572,7 +1572,7 @@ retry: gap = mas.index; gap += (info->align_offset - gap) & info->align_mask; tmp = mas_next(&mas, ULONG_MAX); - if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possible */ + if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */ if (vm_start_gap(tmp) < gap + length - 1) { low_limit = tmp->vm_end; mas_reset(&mas); @@ -1624,7 +1624,7 @@ retry: gap -= (gap - info->align_offset) & info->align_mask; gap_end = mas.last; tmp = mas_next(&mas, ULONG_MAX); - if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possible */ + if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */ if (vm_start_gap(tmp) <= gap_end) { high_limit = vm_start_gap(tmp); mas_reset(&mas); -- cgit v1.2.3 From e5136e876581ba5b63220378e25fec9dcec7bad1 Mon Sep 17 00:00:00 2001 From: Rick Edgecombe Date: Mon, 12 Jun 2023 17:10:43 -0700 Subject: mm: Warn on shadow stack memory in wrong vma The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One sharp edge is that PTEs that are both Write=0 and Dirty=1 are treated as shadow stack by the CPU, but this combination used to be created by the kernel on x86. Previous patches have changed the kernel to now avoid creating these PTEs unless they are for shadow stack memory. In case any missed corners of the kernel are still creating PTEs like this for non-shadow stack memory, and to catch any re-introductions of the logic, warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow stack VMAs when they are being zapped. This won't catch transient cases but should have decent coverage. In order to check if a PTE is shadow stack in core mm code, add two arch breakouts arch_check_zapped_pte/pmd(). This will allow shadow stack specific code to be kept in arch/x86. Only do the check if shadow stack is supported by the CPU and configured because in rare cases older CPUs may write Dirty=1 to a Write=0 PTE. This check is handled in pte_shstk()/pmd_shstk(). 
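For reference, pte_shstk() and pmd_shstk() are x86 helpers added elsewhere in this series and are not shown in the hunks below. Based on the Write=0,Dirty=1 encoding described above, pte_shstk() is expected to look roughly like the sketch below (the exact feature flag checked, shown here as X86_FEATURE_SHSTK, is an assumption):

	static inline bool pte_shstk(pte_t pte)
	{
		/* Only meaningful when the CPU actually supports shadow stacks */
		return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
		       (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
	}
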
Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Mark Brown Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Link: https://lore.kernel.org/all/20230613001108.3040476-18-rick.p.edgecombe%40intel.com --- arch/x86/include/asm/pgtable.h | 6 ++++++ arch/x86/mm/pgtable.c | 20 ++++++++++++++++++++ include/linux/pgtable.h | 14 ++++++++++++++ mm/huge_memory.c | 1 + mm/memory.c | 1 + 5 files changed, 42 insertions(+) (limited to 'include/linux') diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 7bab1b22a354..9255b5b7a9d9 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1665,6 +1665,12 @@ static inline bool arch_has_hw_pte_young(void) return true; } +#define arch_check_zapped_pte arch_check_zapped_pte +void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte); + +#define arch_check_zapped_pmd arch_check_zapped_pmd +void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd); + #ifdef CONFIG_XEN_PV #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young static inline bool arch_has_hw_nonleaf_pmd_young(void) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 217c436acfd3..4bfbe4cb6978 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -886,3 +886,23 @@ pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) return pmd_clear_saveddirty(pmd); } + +void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte) +{ + /* + * Hardware before shadow stack can (rarely) set Dirty=1 + * on a Write=0 PTE. So the below condition + * only indicates a software bug when shadow stack is + * supported by the HW. This checking is covered in + * pte_shstk(). + */ + VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && + pte_shstk(pte)); +} + +void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd) +{ + /* See note in arch_check_zapped_pte() */ + VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && + pmd_shstk(pmd)); +} diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 9462f4a87d42..dd4637d6cfaa 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -313,6 +313,20 @@ static inline bool arch_has_hw_pte_young(void) } #endif +#ifndef arch_check_zapped_pte +static inline void arch_check_zapped_pte(struct vm_area_struct *vma, + pte_t pte) +{ +} +#endif + +#ifndef arch_check_zapped_pmd +static inline void arch_check_zapped_pmd(struct vm_area_struct *vma, + pmd_t pmd) +{ +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 23c2aa612926..554f6f82d225 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1681,6 +1681,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, */ orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd, tlb->fullmm); + arch_check_zapped_pmd(vma, orig_pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); if (vma_is_special_huge(vma)) { if (arch_needs_pgtable_deposit()) diff --git a/mm/memory.c b/mm/memory.c index f093c73512c5..36289f33327e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1430,6 +1430,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, continue; ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + arch_check_zapped_pte(vma, ptent); tlb_remove_tlb_entry(tlb, pte, addr); zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); -- cgit v1.2.3 From 
29f890d1050fc099fd578d9db844d6c0375902b6 Mon Sep 17 00:00:00 2001 From: Rick Edgecombe Date: Mon, 12 Jun 2023 17:10:46 -0700 Subject: x86/mm: Introduce MAP_ABOVE4G The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which require some core mm changes to function properly. One of the properties is that the shadow stack pointer (SSP), which is a CPU register that points to the shadow stack like the stack pointer points to the stack, can't be pointing outside of the 32 bit address space when the CPU is executing in 32 bit mode. It is desirable to prevent executing in 32 bit mode when shadow stack is enabled because the kernel can't easily support 32 bit signals. On x86 it is possible to transition to 32 bit mode without any special interaction with the kernel, by doing a "far call" to a 32 bit segment. So the shadow stack implementation can use this address space behavior as a feature, by enforcing that shadow stack memory is always mapped outside of the 32 bit address space. This way userspace will trigger a general protection fault which will in turn trigger a segfault if it tries to transition to 32 bit mode with shadow stack enabled. This provides a clean error generating border for the user if they attempt to do 32 bit mode shadow stack, rather than leave the kernel in a half working state for userspace to be surprised by. So to allow future shadow stack enabling patches to map shadow stacks out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior is pretty much like MAP_32BIT, except that it has the opposite address range. There are a few differences though. If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit syscall. Since the default search behavior is top down, the normal kaslr base can be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add its own randomization in the bottom up case. For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G both are potentially valid, so both are used. In the bottomup search path, the default behavior is already consistent with MAP_ABOVE4G since mmap base should be above 4GB. Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB. So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode with shadow stack enabled would usually segfault anyway. These are already pretty decent guard rails. But the addition of MAP_ABOVE4G is some small complexity spent to make it more complete. 
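As a rough userspace illustration (not part of this patch; the flag value comes from the uapi hunk below, and the snippet assumes an x86-64 build and a kernel carrying this series), a program could request a mapping above 4GB like this:

	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MAP_ABOVE4G
	#define MAP_ABOVE4G 0x80	/* value from arch/x86/include/uapi/asm/mman.h */
	#endif

	int main(void)
	{
		/* Anonymous mapping that the kernel should place at or above 4GB */
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_ABOVE4G, -1, 0);

		if (p != MAP_FAILED)
			printf("mapped at %p\n", p);
		return 0;
	}

The primary in-kernel user is shadow stack allocation, which passes this flag from alloc_shstk() in a later patch.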
Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Link: https://lore.kernel.org/all/20230613001108.3040476-21-rick.p.edgecombe%40intel.com --- arch/x86/include/uapi/asm/mman.h | 1 + arch/x86/kernel/sys_x86_64.c | 6 +++++- include/linux/mman.h | 4 ++++ 3 files changed, 10 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h index 775dbd3aff73..5a0256e73f1e 100644 --- a/arch/x86/include/uapi/asm/mman.h +++ b/arch/x86/include/uapi/asm/mman.h @@ -3,6 +3,7 @@ #define _ASM_X86_MMAN_H #define MAP_32BIT 0x40 /* only give out 32bit addresses */ +#define MAP_ABOVE4G 0x80 /* only map above 4GB */ #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS #define arch_calc_vm_prot_bits(prot, key) ( \ diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c index 8cc653ffdccd..c783aeb37dce 100644 --- a/arch/x86/kernel/sys_x86_64.c +++ b/arch/x86/kernel/sys_x86_64.c @@ -193,7 +193,11 @@ get_unmapped_area: info.flags = VM_UNMAPPED_AREA_TOPDOWN; info.length = len; - info.low_limit = PAGE_SIZE; + if (!in_32bit_syscall() && (flags & MAP_ABOVE4G)) + info.low_limit = SZ_4G; + else + info.low_limit = PAGE_SIZE; + info.high_limit = get_mmap_base(0); /* diff --git a/include/linux/mman.h b/include/linux/mman.h index cee1e4b566d8..40d94411d492 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -15,6 +15,9 @@ #ifndef MAP_32BIT #define MAP_32BIT 0 #endif +#ifndef MAP_ABOVE4G +#define MAP_ABOVE4G 0 +#endif #ifndef MAP_HUGE_2MB #define MAP_HUGE_2MB 0 #endif @@ -50,6 +53,7 @@ | MAP_STACK \ | MAP_HUGETLB \ | MAP_32BIT \ + | MAP_ABOVE4G \ | MAP_HUGE_2MB \ | MAP_HUGE_1GB) -- cgit v1.2.3 From c35559f94ebc3e3bc82e56e07161bb5986cd9761 Mon Sep 17 00:00:00 2001 From: Rick Edgecombe Date: Mon, 12 Jun 2023 17:11:00 -0700 Subject: x86/shstk: Introduce map_shadow_stack syscall When operating with shadow stacks enabled, the kernel will automatically allocate shadow stacks for new threads, however in some cases userspace will need additional shadow stacks. The main example of this is the ucontext family of functions, which require userspace allocating and pivoting to userspace managed stacks. Unlike most other user memory permissions, shadow stacks need to be provisioned with special data in order to be useful. They need to be setup with a restore token so that userspace can pivot to them via the RSTORSSP instruction. But, the security design of shadow stacks is that they should not be written to except in limited circumstances. This presents a problem for userspace, as to how userspace can provision this special data, without allowing for the shadow stack to be generally writable. Previously, a new PROT_SHADOW_STACK was attempted, which could be mprotect()ed from RW permissions after the data was provisioned. This was found to not be secure enough, as other threads could write to the shadow stack during the writable window. The kernel can use a special instruction, WRUSS, to write directly to userspace shadow stacks. So the solution can be that memory can be mapped as shadow stack permissions from the beginning (never generally writable in userspace), and the kernel itself can write the restore token. First, a new madvise() flag was explored, which could operate on the PROT_SHADOW_STACK memory. This had a couple of downsides: 1. 
Extra checks were needed in mprotect() to prevent writable memory from ever becoming PROT_SHADOW_STACK. 2. Extra checks/vma state were needed in the new madvise() to prevent restore tokens being written into the middle of pre-used shadow stacks. It is ideal to prevent restore tokens being added at arbitrary locations, so the check was to make sure the shadow stack had never been written to. 3. It stood out from the rest of the madvise flags, as more of direct action than a hint at future desired behavior. So rather than repurpose two existing syscalls (mmap, madvise) that don't quite fit, just implement a new map_shadow_stack syscall to allow userspace to map and setup new shadow stacks in one step. While ucontext is the primary motivator, userspace may have other unforeseen reasons to setup its own shadow stacks using the WRSS instruction. Towards this provide a flag so that stacks can be optionally setup securely for the common case of ucontext without enabling WRSS. Or potentially have the kernel set up the shadow stack in some new way. The following example demonstrates how to create a new shadow stack with map_shadow_stack: void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN); Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Link: https://lore.kernel.org/all/20230613001108.3040476-35-rick.p.edgecombe%40intel.com --- arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/x86/include/uapi/asm/mman.h | 3 ++ arch/x86/kernel/shstk.c | 59 +++++++++++++++++++++++++++++----- include/linux/syscalls.h | 1 + kernel/sys_ni.c | 1 + 5 files changed, 57 insertions(+), 8 deletions(-) (limited to 'include/linux') diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 227538b0ce80..38db4b1c291a 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -373,6 +373,7 @@ 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat +452 64 map_shadow_stack sys_map_shadow_stack # # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h index 5a0256e73f1e..8148bdddbd2c 100644 --- a/arch/x86/include/uapi/asm/mman.h +++ b/arch/x86/include/uapi/asm/mman.h @@ -13,6 +13,9 @@ ((key) & 0x8 ? 
VM_PKEY_BIT3 : 0)) #endif +/* Flags for map_shadow_stack(2) */ +#define SHADOW_STACK_SET_TOKEN (1ULL << 0) /* Set up a restore token in the shadow stack */ + #include #endif /* _ASM_X86_MMAN_H */ diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index 50733a510446..04c37b33a625 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr) return 0; } -static unsigned long alloc_shstk(unsigned long size) +static unsigned long alloc_shstk(unsigned long addr, unsigned long size, + unsigned long token_offset, bool set_res_tok) { int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G; struct mm_struct *mm = current->mm; - unsigned long addr, unused; + unsigned long mapped_addr, unused; - mmap_write_lock(mm); - addr = do_mmap(NULL, 0, size, PROT_READ, flags, - VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL); + if (addr) + flags |= MAP_FIXED_NOREPLACE; + mmap_write_lock(mm); + mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags, + VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL); mmap_write_unlock(mm); - return addr; + if (!set_res_tok || IS_ERR_VALUE(mapped_addr)) + goto out; + + if (create_rstor_token(mapped_addr + token_offset, NULL)) { + vm_munmap(mapped_addr, size); + return -EINVAL; + } + +out: + return mapped_addr; } static unsigned long adjust_shstk_size(unsigned long size) @@ -134,7 +147,7 @@ static int shstk_setup(void) return -EOPNOTSUPP; size = adjust_shstk_size(0); - addr = alloc_shstk(size); + addr = alloc_shstk(0, size, 0, false); if (IS_ERR_VALUE(addr)) return PTR_ERR((void *)addr); @@ -178,7 +191,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long cl return 0; size = adjust_shstk_size(stack_size); - addr = alloc_shstk(size); + addr = alloc_shstk(0, size, 0, false); if (IS_ERR_VALUE(addr)) return addr; @@ -398,6 +411,36 @@ static int shstk_disable(void) return 0; } +SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags) +{ + bool set_tok = flags & SHADOW_STACK_SET_TOKEN; + unsigned long aligned_size; + + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) + return -EOPNOTSUPP; + + if (flags & ~SHADOW_STACK_SET_TOKEN) + return -EINVAL; + + /* If there isn't space for a token */ + if (set_tok && size < 8) + return -ENOSPC; + + if (addr && addr < SZ_4G) + return -ERANGE; + + /* + * An overflow would result in attempting to write the restore token + * to the wrong location. Not catastrophic, but just return the right + * error code and block it. 
+ */ + aligned_size = PAGE_ALIGN(size); + if (aligned_size < size) + return -EOVERFLOW; + + return alloc_shstk(addr, aligned_size, size, set_tok); +} + long shstk_prctl(struct task_struct *task, int option, unsigned long features) { if (option == ARCH_SHSTK_LOCK) { diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 03e3d0121d5e..7f6dc0988197 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -953,6 +953,7 @@ asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long l asmlinkage long sys_cachestat(unsigned int fd, struct cachestat_range __user *cstat_range, struct cachestat __user *cstat, unsigned int flags); +asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags); /* * Architecture-specific system calls diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 781de7cc6a4e..e137c1385c56 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -274,6 +274,7 @@ COND_SYSCALL(vm86old); COND_SYSCALL(modify_ldt); COND_SYSCALL(vm86); COND_SYSCALL(kexec_file_load); +COND_SYSCALL(map_shadow_stack); /* s390 */ COND_SYSCALL(s390_pci_mmio_read); -- cgit v1.2.3 From 0ee44885fe9cf19eb3870947c8f3c275017e48a7 Mon Sep 17 00:00:00 2001 From: Rick Edgecombe Date: Mon, 12 Jun 2023 17:11:02 -0700 Subject: x86: Expose thread features in /proc/$PID/status Applications and loaders can have logic to decide whether to enable shadow stack. They usually don't report whether shadow stack has been enabled or not, so there is no way to verify whether an application actually is protected by shadow stack. Add two lines in /proc/$PID/status to report enabled and locked features. Since, this involves referring to arch specific defines in asm/prctl.h, implement an arch breakout to emit the feature lines. [Switched to CET, added to commit log] Co-developed-by: Kirill A. Shutemov Signed-off-by: Kirill A. 
Shutemov Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Borislav Petkov (AMD) Reviewed-by: Kees Cook Acked-by: Mike Rapoport (IBM) Tested-by: Pengfei Xu Tested-by: John Allen Tested-by: Kees Cook Link: https://lore.kernel.org/all/20230613001108.3040476-37-rick.p.edgecombe%40intel.com --- arch/x86/kernel/cpu/proc.c | 23 +++++++++++++++++++++++ fs/proc/array.c | 6 ++++++ include/linux/proc_fs.h | 1 + 3 files changed, 30 insertions(+) (limited to 'include/linux') diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c index 099b6f0d96bd..31c0e68f6227 100644 --- a/arch/x86/kernel/cpu/proc.c +++ b/arch/x86/kernel/cpu/proc.c @@ -4,6 +4,8 @@ #include #include #include +#include +#include #include "cpu.h" @@ -175,3 +177,24 @@ const struct seq_operations cpuinfo_op = { .stop = c_stop, .show = show_cpuinfo, }; + +#ifdef CONFIG_X86_USER_SHADOW_STACK +static void dump_x86_features(struct seq_file *m, unsigned long features) +{ + if (features & ARCH_SHSTK_SHSTK) + seq_puts(m, "shstk "); + if (features & ARCH_SHSTK_WRSS) + seq_puts(m, "wrss "); +} + +void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task) +{ + seq_puts(m, "x86_Thread_features:\t"); + dump_x86_features(m, task->thread.features); + seq_putc(m, '\n'); + + seq_puts(m, "x86_Thread_features_locked:\t"); + dump_x86_features(m, task->thread.features_locked); + seq_putc(m, '\n'); +} +#endif /* CONFIG_X86_USER_SHADOW_STACK */ diff --git a/fs/proc/array.c b/fs/proc/array.c index d35bbf35a874..2c2efbe685d8 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -431,6 +431,11 @@ static inline void task_untag_mask(struct seq_file *m, struct mm_struct *mm) seq_printf(m, "untag_mask:\t%#lx\n", mm_untag_mask(mm)); } +__weak void arch_proc_pid_thread_features(struct seq_file *m, + struct task_struct *task) +{ +} + int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { @@ -455,6 +460,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, task_cpus_allowed(m, task); cpuset_task_status_allowed(m, task); task_context_switch_counts(m, task); + arch_proc_pid_thread_features(m, task); return 0; } diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index 253f2676d93a..de407e7c3b55 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -159,6 +159,7 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns, #endif /* CONFIG_PROC_PID_ARCH_STATUS */ void arch_report_meminfo(struct seq_file *m); +void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task); #else /* CONFIG_PROC_FS */ -- cgit v1.2.3 From 87f0df7828899c552bcdde639c045983d5aeeed9 Mon Sep 17 00:00:00 2001 From: Rick Edgecombe Date: Thu, 6 Jul 2023 16:32:48 -0700 Subject: x86/shstk: Move arch detail comment out of core mm The comment around VM_SHADOW_STACK in mm.h refers to a lot of x86 specific details that don't belong in a cross arch file. Remove these out of core mm, and just leave the non-arch details. Since the comment includes some useful details that would be good to retain in the source somewhere, put the arch specifics parts in arch/x86/shstk.c near alloc_shstk(), where memory of this type is allocated. Include a reference to the existence of the x86 details near the VM_SHADOW_STACK definition mm.h. 
Signed-off-by: Rick Edgecombe Signed-off-by: Dave Hansen Reviewed-by: Mark Brown Link: https://lore.kernel.org/all/20230706233248.445713-1-rick.p.edgecombe%40intel.com --- arch/x86/kernel/shstk.c | 25 +++++++++++++++++++++++++ include/linux/mm.h | 32 ++++++-------------------------- 2 files changed, 31 insertions(+), 26 deletions(-) (limited to 'include/linux') diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index b26810c7cd1c..47f5204b0fa9 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -72,6 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr) return 0; } +/* + * VM_SHADOW_STACK will have a guard page. This helps userspace protect + * itself from attacks. The reasoning is as follows: + * + * The shadow stack pointer(SSP) is moved by CALL, RET, and INCSSPQ. The + * INCSSP instruction can increment the shadow stack pointer. It is the + * shadow stack analog of an instruction like: + * + * addq $0x80, %rsp + * + * However, there is one important difference between an ADD on %rsp + * and INCSSP. In addition to modifying SSP, INCSSP also reads from the + * memory of the first and last elements that were "popped". It can be + * thought of as acting like this: + * + * READ_ONCE(ssp); // read+discard top element on stack + * ssp += nr_to_pop * 8; // move the shadow stack + * READ_ONCE(ssp-8); // read+discard last popped stack element + * + * The maximum distance INCSSP can move the SSP is 2040 bytes, before + * it would read the memory. Therefore a single page gap will be enough + * to prevent any operation from shifting the SSP to an adjacent stack, + * since it would have to land in the gap at least once, causing a + * fault. + */ static unsigned long alloc_shstk(unsigned long addr, unsigned long size, unsigned long token_offset, bool set_res_tok) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 97eddc83d19c..8c0350c1134a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -343,33 +343,13 @@ extern unsigned int kobjsize(const void *objp); #ifdef CONFIG_X86_USER_SHADOW_STACK /* - * This flag should not be set with VM_SHARED because of lack of support - * core mm. It will also get a guard page. This helps userspace protect - * itself from attacks. The reasoning is as follows: + * VM_SHADOW_STACK should not be set with VM_SHARED because of lack of + * support core mm. * - * The shadow stack pointer(SSP) is moved by CALL, RET, and INCSSPQ. The - * INCSSP instruction can increment the shadow stack pointer. It is the - * shadow stack analog of an instruction like: - * - * addq $0x80, %rsp - * - * However, there is one important difference between an ADD on %rsp - * and INCSSP. In addition to modifying SSP, INCSSP also reads from the - * memory of the first and last elements that were "popped". It can be - * thought of as acting like this: - * - * READ_ONCE(ssp); // read+discard top element on stack - * ssp += nr_to_pop * 8; // move the shadow stack - * READ_ONCE(ssp-8); // read+discard last popped stack element - * - * The maximum distance INCSSP can move the SSP is 2040 bytes, before - * it would read the memory. Therefore a single page gap will be enough - * to prevent any operation from shifting the SSP to an adjacent stack, - * since it would have to land in the gap at least once, causing a - * fault. - * - * Prevent using INCSSP to move the SSP between shadow stacks by - * having a PAGE_SIZE guard gap. + * These VMAs will get a single end guard page. This helps userspace protect + * itself from attacks. 
A single page is enough for current shadow stack archs + * (x86). See the comments near alloc_shstk() in arch/x86/kernel/shstk.c + * for more details on the guard size. */ # define VM_SHADOW_STACK VM_HIGH_ARCH_5 #else -- cgit v1.2.3