Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Alexander Gordeev:
- Fix virtual vs physical address confusion in vmem_add_range() and
vmem_remove_range() functions
- Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h>
throughout s390 code
- Make all PSW related defines also available for assembler files.
Remove PSW_DEFAULT_KEY define from uapi for that
- When adding an undefined symbol the build still succeeds, but
userspace crashes trying to execute VDSO, because the symbol is not
resolved. Add undefined symbols check to prevent that
- Use kvmalloc_array() instead of kzalloc() for allocaton of 256k
memory when executing s390 crypto adapter IOCTL
- Add -fPIE flag to prevent decompressor misaligned symbol build error
with clang
- Use .balign instead of .align everywhere. This is a no-op for s390,
but with this there no mix in using .align and .balign anymore
- Filter out -mno-pic-data-is-text-relative flag when compiling kernel
to prevent VDSO build error
- Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS
mask directly
- Do not retry administrative requests to some s390 crypto cards, since
the firmware assumes replay attacks
- Remove most of the debug code, which is build in when kernel config
option CONFIG_ZCRYPT_DEBUG is enabled
- Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch
off the multiple devices support for the s390 zcrypt device driver
- With the conversion to generic entry machine checks are accounted to
the current context instead of irq time. As result, the STCKF
instruction at the beginning of the machine check handler and the
lowcore member are no longer required, therefore remove it
- Fix various typos found with codespell
- Minor cleanups to CPU-measurement Counter and Sampling Facilities
code
- Revert patch that removes VMEM_MAX_PHYS macro, since it causes a
regression
* tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (25 commits)
Revert "s390/mm: get rid of VMEM_MAX_PHYS macro"
s390/cpum_sf: remove check on CPU being online
s390/cpum_sf: handle casts consistently
s390/cpum_sf: remove unnecessary debug statement
s390/cpum_sf: remove parameter in call to pr_err
s390/cpum_sf: simplify function setup_pmu_cpu
s390/cpum_cf: remove unneeded debug statements
s390/entry: remove mcck clock
s390: fix various typos
s390/zcrypt: remove ZCRYPT_MULTIDEVNODES kernel config option
s390/zcrypt: do not retry administrative requests
s390/zcrypt: cleanup some debug code
s390/entry: rework entering DAT-on mode on CPU restart
s390/mm: fence off VM macros from asm and linker
s390: include linux/io.h instead of asm/io.h
s390/ptrace: make all psw related defines also available for asm
s390/ptrace: remove PSW_DEFAULT_KEY from uapi
s390/vdso: filter out mno-pic-data-is-text-relative cflag
s390: consistently use .balign instead of .align
s390/decompressor: fix misaligned symbol build error
...
|
|
This reverts commit 456be42aa713e7f83b467db66ceae779431c7d9d.
The assumption VMEM_MAX_PHYS should match ident_map_size
is wrong. At least discontiguous saved segments (DCSS)
could be loaded at addresses beyond ident_map_size and
dcssblk device driver might fail as result.
Reported-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Fix various typos found with codespell.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Include linux/io.h instead of asm/io.h everywhere. linux/io.h includes
asm/io.h, so this shouldn't cause any problems. Instead this might help for
some randconfig build errors which were reported due to some undefined io
related functions.
Also move the changed include so it stays grouped together with other
includes from the same directory.
For ctcm_mpc.c also remove not needed comments (actually questions).
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
This modifies our user mode stack expansion code to always take the
mmap_lock for writing before modifying the VM layout.
It's actually something we always technically should have done, but
because we didn't strictly need it, we were being lazy ("opportunistic"
sounds so much better, doesn't it?) about things, and had this hack in
place where we would extend the stack vma in-place without doing the
proper locking.
And it worked fine. We just needed to change vm_start (or, in the case
of grow-up stacks, vm_end) and together with some special ad-hoc locking
using the anon_vma lock and the mm->page_table_lock, it all was fairly
straightforward.
That is, it was all fine until Ruihan Li pointed out that now that the
vma layout uses the maple tree code, we *really* don't just change
vm_start and vm_end any more, and the locking really is broken. Oops.
It's not actually all _that_ horrible to fix this once and for all, and
do proper locking, but it's a bit painful. We have basically three
different cases of stack expansion, and they all work just a bit
differently:
- the common and obvious case is the page fault handling. It's actually
fairly simple and straightforward, except for the fact that we have
something like 24 different versions of it, and you end up in a maze
of twisty little passages, all alike.
- the simplest case is the execve() code that creates a new stack.
There are no real locking concerns because it's all in a private new
VM that hasn't been exposed to anybody, but lockdep still can end up
unhappy if you get it wrong.
- and finally, we have GUP and page pinning, which shouldn't really be
expanding the stack in the first place, but in addition to execve()
we also use it for ptrace(). And debuggers do want to possibly access
memory under the stack pointer and thus need to be able to expand the
stack as a special case.
None of these cases are exactly complicated, but the page fault case in
particular is just repeated slightly differently many many times. And
ia64 in particular has a fairly complicated situation where you can have
both a regular grow-down stack _and_ a special grow-up stack for the
register backing store.
So to make this slightly more manageable, the bulk of this series is to
first create a helper function for the most common page fault case, and
convert all the straightforward architectures to it.
Thus the new 'lock_mm_and_find_vma()' helper function, which ends up
being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon,
loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more
than half the architectures, we now have more shared code and avoid some
of those twisty little passages.
And largely due to this common helper function, the full diffstat of
this series ends up deleting more lines than it adds.
That still leaves eight architectures (ia64, m68k, microblaze, openrisc,
parisc, s390, sparc64 and um) that end up doing 'expand_stack()'
manually because they are doing something slightly different from the
normal pattern. Along with the couple of special cases in execve() and
GUP.
So there's a couple of patches that first create 'locked' helper
versions of the stack expansion functions, so that there's a obvious
path forward in the conversion. The execve() case is then actually
pretty simple, and is a nice cleanup from our old "grow-up stackls are
special, because at execve time even they grow down".
The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because
it's just more straightforward to write out the stack expansion there
manually, instead od having get_user_pages_remote() do it for us in some
situations but not others and have to worry about locking rules for GUP.
And the final step is then to just convert the remaining odd cases to a
new world order where 'expand_stack()' is called with the mmap_lock held
for reading, but where it might drop it and upgrade it to a write, only
to return with it held for reading (in the success case) or with it
completely dropped (in the failure case).
In the process, we remove all the stack expansion from GUP (where
dropping the lock wouldn't be ok without special rules anyway), and add
it in manually to __access_remote_vm() for ptrace().
Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases.
Everything else here felt pretty straightforward, but the ia64 rules for
stack expansion are really quite odd and very different from everything
else. Also thanks to Vegard Nossum who caught me getting one of those
odd conditions entirely the wrong way around.
Anyway, I think I want to actually move all the stack expansion code to
a whole new file of its own, rather than have it split up between
mm/mmap.c and mm/memory.c, but since this will have to be backported to
the initial maple tree vma introduction anyway, I tried to keep the
patches _fairly_ minimal.
Also, while I don't think it's valid to expand the stack from GUP, the
final patch in here is a "warn if some crazy GUP user wants to try to
expand the stack" patch. That one will be reverted before the final
release, but it's left to catch any odd cases during the merge window
and release candidates.
Reported-by: Ruihan Li <lrh2000@pku.edu.cn>
* branch 'expand-stack':
gup: add warning if some caller would seem to want stack expansion
mm: always expand the stack with the mmap write lock held
execve: expand new process stack manually ahead of time
mm: make find_extend_vma() fail if write lock not held
powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
arm/mm: Convert to using lock_mm_and_find_vma()
riscv/mm: Convert to using lock_mm_and_find_vma()
mips/mm: Convert to using lock_mm_and_find_vma()
powerpc/mm: Convert to using lock_mm_and_find_vma()
arm64/mm: Convert to using lock_mm_and_find_vma()
mm: make the page fault mmap locking killable
mm: introduce new 'lock_mm_and_find_vma()' page fault helper
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
|
|
Fix virtual vs physical address confusion (which currently are the same).
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
VMEM_MAX_PHYS is supposed to be the highest physical
address that can be added to the identity mapping.
It should match ident_map_size, which has the same
meaning. However, unlike ident_map_size it is not
adjusted against various limiting factors (see the
comment to setup_ident_map_size() function). That
renders all checks against VMEM_MAX_PHYS invalid.
Further, VMEM_MAX_PHYS is currently set to vmemmap,
which is an address in virtual memory space. However,
it gets compared against physical addresses in various
locations. That works, because both address spaces
are the same on s390, but otherwise it is wrong.
Instead of fixing VMEM_MAX_PHYS misuse and semantics
just remove it.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.
It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Miscellaneous features, cleanups, and fixes for vfs and individual fs
Features:
- Use mode 0600 for file created by cachefilesd so it can be run by
unprivileged users. This aligns them with directories which are
already created with mode 0700 by cachefilesd
- Reorder a few members in struct file to prevent some false sharing
scenarios
- Indicate that an eventfd is used a semaphore in the eventfd's
fdinfo procfs file
- Add a missing uapi header for eventfd exposing relevant uapi
defines
- Let the VFS protect transitions of a superblock from read-only to
read-write in addition to the protection it already provides for
transitions from read-write to read-only. Protecting read-only to
read-write transitions allows filesystems such as ext4 to perform
internal writes, keeping writers away until the transition is
completed
Cleanups:
- Arnd removed the architecture specific arch_report_meminfo()
prototypes and added a generic one into procfs.h. Note, we got a
report about a warning in amdpgpu codepaths that suggested this was
bisectable to this change but we concluded it was a false positive
- Remove unused parameters from split_fs_names()
- Rename put_and_unmap_page() to unmap_and_put_page() to let the name
reflect the order of the cleanup operation that has to unmap before
the actual put
- Unexport buffer_check_dirty_writeback() as it is not used outside
of block device aops
- Stop allocating aio rings from highmem
- Protecting read-{only,write} transitions in the VFS used open-coded
barriers in various places. Replace them with proper little helpers
and document both the helpers and all barrier interactions involved
when transitioning between read-{only,write} states
- Use flexible array members in old readdir codepaths
Fixes:
- Use the correct type __poll_t for epoll and eventfd
- Replace all deprecated strlcpy() invocations, whose return value
isn't checked with an equivalent strscpy() call
- Fix some kernel-doc warnings in fs/open.c
- Reduce the stack usage in jffs2's xattr codepaths finally getting
rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088
bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
royally annoying compilation warning
- Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not
fmode_t is required to avoid fmode_t to integer degradation
warnings
- Create coredumps with O_WRONLY instead of O_RDWR. There's a long
explanation in that commit how O_RDWR is actually a bug which we
found out with the help of Linus and git archeology
- Fix "no previous prototype" warnings in the pipe codepaths
- Add overflow calculations for remap_verify_area() as a signed
addition overflow could be triggered in xfstests
- Fix a null pointer dereference in sysv
- Use an unsigned variable for length calculations in jfs avoiding
compilation warnings with gcc 13
- Fix a dangling pipe pointer in the watch queue codepath
- The legacy mount option parser provided as a fallback by the VFS
for filesystems not yet converted to the new mount api did prefix
the generated mount option string with a leading ',' causing issues
for some filesystems
- Fix a repeated word in a comment in fs.h
- autofs: Update the ctime when mtime is updated as mandated by
POSIX"
* tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
readdir: Replace one-element arrays with flexible-array members
fs: Provide helpers for manipulating sb->s_readonly_remount
fs: Protect reconfiguration of sb read-write from racing writes
eventfd: add a uapi header for eventfd userspace APIs
autofs: set ctime as well when mtime changes on a dir
eventfd: show the EFD_SEMAPHORE flag in fdinfo
fs/aio: Stop allocating aio rings from HIGHMEM
fs: Fix comment typo
fs: unexport buffer_check_dirty_writeback
fs: avoid empty option when generating legacy mount string
watch_queue: prevent dangling pipe pointer
fs.h: Optimize file struct to prevent false sharing
highmem: Rename put_and_unmap_page() to unmap_and_put_page()
cachefiles: Allow the cache to be non-root
init: remove unused names parameter in split_fs_names()
jfs: Use unsigned variable for length calculations
fs/sysv: Null check to prevent null-ptr-deref bug
fs: use UB-safe check for signed addition overflow in remap_verify_area
procfs: consolidate arch_report_meminfo declaration
fs: pipe: reveal missing function protoypes
...
|
|
Since commit 3b5c3f000c2e ("s390/kasan: move shadow mapping
to decompressor") the decompressor establishes mappings for
the shadow memory and sets initial protection attributes to
RWX. The decompressed kernel resets protection to RW+NX
later on.
In case a shadow memory range is not aligned on page boundary
(e.g. as result of mem= kernel command line parameter use),
the "Checked W+X mappings: FAILED, 1 W+X pages found" warning
hits.
Reported-by: Vasily Gorbik <gor@linux.ibm.com>
Fixes: 557b19709da9 ("s390/kasan: move shadow mapping to decompressor")
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to
keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(),
and use pte_unmap_unlock() instead of direct spin_unlock() (even though
ptep ends up unused inside the macro).
Link: https://lkml.kernel.org/r/78873af-e1ec-4f9-47ac-483940ac6daa@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Add comment on mm's contract with s390 above __zap_zero_pages(),
and fix old comment there: must be called after THP was disabled.
Link: https://lkml.kernel.org/r/3ff29363-336a-9733-12a1-5c31a45c8aeb@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The arch_report_meminfo() function is provided by four architectures,
with a __weak fallback in procfs itself. On architectures that don't
have a custom version, the __weak version causes a warning because
of the missing prototype.
Remove the architecture specific prototypes and instead add one
in linux/proc_fs.h.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for arch/x86
Acked-by: Helge Deller <deller@gmx.de> # parisc
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Message-Id: <20230516195834.551901-1-arnd@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull more kvm updates from Paolo Bonzini:
"This includes the 6.4 changes for RISC-V, and a few bugfix patches for
other architectures. For x86, this closes a longstanding performance
issue in the newer and (usually) more scalable page table management
code.
RISC-V:
- ONE_REG interface to enable/disable SBI extensions
- Zbb extension for Guest/VM
- AIA CSR virtualization
x86:
- Fix a long-standing TDP MMU flaw, where unloading roots on a vCPU
can result in the root being freed even though the root is
completely valid and can be reused as-is (with a TLB flush).
s390:
- A couple of bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: fix race in gmap_make_secure()
KVM: s390: pv: fix asynchronous teardown for small VMs
KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
RISC-V: KVM: Virtualize per-HART AIA CSRs
RISC-V: KVM: Use bitmap for irqs_pending and irqs_pending_mask
RISC-V: KVM: Add ONE_REG interface for AIA CSRs
RISC-V: KVM: Implement subtype for CSR ONE_REG interface
RISC-V: KVM: Initial skeletal support for AIA
RISC-V: KVM: Drop the _MASK suffix from hgatp.VMID mask defines
RISC-V: Detect AIA CSRs from ISA string
RISC-V: Add AIA related CSR defines
RISC-V: KVM: Allow Zbb extension for Guest/VM
RISC-V: KVM: Add ONE_REG interface to enable/disable SBI extensions
RISC-V: KVM: Alphabetize selects
KVM: RISC-V: Retry fault if vma_lookup() results become invalid
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
For 6.4
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- Some DAMON cleanups from Kefeng Wang
- Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE
ioctl's behavior more similar to KSM's behavior.
[ Andrew called these "final", but I suspect we'll have a series fixing
up the fact that the last commit in the dmapools series in the
previous pull seems to have unintentionally just reverted all the
other commits in the same series.. - Linus ]
* tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: hwpoison: coredump: support recovery from dump_user_range()
mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
mm/ksm: move disabling KSM from s390/gmap code to KSM code
selftests/ksm: ksm_functional_tests: add prctl unmerge test
mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0
mm/damon/paddr: fix missing folio_sz update in damon_pa_young()
mm/damon/paddr: minor refactor of damon_pa_mark_accessed_or_deactivate()
mm/damon/paddr: minor refactor of damon_pa_pageout()
|
|
On machines without the Destroy Secure Configuration Fast UVC, the
topmost level of page tables is set aside and freed asynchronously
as last step of the asynchronous teardown.
Each gmap has a host_to_guest radix tree mapping host (userspace)
addresses (with 1M granularity) to gmap segment table entries (pmds).
If a guest is smaller than 2GB, the topmost level of page tables is the
segment table (i.e. there are only 2 levels). Replacing it means that
the pointers in the host_to_guest mapping would become stale and cause
all kinds of nasty issues.
This patch fixes the issue by disallowing asynchronous teardown for
guests with only 2 levels of page tables. Userspace should (and already
does) try using the normal destroy if the asynchronous one fails.
Update s390_replace_asce so it refuses to replace segment type ASCEs.
This is still needed in case the normal destroy VM fails.
Fixes: fb491d5500a7 ("KVM: s390: pv: asynchronous destroy for reboot")
Reviewed-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-Id: <20230421085036.52511-2-imbrenda@linux.ibm.com>
|
|
Let's factor out actual disabling of KSM. The existing "mm->def_flags &=
~VM_MERGEABLE;" was essentially a NOP and can be dropped, because
def_flags should never include VM_MERGEABLE. Note that we don't currently
prevent re-enabling KSM.
This should now be faster in case KSM was never enabled, because we only
conditionally iterate all VMAs. Further, it certainly looks cleaner.
Link: https://lkml.kernel.org/r/20230422210156.33630-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Stefan Roesch <shr@devkernel.io>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Add support for stackleak feature. Also allow specifying
architecture-specific stackleak poison function to enable faster
implementation. On s390, the mvc-based implementation helps decrease
typical overhead from a factor of 3 to just 25%
- Convert all assembler files to use SYM* style macros, deprecating the
ENTRY() macro and other annotations. Select ARCH_USE_SYM_ANNOTATIONS
- Improve KASLR to also randomize module and special amode31 code base
load addresses
- Rework decompressor memory tracking to support memory holes and
improve error handling
- Add support for protected virtualization AP binding
- Add support for set_direct_map() calls
- Implement set_memory_rox() and noexec module_alloc()
- Remove obsolete overriding of mem*() functions for KASAN
- Rework kexec/kdump to avoid using nodat_stack to call purgatory
- Convert the rest of the s390 code to use flexible-array member
instead of a zero-length array
- Clean up uaccess inline asm
- Enable ARCH_HAS_MEMBARRIER_SYNC_CORE
- Convert to using CONFIG_FUNCTION_ALIGNMENT and enable
DEBUG_FORCE_FUNCTION_ALIGN_64B
- Resolve last_break in userspace fault reports
- Simplify one-level sysctl registration
- Clean up branch prediction handling
- Rework CPU counter facility to retrieve available counter sets just
once
- Other various small fixes and improvements all over the code
* tag 's390-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (118 commits)
s390/stackleak: provide fast __stackleak_poison() implementation
stackleak: allow to specify arch specific stackleak poison function
s390: select ARCH_USE_SYM_ANNOTATIONS
s390/mm: use VM_FLUSH_RESET_PERMS in module_alloc()
s390: wire up memfd_secret system call
s390/mm: enable ARCH_HAS_SET_DIRECT_MAP
s390/mm: use BIT macro to generate SET_MEMORY bit masks
s390/relocate_kernel: adjust indentation
s390/relocate_kernel: use SYM* macros instead of ENTRY(), etc.
s390/entry: use SYM* macros instead of ENTRY(), etc.
s390/purgatory: use SYM* macros instead of ENTRY(), etc.
s390/kprobes: use SYM* macros instead of ENTRY(), etc.
s390/reipl: use SYM* macros instead of ENTRY(), etc.
s390/head64: use SYM* macros instead of ENTRY(), etc.
s390/earlypgm: use SYM* macros instead of ENTRY(), etc.
s390/mcount: use SYM* macros instead of ENTRY(), etc.
s390/crc32le: use SYM* macros instead of ENTRY(), etc.
s390/crc32be: use SYM* macros instead of ENTRY(), etc.
s390/crypto,chacha: use SYM* macros instead of ENTRY(), etc.
s390/amode31: use SYM* macros instead of ENTRY(), etc.
...
|
|
Instead of having callers care about the mmap_min_addr logic for the
lowest valid mapping address (and some of them getting it wrong), just
move the logic into vm_unmapped_area() itself. One less thing for various
architecture cases (and generic helpers) to worry about.
We should really try to make much more of this be common code, but baby
steps..
Without this, vm_unmapped_area() could return an address below
mmap_min_addr (because some caller forgot about that). That then causes
the mmap machinery to think it has found a workable address, but then
later security_mmap_addr(addr) is unhappy about it and the mmap() returns
with a nonsensical error (EPERM).
The proper action is to either return ENOMEM (if the virtual address space
is exhausted), or try to find another address (ie do a bottom-up search
for free addresses after the top-down one failed).
See commit 2afc745f3e30 ("mm: ensure get_unmapped_area() returns higher
address than mmap_min_addr"), which fixed this for one call site (the
generic arch_get_unmapped_area_topdown() fallback) but left other cases
alone.
Link: https://lkml.kernel.org/r/20230418214009.1142926-1-Liam.Howlett@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
Use case 1:
The madvise call is not available in the programming language. An
example for this are programs with forked workloads using a garbage
collected language without pointers. In such a language madvise cannot
be made available.
In addition the addresses of objects get moved around as they are
garbage collected. KSM sharing needs to be enabled "from the outside"
for these type of workloads.
Use case 2:
The same interpreter can also be used for workloads where KSM brings
no benefit or even has overhead. We'd like to be able to enable KSM on
a workload by workload basis.
Use case 3:
With the madvise call sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable number
of sharing opportunities may exist across multiple workloads or jobs (if
they are part of the same security domain). Only a higler level entity
like a job scheduler or container can know for certain if its running
one or more instances of a job. That job scheduler however doesn't have
the necessary internal workload knowledge to make targeted madvise
calls.
Security concerns:
In previous discussions security concerns have been brought up. The
problem is that an individual workload does not have the knowledge about
what else is running on a machine. Therefore it has to be very
conservative in what memory areas can be shared or not. However, if the
system is dedicated to running multiple jobs within the same security
domain, its the job scheduler that has the knowledge that sharing can be
safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
Here are the metrics from an instagram workload (taken from a machine
with 64GB main memory):
full_scans: 445
general_profit: 20158298048
max_page_sharing: 256
merge_across_nodes: 1
pages_shared: 129547
pages_sharing: 5119146
pages_to_scan: 4000
pages_unshared: 1760924
pages_volatile: 10761341
run: 1
sleep_millisecs: 20
stable_node_chains: 167
stable_node_chains_prune_millisecs: 2000
stable_node_dups: 2751
use_zero_pages: 0
zero_pages_sharing: 0
After the service is running for 30 minutes to an hour, 4 to 5 million
shared pages are common for this workload when using KSM.
Detailed changes:
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation,
but not calculated. This adds the general profit metric to
/sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests and ksm_functional_tests
This adds an option to specify the merge type to the ksm_tests.
This allows to test madvise and prctl KSM.
It also adds a two new tests to ksm_functional_tests: one to test
the new prctl options and the other one is a fork test to verify that
the KSM process setting is inherited by client processes.
This patch (of 3):
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a
cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag
is set, kernel samepage merging (ksm) gets enabled for all vma's of a
process.
2) Setting VM_MERGEABLE on VMA creation
When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
VM_MERGEABLE flag will be set for this VMA.
3) support disabling of ksm for a process
This adds the ability to disable ksm for a process if ksm has been
enabled for the process with prctl.
4) add new prctl option to get and set ksm for a process
This adds two new options to the prctl system call
- enable ksm for all vmas of a process (if the vmas support it).
- query if ksm has been enabled for a process.
3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
In the s390 architecture when storage keys are used, the
MMF_VM_MERGE_ANY will be disabled.
Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Make use of the set_direct_map() calls for module allocations.
In particular:
- All changes to read-only permissions in kernel VA mappings are also
applied to the direct mapping. Note that execute permissions are
intentionally not applied to the direct mapping in order to make
sure that all allocated pages within the direct mapping stay
non-executable
- module_alloc() passes the VM_FLUSH_RESET_PERMS to __vmalloc_node_range()
to make sure that all implicit permission changes made to the direct
mapping are reset when the allocated vm area is freed again
Side effects: the direct mapping will be fragmented depending on how many
vm areas with VM_FLUSH_RESET_PERMS and/or explicit page permission changes
are allocated and freed again.
For example, just after boot of a system the direct mapping statistics look
like:
$cat /proc/meminfo
...
DirectMap4k: 111628 kB
DirectMap1M: 16665600 kB
DirectMap2G: 0 kB
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Implement the set_direct_map_*() API, which allows to invalidate and set
default permissions to pages within the direct mapping.
Note that kernel_page_present(), which is also supposed to be part of this
API, is intentionally not implemented. The reason for this is that
kernel_page_present() is only used (and currently only makes sense) for
suspend/resume, which isn't supported on s390.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Commit bb1520d581a3 ("s390/mm: start kernel with DAT enabled") did not
implement direct map accounting in the early page table setup code. In
result the reported values are bogus now:
$cat /proc/meminfo
...
DirectMap4k: 5120 kB
DirectMap1M: 18446744073709546496 kB
DirectMap2G: 0 kB
Fix this by adding the missing accounting. The result looks sane again:
$cat /proc/meminfo
...
DirectMap4k: 6156 kB
DirectMap1M: 2091008 kB
DirectMap2G: 6291456 kB
Fixes: bb1520d581a3 ("s390/mm: start kernel with DAT enabled")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Given that set_memory_rox() and set_memory_rwnx() exist, it is possible
to get rid of all open coded __set_memory() usages and replace them with
proper helper calls everywhere.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
This is the s390 variant of "x86/mm: try VMA lock-based page fault handling
first".
Link: https://lkml.kernel.org/r/20230314132808.1266335-1-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Make use of atomic_fetch_xor() instead of an atomic_cmpxchg() loop to
implement atomic_xor_bits() (aka atomic_xor_return()). This makes the C
code more readable and in addition generates better code, since for z196
and newer a single lax instruction is generated instead of a cmpxchg()
loop.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Vasily Gorbik says:
===================
Combine and generalize all methods for finding unused memory in
decompressor, while decreasing complexity, add memory holes support,
while improving error handling (especially in low-memory conditions)
and debug-ability.
===================
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Since regular paging structs are initialized in decompressor already
move KASAN shadow mapping to decompressor as well. This helps to avoid
allocating KASAN required memory in 1 large chunk, de-duplicate paging
structs creation code and start the uncompressed kernel with KASAN
instrumentation right away. This also allows to avoid all pitfalls
accidentally calling KASAN instrumented code during KASAN initialization.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Allow changing page table attributes for KASAN shadow memory ranges.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Currently several approaches for finding unused memory in decompressor
are utilized. While "safe_addr" grows towards higher addresses, vmem
code allocates paging structures top down. The former requires careful
ordering. In addition to that ipl report handling code verifies potential
intersections with secure boot certificates on its own. Neither of two
approaches are memory holes aware and consistent with each other in low
memory conditions.
To solve that, existing approaches are generalized and combined
together, as well as online memory ranges are now taken into
consideration.
physmem_info has been extended to contain reserved memory ranges. New
set of functions allow to handle reserves and find unused memory.
All reserves and memory allocations are "typed". In case of out of
memory condition decompressor fails with detailed info on current
reserved ranges and usable online memory.
Linux version 6.2.0 ...
Kernel command line: ... mem=100M
Our of memory allocating 100000 bytes 100000 aligned in range 0:5800000
Reserved memory ranges:
0000000000000000 0000000003e33000 DECOMPRESSOR
0000000003f00000 00000000057648a3 INITRD
00000000063e0000 00000000063e8000 VMEM
00000000063eb000 00000000063f4000 VMEM
00000000063f7800 0000000006400000 VMEM
0000000005800000 0000000006300000 KASAN
Usable online memory ranges (info source: sclp read info [3]):
0000000000000000 0000000006400000
Usable online memory total: 6400000 Reserved: 61b10a3 Free: 24ef5d
Call Trace:
(sp:000000000002bd58 [<0000000000012a70>] physmem_alloc_top_down+0x60/0x14c)
sp:000000000002bdc8 [<0000000000013756>] _pa+0x56/0x6a
sp:000000000002bdf0 [<0000000000013bcc>] pgtable_populate+0x45c/0x65e
sp:000000000002be90 [<00000000000140aa>] setup_vmem+0x2da/0x424
sp:000000000002bec8 [<0000000000011c20>] startup_kernel+0x428/0x8b4
sp:000000000002bf60 [<00000000000100f4>] startup_normal+0xd4/0xd4
physmem_alloc_range allows to find free memory in specified range. It
should be used for one time allocations only like finding position for
amode31 and vmlinux.
physmem_alloc_top_down can be used just like physmem_alloc_range, but
it also allows multiple allocations per type and tries to merge sequential
allocations together. Which is useful for paging structures allocations.
If sequential allocations cannot be merged together they are "chained",
allowing easy per type reserved ranges enumeration and migration to
memblock later. Extra "struct reserved_range" allocated for chaining are
not tracked or reserved but rely on the fact that both
physmem_alloc_range and physmem_alloc_top_down search for free memory
only below current top down allocator position. All reserved ranges
should be transferred to memblock before memblock allocations are
enabled.
The startup code has been reordered to delay any memory allocations until
online memory ranges are detected and occupied memory ranges are marked as
reserved to be excluded from follow-up allocations.
Ipl report certificates are a special case, ipl report certificates list
is checked together with other memory reserves until certificates are
saved elsewhere.
KASAN required memory for shadow memory allocation and mapping is reserved
as 1 large chunk which is later passed to KASAN early initialization code.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
In preparation to extending mem_detect with additional information like
reserved ranges rename it to more generic physmem_info. This new naming
also help to avoid confusion by using more exact terms like "physmem
online ranges", etc.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
There is no need to declare an extra tables to just create directory,
this can be easily be done with a prefix path with register_sysctl().
Simplify this registration.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230310234525.3986352-6-mcgrof@kernel.org
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
There is no need to declare an extra tables to just create directory,
this can be easily be done with a prefix path with register_sysctl().
Simplify this registration.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230310234525.3986352-5-mcgrof@kernel.org
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- Add empty command line parameter handling stubs to kernel for all
command line parameters which are handled in the decompressor. This
avoids invalid "Unknown kernel command line parameters" messages from
the kernel, and also avoids that these will be incorrectly passed to
user space. This caused already confusion, therefore add the empty
stubs
- Add missing phys_to_virt() handling to machine check handler
- Introduce and use a union to be used for zcrypt inline assemblies.
This makes sure that only a register wide member of the union is
passed as input and output parameter to inline assemblies, while
usual C code uses other members of the union to access bit fields of
it
- Add and use a READ_ONCE_ALIGNED_128() macro, which can be used to
atomically read a 128-bit value from memory. This replaces the
(mis-)use of the 128-bit cmpxchg operation to do the same in cpum_sf
code. Currently gcc does not generate the used lpq instruction if
__READ_ONCE() is used for aligned 128-bit accesses, therefore use
this s390 specific helper
- Simplify machine check handler code if a task needs to be killed
because of e.g. register corruption due to a machine malfunction
- Perform CPU reset to clear pending interrupts and TLB entries on an
already stopped target CPU before delegating work to it
- Generate arch/s390/boot/vmlinux.map link map for the decompressor,
when CONFIG_VMLINUX_MAP is enabled for debugging purposes
- Fix segment type handling for dcssblk devices. It incorrectly always
returned type "READ/WRITE" even for read-only segements, which can
result in a kernel panic if somebody tries to write to a read-only
device
- Sort config S390 select list again
- Fix two kprobe reenter bugs revealed by a recently added kprobe kunit
test
* tag 's390-6.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/kprobes: fix current_kprobe never cleared after kprobes reenter
s390/kprobes: fix irq mask clobbering on kprobe reenter from post_handler
s390/Kconfig: sort config S390 select list again
s390/extmem: return correct segment type in __segment_load()
s390/decompressor: add link map saving
s390/smp: perform cpu reset before delegating work to target cpu
s390/mcck: cleanup user process termination path
s390/cpum_sf: use READ_ONCE_ALIGNED_128() instead of 128-bit cmpxchg
s390/rwonce: add READ_ONCE_ALIGNED_128() macro
s390/ap,zcrypt,vfio: introduce and use ap_queue_status_reg union
s390/nmi: fix virtual-physical address confusion
s390/setup: do not complain about parameters handled in decompressor
|
|
Commit f05f62d04271f ("s390/vmem: get rid of memory segment list")
reshuffled the call to vmem_add_mapping() in __segment_load(), which now
overwrites rc after it was set to contain the segment type code.
As result, __segment_load() will now always return 0 on success, which
corresponds to the segment type code SEG_TYPE_SW, i.e. a writeable
segment. This results in a kernel crash when loading a read-only segment
as dcssblk block device, and trying to write to it.
Instead of reshuffling code again, make sure to return the segment type
on success, and also describe this rather delicate and unexpected logic
in the function comment. Also initialize new segtype variable with
invalid value, to prevent possible future confusion.
Fixes: f05f62d04271 ("s390/vmem: get rid of memory segment list")
Cc: <stable@vger.kernel.org> # 5.9+
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
|
|
Baoquan He reported lots of KFENCE reports when /proc/kcore is read,
e.g. with crash or even simpler with dd:
BUG: KFENCE: invalid read in copy_from_kernel_nofault+0x5e/0x120
Invalid read at 0x00000000f4f5149f:
copy_from_kernel_nofault+0x5e/0x120
read_kcore+0x6b2/0x870
proc_reg_read+0x9a/0xf0
vfs_read+0x94/0x270
ksys_read+0x70/0x100
__do_syscall+0x1d0/0x200
system_call+0x82/0xb0
The reason for this is that read_kcore() simply reads memory that might
have been unmapped by KFENCE with copy_from_kernel_nofault(). Any fault due
to pages being unmapped by KFENCE would be handled gracefully by the fault
handler (exception table fixup).
However the s390 fault handler first reports the fault, and only afterwards
would perform the exception table fixup. Most architectures have this in
reversed order, which also avoids the false positive KFENCE reports when an
unmapped page is accessed.
Therefore change the s390 fault handler so it handles exception table
fixups before KFENCE page faults are reported.
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Acked-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20230213183858.1473681-1-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Commit bf64f0517e5d ("s390/mem_detect: handle online memory limit
just once") introduced truncation of mem_detect online ranges
based on identity mapping size. For kdump case however the full
set of online memory ranges has to be feed into memblock_physmem_add
so that crashed system memory could be extracted.
Instead of truncating introduce a "usable limit" which is respected by
mem_detect api. Also add extra online memory ranges iterator which still
provides full set of online memory ranges disregarding the "usable limit".
Fixes: bf64f0517e5d ("s390/mem_detect: handle online memory limit just once")
Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
RDP instruction allows to reset DAT-protection bit in a PTE, with less
CPU synchronization overhead than IPTE instruction. In particular, IPTE
can cause machine-wide synchronization overhead, and excessive IPTE usage
can negatively impact machine performance.
RDP can be used instead of IPTE, if the new PTE only differs in SW bits
and _PAGE_PROTECT HW bit, for PTE protection changes from RO to RW.
SW PTE bit changes are allowed, e.g. for dirty and young tracking, but none
of the other HW-defined part of the PTE must change. This is because the
architecture forbids such changes to an active and valid PTE, which
is why invalidation with IPTE is always used first, before writing a new
entry.
The RDP optimization helps mainly for fault-driven SW dirty-bit tracking.
Writable PTEs are initially always mapped with HW _PAGE_PROTECT bit set,
to allow SW dirty-bit accounting on first write protection fault, where
the DAT-protection would then be reset. The reset is now done with RDP
instead of IPTE, if RDP instruction is available.
RDP cannot always guarantee that the DAT-protection reset is propagated
to all CPUs immediately. This means that spurious TLB protection faults
on other CPUs can now occur. For this, common code provides a
flush_tlb_fix_spurious_fault() handler, which will now be used to do a
CPU-local TLB flush. However, this will clear the whole TLB of a CPU, and
not just the affected entry. For more fine-grained flushing, by simply
doing a (local) RDP again, flush_tlb_fix_spurious_fault() would need to
also provide the PTE pointer.
Note that spurious TLB protection faults cannot really be distinguished
from racing pagetable updates, where another thread already installed the
correct PTE. In such a case, the local TLB flush would be unnecessary
overhead, but overall reduction of CPU synchronization overhead by not
using IPTE is still expected to be beneficial.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The current definition already collapse with the generic definition of
vm_fault_reason. Move the private definitions to allocate bits from the
top of uint so they won't collapse anymore.
Signed-off-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20230205231704.909536-4-peterx@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Replace indirect modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
Link: https://lkml.kernel.org/r/20230126193752.297968-6-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
KASAN common code is able to handle memory hotplug and create KASAN shadow
memory on a fly. Online memory ranges are available from mem_detect,
use this information to avoid mapping KASAN shadow for standby memory.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Introduce mem_detect_truncate() to cut any online memory ranges above
established identity mapping size, so that mem_detect users wouldn't
have to do it over and over again.
Suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Kasan shadow memory area has been moved to the end of kernel address
space since commit 9a39abb7c9aa ("s390/boot: simplify and fix kernel
memory layout setup"), therefore skipping any memory ranges above
VMALLOC_START in empty page tables cleanup code already handles
KASAN shadow memory intersection case and explicit checks could be
removed.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Commit b9ff81003cf1 ("s390/vmem: cleanup empty page tables") introduced
empty page tables cleanup in vmem code, but when the kernel is built
with KASAN enabled the code has no effect due to wrong KASAN shadow
memory intersection condition, which effectively ignores any memory
range below KASAN shadow. Fix intersection condition to make code
work as anticipated.
Fixes: b9ff81003cf1 ("s390/vmem: cleanup empty page tables")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Kasan shadow memory area has been moved to the end of kernel address
space since commit 9a39abb7c9aa ("s390/boot: simplify and fix kernel
memory layout setup"). Change kasan memory layout note accordingly.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|