summaryrefslogtreecommitdiff
path: root/fs/binfmt_elf.c
AgeCommit message (Collapse)AuthorFilesLines
2024-07-24Merge tag 'execve-v6.11-rc1-fix1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fix from Kees Cook: "This moves the exec and binfmt_elf tests out of your way and into the tests/ subdirectory, following the newly ratified KUnit naming conventions. :)" * tag 'execve-v6.11-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: execve: Move KUnit tests to tests/ subdirectory
2024-07-23execve: Move KUnit tests to tests/ subdirectoryKees Cook1-1/+1
Move the exec KUnit tests into a separate directory to avoid polluting the local directory namespace. Additionally update MAINTAINERS for the new files. Reviewed-by: David Gow <davidgow@google.com> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20240720170310.it.942-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2024-07-16Merge tag 'execve-v6.11-rc1' of ↵Linus Torvalds1-32/+67
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - Use value of kernel.randomize_va_space once per exec (Alexey Dobriyan) - Honor PT_LOAD alignment for static PIE - Make bprm->argmin only visible under CONFIG_MMU - Add KUnit testing of bprm_stack_limits() * tag 'execve-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Avoid pathological argc, envc, and bprm->p values execve: Keep bprm->argmin behind CONFIG_MMU ELF: fix kernel.randomize_va_space double read exec: Add KUnit test for bprm_stack_limits() binfmt_elf: Honor PT_LOAD alignment for static PIE binfmt_elf: Calculate total_size earlier selftests/exec: Build both static and non-static load_address tests
2024-06-21ELF: fix kernel.randomize_va_space double readAlexey Dobriyan1-2/+3
ELF loader uses "randomize_va_space" twice. It is sysctl and can change at any moment, so 2 loads could see 2 different values in theory with unpredictable consequences. Issue exactly one load for consistent value across one exec. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Link: https://lore.kernel.org/r/3329905c-7eb8-400a-8f0a-d87cff979b5b@p183 Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-19binfmt_elf: Honor PT_LOAD alignment for static PIEKees Cook1-5/+37
The p_align values in PT_LOAD were ignored for static PIE executables (i.e. ET_DYN without PT_INTERP). This is because there is no way to request a non-fixed mmap region with a specific alignment. ET_DYN with PT_INTERP uses a separate base address (ELF_ET_DYN_BASE) and binfmt_elf performs the ASLR itself, which means it can also apply alignment. For the mmap region, the address selection happens deep within the vm_mmap() implementation (when the requested address is 0). The earlier attempt to implement this: commit 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE") commit 925346c129da ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders") did not take into account the different base address origins, and were eventually reverted: aeb7923733d1 ("revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"") In order to get the correct alignment from an mmap base, binfmt_elf must perform a 0-address load first, then tear down the mapping and perform alignment on the resulting address. Since this is slightly more overhead, only do this when it is needed (i.e. the alignment is not the default ELF alignment). This does, however, have the benefit of being able to use MAP_FIXED_NOREPLACE, to avoid potential collisions. With this fixed, enable the static PIE self tests again. Reported-by: H.J. Lu <hjl.tools@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215275 Link: https://lore.kernel.org/r/20240508173149.677910-3-keescook@chromium.org Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-19binfmt_elf: Calculate total_size earlierKees Cook1-25/+27
In preparation to support PT_LOAD with large p_align values on non-PT_INTERP ET_DYN executables (i.e. "static pie"), we'll need to use the total_size details earlier. Move this separately now to make the next patch more readable. As total_size and load_bias are currently calculated separately, this has no behavioral impact. Link: https://lore.kernel.org/r/20240508173149.677910-2-keescook@chromium.org Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-03fs: don't block i_writecount during execChristian Brauner1-2/+0
Back in 2021 we already discussed removing deny_write_access() for executables. Back then I was hesistant because I thought that this might cause issues in userspace. But even back then I had started taking some notes on what could potentially depend on this and I didn't come up with a lot so I've changed my mind and I would like to try this. Here are some of the notes that I took: (1) The deny_write_access() mechanism is causing really pointless issues such as [1]. If a thread in a thread-group opens a file writable, then writes some stuff, then closing the file descriptor and then calling execve() they can fail the execve() with ETXTBUSY because another thread in the thread-group could have concurrently called fork(). Multi-threaded libraries such as go suffer from this. (2) There are userspace attacks that rely on overwriting the binary of a running process. These attacks are _mitigated_ but _not at all prevented_ from ocurring by the deny_write_access() mechanism. I'll go over some details. The clearest example of such attacks was the attack against runC in CVE-2019-5736 (cf. [3]). An attack could compromise the runC host binary from inside a _privileged_ runC container. The malicious binary could then be used to take over the host. (It is crucial to note that this attack is _not_ possible with unprivileged containers. IOW, the setup here is already insecure.) The attack can be made when attaching to a running container or when starting a container running a specially crafted image. For example, when runC attaches to a container the attacker can trick it into executing itself. This could be done by replacing the target binary inside the container with a custom binary pointing back at the runC binary itself. As an example, if the target binary was /bin/bash, this could be replaced with an executable script specifying the interpreter path #!/proc/self/exe. As such when /bin/bash is executed inside the container, instead the target of /proc/self/exe will be executed. That magic link will point to the runc binary on the host. The attacker can then proceed to write to the target of /proc/self/exe to try and overwrite the runC binary on the host. However, this will not succeed because of deny_write_access(). Now, one might think that this would prevent the attack but it doesn't. To overcome this, the attacker has multiple ways: * Open a file descriptor to /proc/self/exe using the O_PATH flag and then proceed to reopen the binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it in a busy loop from a separate process. Ultimately it will succeed when the runC binary exits. After this the runC binary is compromised and can be used to attack other containers or the host itself. * Use a malicious shared library annotating a function in there with the constructor attribute making the malicious function run as an initializor. The malicious library will then open /proc/self/exe for creating a new entry under /proc/self/fd/<nr>. It'll then call exec to a) force runC to exit and b) hand the file descriptor off to a program that then reopens /proc/self/fd/<nr> for writing (which is now possible because runC has exited) and overwriting that binary. To sum up: the deny_write_access() mechanism doesn't prevent such attacks in insecure setups. It just makes them minimally harder. That's all. The only way back then to prevent this is to create a temporary copy of the calling binary itself when it starts or attaches to containers. So what I did back then for LXC (and Aleksa for runC) was to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. This sealed, in-memory file copy is then executed instead of the original on-disk binary. Any compromising write operations from a privileged container to the host binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host binary. Also as the temporary, in-memory binary is sealed, writes to this will also fail. The point is that deny_write_access() is uselss to prevent these attacks. (3) Denying write access to an inode because it's currently used in an exec path could easily be done on an LSM level. It might need an additional hook but that should be about it. (4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time ago so while we do protect the main executable the bigger portion of the things you'd think need protecting such as the shared libraries aren't. IOW, we let anyone happily overwrite shared libraries. (5) We removed all remaining uses of VM_DENYWRITE in [2]. That means: (5.1) We removed the legacy uselib() protection for preventing overwriting of shared libraries. Nobody cared in 3 years. (5.2) We allow write access to the elf interpreter after exec completed treating it on a par with shared libraries. Yes, someone in userspace could potentially be relying on this. It's not completely out of the realm of possibility but let's find out if that's actually the case and not guess. Link: https://github.com/golang/go/issues/22315 [1] Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2] Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3] Link: https://lwn.net/Articles/866493 Link: https://github.com/golang/go/issues/22220 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61 Link: https://github.com/buildkite/agent/pull/2736 Link: https://github.com/rust-lang/rust/issues/114554 Link: https://bugs.openjdk.org/browse/JDK-8068370 Link: https://github.com/dotnet/runtime/issues/58964 Link: https://lore.kernel.org/r/20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-20Merge tag 'mm-nonmm-stable-2024-05-19-11-56' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: "Mainly singleton patches, documented in their respective changelogs. Notable series include: - Some maintenance and performance work for ocfs2 in Heming Zhao's series "improve write IO performance when fragmentation is high". - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes exposed by fstests". - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean up kfifo.h". - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes for $lx_current and $lx_per_cpu". - After much discussion, a coding-style update from Barry Song explaining one reason why inline functions are preferred over macros. The series is "codingstyle: avoid unused parameters for a function-like macro"" * tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits) fs/proc: fix softlockup in __read_vmcore nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON() scripts: checkpatch: check unused parameters for function-like macro Documentation: coding-style: ask function-like macros to evaluate parameters nilfs2: use __field_struct() for a bitwise field selftests/kcmp: remove unused open mode nilfs2: remove calls to folio_set_error() and folio_clear_error() kernel/watchdog_perf.c: tidy up kerneldoc watchdog: allow nmi watchdog to use raw perf event watchdog: handle comma separated nmi_watchdog command line nilfs2: make superblock data array index computation sparse friendly squashfs: remove calls to set the folio error flag squashfs: convert squashfs_symlink_read_folio to use folio APIs scripts/gdb: fix detection of current CPU in KGDB scripts/gdb: make get_thread_info accept pointers scripts/gdb: fix parameter handling in $lx_per_cpu scripts/gdb: fix failing KGDB detection during probe kfifo: don't use "proxy" headers media: stih-cec: add missing io.h media: rc: add missing io.h ...
2024-05-08fs/coredump: Enable dynamic configuration of max file note sizeAllen Pais1-2/+5
Introduce the capability to dynamically configure the maximum file note size for ELF core dumps via sysctl. Why is this being done? We have observed that during a crash when there are more than 65k mmaps in memory, the existing fixed limit on the size of the ELF notes section becomes a bottleneck. The notes section quickly reaches its capacity, leading to incomplete memory segment information in the resulting coredump. This truncation compromises the utility of the coredumps, as crucial information about the memory state at the time of the crash might be omitted. This enhancement removes the previous static limit of 4MB, allowing system administrators to adjust the size based on system-specific requirements or constraints. Eg: $ sysctl -a | grep core_file_note_size_limit kernel.core_file_note_size_limit = 4194304 $ sysctl -n kernel.core_file_note_size_limit 4194304 $echo 519304 > /proc/sys/kernel/core_file_note_size_limit $sysctl -n kernel.core_file_note_size_limit 519304 Attempting to write beyond the ceiling value of 16MB $echo 17194304 > /proc/sys/kernel/core_file_note_size_limit bash: echo: write error: Invalid argument Signed-off-by: Vijay Nag <nagvijay@microsoft.com> Signed-off-by: Allen Pais <apais@linux.microsoft.com> Link: https://lore.kernel.org/r/20240506193700.7884-1-apais@linux.microsoft.com Signed-off-by: Kees Cook <keescook@chromium.org>
2024-04-26regset: use kvzalloc() for regset_get_alloc()Douglas Anderson1-1/+1
While browsing through ChromeOS crash reports, I found one with an allocation failure that looked like this: chrome: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=urgent,mems_allowed=0 CPU: 7 PID: 3295 Comm: chrome Not tainted 5.15.133-20574-g8044615ac35c #1 (HASH:1162 1) Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT) Call trace: ... warn_alloc+0x104/0x174 __alloc_pages+0x5f0/0x6e4 kmalloc_order+0x44/0x98 kmalloc_order_trace+0x34/0x124 __kmalloc+0x228/0x36c __regset_get+0x68/0xcc regset_get_alloc+0x1c/0x28 elf_core_dump+0x3d8/0xd8c do_coredump+0xeb8/0x1378 get_signal+0x14c/0x804 ... An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not a surprise that this allocation failed on a system that's been running for a while. More digging showed that it was fairly easy to see the order 7 allocation by just sending a SIGQUIT to chrome (or other processes) to generate a core dump. The actual amount being allocated was 279,584 bytes and it was for "core_note_type" NT_ARM_SVE. There was quite a bit of discussion [1] on the mailing lists in response to my v1 patch attempting to switch to vmalloc. The overall conclusion was that we could likely reduce the 279,584 byte allocation by quite a bit and Mark Brown has sent a patch to that effect [2]. However even with the 279,584 byte allocation gone there are still 65,552 byte allocations. These are just barely more than the 65,536 bytes and thus would require an order 5 allocation. An order 5 allocation is still something to avoid unless necessary and nothing needs the memory here to be contiguous. Change the allocation to kvzalloc() which should still be efficient for small allocations but doesn't force the memory subsystem to work hard (and maybe fail) at getting a large contiguous chunk. [1] https://lore.kernel.org/r/20240201171159.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid [2] https://lore.kernel.org/r/20240203-arm64-sve-ptrace-regset-size-v1-1-2c3ba1386b9e@kernel.org Link: https://lkml.kernel.org/r/20240205092626.v2.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Dave Martin <Dave.Martin@arm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-24binfmt_elf: Leave a gap between .bss and brkKees Cook1-0/+3
Currently the brk starts its randomization immediately after .bss, which means there is a chance that when the random offset is 0, linear overflows from .bss can reach into the brk area. Leave at least a single page gap between .bss and brk (when it has not already been explicitly relocated into the mmap range). Reported-by: <y0un9n132@gmail.com> Closes: https://lore.kernel.org/linux-hardening/CA+2EKTVLvc8hDZc+2Yhwmus=dzOUG5E4gV7ayCbu0MPJTZzWkw@mail.gmail.com/ Link: https://lore.kernel.org/r/20240217062545.1631668-2-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-10-04binfmt_elf: Only report padzero() errors when PROT_WRITEKees Cook1-9/+23
Errors with padzero() should be caught unless we're expecting a pathological (non-writable) segment. Report -EFAULT only when PROT_WRITE is present. Additionally add some more documentation to padzero(), elf_map(), and elf_load(). Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Suggested-by: Eric Biederman <ebiederm@xmission.com> Tested-by: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Sebastian Ott <sebott@redhat.com> Link: https://lore.kernel.org/r/20230929032435.2391507-5-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-10-04binfmt_elf: Use elf_load() for libraryKees Cook1-20/+4
While load_elf_library() is a libc5-ism, we can still replace most of its contents with elf_load() as well, further simplifying the code. Some historical context: - libc4 was a.out and used uselib (a.out support has been removed) - libc5 was ELF and used uselib (there may still be users) - libc6 is ELF and has never used uselib Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Suggested-by: Eric Biederman <ebiederm@xmission.com> Tested-by: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Sebastian Ott <sebott@redhat.com> Link: https://lore.kernel.org/r/20230929032435.2391507-4-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-09-29binfmt_elf: Use elf_load() for interpreterKees Cook1-45/+1
Handle arbitrary memsz>filesz in interpreter ELF segments, instead of only supporting it in the last segment (which is expected to be the BSS). Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Reported-by: Pedro Falcato <pedro.falcato@gmail.com> Closes: https://lore.kernel.org/lkml/20221106021657.1145519-1-pedro.falcato@gmail.com/ Tested-by: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Sebastian Ott <sebott@redhat.com> Link: https://lore.kernel.org/r/20230929032435.2391507-3-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-09-29binfmt_elf: elf_bss no longer used by load_elf_binary()Kees Cook1-5/+1
With the BSS handled generically via the new filesz/memsz mismatch handling logic in elf_load(), elf_bss no longer needs to be tracked. Drop the variable. Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Suggested-by: Eric Biederman <ebiederm@xmission.com> Tested-by: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Sebastian Ott <sebott@redhat.com> Link: https://lore.kernel.org/r/20230929032435.2391507-2-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-09-29binfmt_elf: Support segments with 0 filesz and misaligned startsEric W. Biederman1-63/+48
Implement a helper elf_load() that wraps elf_map() and performs all of the necessary work to ensure that when "memsz > filesz" the bytes described by "memsz > filesz" are zeroed. An outstanding issue is if the first segment has filesz 0, and has a randomized location. But that is the same as today. In this change I replaced an open coded padzero() that did not clear all of the way to the end of the page, with padzero() that does. I also stopped checking the return of padzero() as there is at least one known case where testing for failure is the wrong thing to do. It looks like binfmt_elf_fdpic may have the proper set of tests for when error handling can be safely completed. I found a couple of commits in the old history https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git, that look very interesting in understanding this code. commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail") commit c6e2227e4a3e ("[SPARC64]: Missing user access return value checks in fs/binfmt_elf.c and fs/compat.c") commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2") Looking at commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail"): > commit 39b56d902bf35241e7cba6cc30b828ed937175ad > Author: Pavel Machek <pavel@ucw.cz> > Date: Wed Feb 9 22:40:30 2005 -0800 > > [PATCH] binfmt_elf: clearing bss may fail > > So we discover that Borland's Kylix application builder emits weird elf > files which describe a non-writeable bss segment. > > So remove the clear_user() check at the place where we zero out the bss. I > don't _think_ there are any security implications here (plus we've never > checked that clear_user() return value, so whoops if it is a problem). > > Signed-off-by: Pavel Machek <pavel@suse.cz> > Signed-off-by: Andrew Morton <akpm@osdl.org> > Signed-off-by: Linus Torvalds <torvalds@osdl.org> It seems pretty clear that binfmt_elf_fdpic with skipping clear_user() for non-writable segments and otherwise calling clear_user(), aka padzero(), and checking it's return code is the right thing to do. I just skipped the error checking as that avoids breaking things. And notably, it looks like Borland's Kylix died in 2005 so it might be safe to just consider read-only segments with memsz > filesz an error. Reported-by: Sebastian Ott <sebott@redhat.com> Reported-by: Thomas Weißschuh <linux@weissschuh.net> Closes: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Link: https://lore.kernel.org/r/87sf71f123.fsf@email.froward.int.ebiederm.org Tested-by: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Sebastian Ott <sebott@redhat.com> Link: https://lore.kernel.org/r/20230929032435.2391507-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-06-29Merge branch 'expand-stack'Linus Torvalds1-3/+3
This modifies our user mode stack expansion code to always take the mmap_lock for writing before modifying the VM layout. It's actually something we always technically should have done, but because we didn't strictly need it, we were being lazy ("opportunistic" sounds so much better, doesn't it?) about things, and had this hack in place where we would extend the stack vma in-place without doing the proper locking. And it worked fine. We just needed to change vm_start (or, in the case of grow-up stacks, vm_end) and together with some special ad-hoc locking using the anon_vma lock and the mm->page_table_lock, it all was fairly straightforward. That is, it was all fine until Ruihan Li pointed out that now that the vma layout uses the maple tree code, we *really* don't just change vm_start and vm_end any more, and the locking really is broken. Oops. It's not actually all _that_ horrible to fix this once and for all, and do proper locking, but it's a bit painful. We have basically three different cases of stack expansion, and they all work just a bit differently: - the common and obvious case is the page fault handling. It's actually fairly simple and straightforward, except for the fact that we have something like 24 different versions of it, and you end up in a maze of twisty little passages, all alike. - the simplest case is the execve() code that creates a new stack. There are no real locking concerns because it's all in a private new VM that hasn't been exposed to anybody, but lockdep still can end up unhappy if you get it wrong. - and finally, we have GUP and page pinning, which shouldn't really be expanding the stack in the first place, but in addition to execve() we also use it for ptrace(). And debuggers do want to possibly access memory under the stack pointer and thus need to be able to expand the stack as a special case. None of these cases are exactly complicated, but the page fault case in particular is just repeated slightly differently many many times. And ia64 in particular has a fairly complicated situation where you can have both a regular grow-down stack _and_ a special grow-up stack for the register backing store. So to make this slightly more manageable, the bulk of this series is to first create a helper function for the most common page fault case, and convert all the straightforward architectures to it. Thus the new 'lock_mm_and_find_vma()' helper function, which ends up being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon, loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more than half the architectures, we now have more shared code and avoid some of those twisty little passages. And largely due to this common helper function, the full diffstat of this series ends up deleting more lines than it adds. That still leaves eight architectures (ia64, m68k, microblaze, openrisc, parisc, s390, sparc64 and um) that end up doing 'expand_stack()' manually because they are doing something slightly different from the normal pattern. Along with the couple of special cases in execve() and GUP. So there's a couple of patches that first create 'locked' helper versions of the stack expansion functions, so that there's a obvious path forward in the conversion. The execve() case is then actually pretty simple, and is a nice cleanup from our old "grow-up stackls are special, because at execve time even they grow down". The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because it's just more straightforward to write out the stack expansion there manually, instead od having get_user_pages_remote() do it for us in some situations but not others and have to worry about locking rules for GUP. And the final step is then to just convert the remaining odd cases to a new world order where 'expand_stack()' is called with the mmap_lock held for reading, but where it might drop it and upgrade it to a write, only to return with it held for reading (in the success case) or with it completely dropped (in the failure case). In the process, we remove all the stack expansion from GUP (where dropping the lock wouldn't be ok without special rules anyway), and add it in manually to __access_remote_vm() for ptrace(). Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases. Everything else here felt pretty straightforward, but the ia64 rules for stack expansion are really quite odd and very different from everything else. Also thanks to Vegard Nossum who caught me getting one of those odd conditions entirely the wrong way around. Anyway, I think I want to actually move all the stack expansion code to a whole new file of its own, rather than have it split up between mm/mmap.c and mm/memory.c, but since this will have to be backported to the initial maple tree vma introduction anyway, I tried to keep the patches _fairly_ minimal. Also, while I don't think it's valid to expand the stack from GUP, the final patch in here is a "warn if some crazy GUP user wants to try to expand the stack" patch. That one will be reverted before the final release, but it's left to catch any odd cases during the merge window and release candidates. Reported-by: Ruihan Li <lrh2000@pku.edu.cn> * branch 'expand-stack': gup: add warning if some caller would seem to want stack expansion mm: always expand the stack with the mmap write lock held execve: expand new process stack manually ahead of time mm: make find_extend_vma() fail if write lock not held powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma() mm/fault: convert remaining simple cases to lock_mm_and_find_vma() arm/mm: Convert to using lock_mm_and_find_vma() riscv/mm: Convert to using lock_mm_and_find_vma() mips/mm: Convert to using lock_mm_and_find_vma() powerpc/mm: Convert to using lock_mm_and_find_vma() arm64/mm: Convert to using lock_mm_and_find_vma() mm: make the page fault mmap locking killable mm: introduce new 'lock_mm_and_find_vma()' page fault helper
2023-06-27mm: always expand the stack with the mmap write lock heldLinus Torvalds1-1/+1
This finishes the job of always holding the mmap write lock when extending the user stack vma, and removes the 'write_locked' argument from the vm helper functions again. For some cases, we just avoid expanding the stack at all: drivers and page pinning really shouldn't be extending any stacks. Let's see if any strange users really wanted that. It's worth noting that architectures that weren't converted to the new lock_mm_and_find_vma() helper function are left using the legacy "expand_stack()" function, but it has been changed to drop the mmap_lock and take it for writing while expanding the vma. This makes it fairly straightforward to convert the remaining architectures. As a result of dropping and re-taking the lock, the calling conventions for this function have also changed, since the old vma may no longer be valid. So it will now return the new vma if successful, and NULL - and the lock dropped - if the area could not be extended. Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64 Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-25mm: make find_extend_vma() fail if write lock not heldLiam R. Howlett1-3/+3
Make calls to extend_vma() and find_extend_vma() fail if the write lock is required. To avoid making this a flag-day event, this still allows the old read-locking case for the trivial situations, and passes in a flag to say "is it write-locked". That way write-lockers can say "yes, I'm being careful", and legacy users will continue to work in all the common cases until they have been fully converted to the new world order. Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-23binfmt_elf: fix comment typo s/reset/regset/Baruch Siach1-1/+1
Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/0b2967c4a4141875c493e835d5a6f8f2d19ae2d6.1687499804.git.baruch@tkos.co.il
2023-05-17coredump, vmcore: Set p_align to 4 for PT_NOTEFangrui Song1-1/+1
Tools like readelf/llvm-readelf use p_align to parse a PT_NOTE program header as an array of 4-byte entries or 8-byte entries. Currently, there are workarounds[1] in place for Linux to treat p_align==0 as 4. However, it would be more appropriate to set the correct alignment so that tools do not have to rely on guesswork. FreeBSD coredumps set p_align to 4 as well. [1]: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=82ed9683ec099d8205dc499ac84febc975235af6 [2]: https://reviews.llvm.org/D150022 Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230512022528.3430327-1-maskray@google.com
2023-04-28Merge tag 'mm-nonmm-stable-2023-04-27-16-01' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Mainly singleton patches all over the place. Series of note are: - updates to scripts/gdb from Glenn Washburn - kexec cleanups from Bjorn Helgaas" * tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits) mailmap: add entries for Paul Mackerras libgcc: add forward declarations for generic library routines mailmap: add entry for Oleksandr ocfs2: reduce ioctl stack usage fs/proc: add Kthread flag to /proc/$pid/status ia64: fix an addr to taddr in huge_pte_offset() checkpatch: introduce proper bindings license check epoll: rename global epmutex scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry() scripts/gdb: create linux/vfs.py for VFS related GDB helpers uapi/linux/const.h: prefer ISO-friendly __typeof__ delayacct: track delays from IRQ/SOFTIRQ scripts/gdb: timerlist: convert int chunks to str scripts/gdb: print interrupts scripts/gdb: raise error with reduced debugging information scripts/gdb: add a Radix Tree Parser lib/rbtree: use '+' instead of '|' for setting color. proc/stat: remove arch_idle_time() checkpatch: check for misuse of the link tags checkpatch: allow Closes tags with links ...
2023-04-13binfmt_elf: remove MODULE_LICENSE in non-modulesNick Alcock1-1/+0
Since commit 8b41fc4454e ("kbuild: create modules.builtin without Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations are used to identify modules. As a consequence, uses of the macro in non-modules will cause modprobe to misidentify their containing object file as a module when it is not (false positives), and modprobe might succeed rather than failing with a suitable error message. So remove it in the files in this commit, none of which can be built as modules. Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Suggested-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: linux-modules@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-08ELF: fix all "Elf" typosAlexey Dobriyan1-1/+1
ELF is acronym and therefore should be spelled in all caps. I left one exception at Documentation/arm/nwfpe/nwfpe.rst which looks like being written in the first person. Link: https://lkml.kernel.org/r/Y/3wGWQviIOkyLJW@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31Merge tag 'v6.2-rc6' into sched/core, to pick up fixesIngo Molnar1-2/+2
Pick up fixes before merging another batch of cpuidle updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-05elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}Catalin Marinas1-2/+2
A subsequent fix for arm64 will use this parameter to parse the vma information from the snapshot created by dump_vma_snapshot() rather than traversing the vma list without the mmap_lock. Fixes: 6dd8b1a0b6cb ("arm64: mte: Dump the MTE tags in the core file") Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Seth Jenkins <sethjenkins@google.com> Suggested-by: Seth Jenkins <sethjenkins@google.com> Cc: Will Deacon <will@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221222181251.1345752-3-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-12-27rseq: Introduce feature size and alignment ELF auxiliary vector entriesMathieu Desnoyers1-0/+5
Export the rseq feature size supported by the kernel as well as the required allocation alignment for the rseq per-thread area to user-space through ELF auxiliary vector entries. This is part of the extensible rseq ABI. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221122203932.231377-3-mathieu.desnoyers@efficios.com
2022-12-13Merge tag 'pull-elfcore' of ↵Linus Torvalds1-220/+51
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull elf coredumping updates from Al Viro: "Unification of regset and non-regset sides of ELF coredump handling. Collecting per-thread register values is the only thing that needs to be ifdefed there..." * tag 'pull-elfcore' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [elf] get rid of get_note_info_size() [elf] unify regset and non-regset cases [elf][non-regset] use elf_core_copy_task_regs() for dumper as well [elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs argument) elf_core_copy_task_regs(): task_pt_regs is defined everywhere [elf][regset] simplify thread list handling in fill_note_info() [elf][regset] clean fill_note_info() a bit kill extern of vsyscall32_sysctl kill coredump_params->regs kill signal_pt_regs()
2022-11-25[elf] get rid of get_note_info_size()Al Viro1-6/+1
it's trivial now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25[elf] unify regset and non-regset casesAl Viro1-192/+38
The only real difference is in filling per-thread notes - getting the values of registers. And this is the only part that is worth an ifdef - we don't need to duplicate the logics regarding gathering threads, filling other notes, etc. It would've been hard to do back when regset-based variant had been introduced, mostly due to sharing bits and pieces of helpers with aout coredumps. As the result, too much had been duplicated and the copies had drifted away since then. Now it can be done cleanly... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25[elf][non-regset] use elf_core_copy_task_regs() for dumper as wellAl Viro1-1/+1
elf_core_copy_regs() is equivalent to elf_core_copy_task_regs() of current on all architectures. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25[elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs ↵Al Viro1-3/+2
argument) Don't bother with pointless macros - we are not sharing it with aout coredumps anymore. Just convert the underlying functions to the same arguments (nobody uses regs, actually) and call them elf_core_copy_task_fpregs(). And unexport the entire bunch, while we are at it. [added missing includes in arch/{csky,m68k,um}/kernel/process.c to avoid extra warnings about the lack of externs getting added to huge piles for those files. Pointless, but...] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-18binfmt_elf: replace IS_ERR() with IS_ERR_VALUE()Bo Liu1-3/+3
Avoid typecasts that are needed for IS_ERR() and use IS_ERR_VALUE() instead. Signed-off-by: Bo Liu <liubo03@inspur.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221115031757.2426-1-liubo03@inspur.com
2022-10-26binfmt_elf: simplify error handling in load_elf_phdrs()Rolf Eike Beer1-8/+2
The err variable was the same like retval, but capped to <= 0. This is the same as retval as elf_read() never returns positive values. Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/4137126.7Qn9TF0dmF@mobilepool36.emlix.com
2022-10-26binfmt_elf: fix documented return value for load_elf_phdrs()Rolf Eike Beer1-1/+1
This function has never returned anything but a plain NULL. Fixes: 6a8d38945cf4 ("binfmt_elf: Hoist ELF program header loading to a function") Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/2359389.EDbqzprbEW@mobilepool36.emlix.com
2022-10-26binfmt: Fix whitespace issuesKees Cook1-7/+7
Fix the annoying whitespace issues that have been following these files around for years. Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20221018071350.never.230-kees@kernel.org
2022-10-26fs/binfmt_elf: Fix memory leak in load_elf_binary()Li Zetao1-1/+2
There is a memory leak reported by kmemleak: unreferenced object 0xffff88817104ef80 (size 224): comm "xfs_admin", pid 47165, jiffies 4298708825 (age 1333.476s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 60 a8 b3 00 81 88 ff ff a8 10 5a 00 81 88 ff ff `.........Z..... backtrace: [<ffffffff819171e1>] __alloc_file+0x21/0x250 [<ffffffff81918061>] alloc_empty_file+0x41/0xf0 [<ffffffff81948cda>] path_openat+0xea/0x3d30 [<ffffffff8194ec89>] do_filp_open+0x1b9/0x290 [<ffffffff8192660e>] do_open_execat+0xce/0x5b0 [<ffffffff81926b17>] open_exec+0x27/0x50 [<ffffffff81a69250>] load_elf_binary+0x510/0x3ed0 [<ffffffff81927759>] bprm_execve+0x599/0x1240 [<ffffffff8192a997>] do_execveat_common.isra.0+0x4c7/0x680 [<ffffffff8192b078>] __x64_sys_execve+0x88/0xb0 [<ffffffff83bbf0a5>] do_syscall_64+0x35/0x80 If "interp_elf_ex" fails to allocate memory in load_elf_binary(), the program will take the "out_free_ph" error handing path, resulting in "interpreter" file resource is not released. Fix it by adding an error handing path "out_free_file", which will release the file resource when "interp_elf_ex" failed to allocate memory. Fixes: 0693ffebcfe5 ("fs/binfmt_elf.c: allocate less for static executable") Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221024154421.982230-1-lizetao1@huawei.com
2022-10-24[elf][regset] simplify thread list handling in fill_note_info()Al Viro1-12/+10
fill_note_info() iterates through the list of threads collected in mm->core_state->dumper, allocating a struct elf_thread_core_info instance for each and linking those into a list. We need the entry corresponding to current to be first in the resulting list, so the logics for list insertion is if it's for current or list is empty insert in the head else insert after the first element However, in mm->core_state->dumper the entry for current is guaranteed to be the first one. Which means that both parts of condition will be true on the first iteration and neither will be true on all subsequent ones. Taking the first iteration out of the loop simplifies things nicely... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-10-24[elf][regset] clean fill_note_info() a bitAl Viro1-9/+2
*info is already initialized... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-10-24kill coredump_params->regsAl Viro1-2/+2
it's always task_pt_regs(current) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-04-16revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"Andrew Morton1-2/+2
Despite Mike's attempted fix (925346c129da117122), regressions reports continue: https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/ https://bugzilla.kernel.org/show_bug.cgi?id=215720 https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info So revert this patch. Fixes: 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE") Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fangrui Song <maskray@google.com> Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-16revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"Andrew Morton1-1/+1
Commit 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders") was an attempt to fix regressions due to 9630f0d60fec5f ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE"). But regressionss continue to be reported: https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/ https://bugzilla.kernel.org/show_bug.cgi?id=215720 https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info This patch reverts the fix, so the original can also be reverted. Fixes: 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders") Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Song Liu <songliubraving@fb.com> Cc: David Rientjes <rientjes@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Fangrui Song <maskray@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22Merge tag 'execve-v5.18-rc1' of ↵Linus Torvalds1-70/+83
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: "Execve and binfmt updates. Eric and I have stepped up to be the active maintainers of this area, so here's our first collection. The bulk of the work was in coredump handling fixes; additional details are noted below: - Handle unusual AT_PHDR offsets (Akira Kawata) - Fix initial mapping size when PT_LOADs are not ordered (Alexey Dobriyan) - Move more code under CONFIG_COREDUMP (Alexey Dobriyan) - Fix missing mmap_lock in file_files_note (Eric W. Biederman) - Remove a.out support for alpha and m68k (Eric W. Biederman) - Include first pages of non-exec ELF libraries in coredump (Jann Horn) - Don't write past end of notes for regset gap in coredump (Rick Edgecombe) - Comment clean-ups (Tom Rix) - Force single empty string when argv is empty (Kees Cook) - Add NULL argv selftest (Kees Cook) - Properly redefine PT_GNU_* in terms of PT_LOOS (Kees Cook) - MAINTAINERS: Update execve entry with tree (Kees Cook) - Introduce initial KUnit testing for binfmt_elf (Kees Cook)" * tag 'execve-v5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf: Don't write past end of notes for regset gap a.out: Stop building a.out/osf1 support on alpha and m68k coredump: Don't compile flat_core_dump when coredumps are disabled coredump: Use the vma snapshot in fill_files_note coredump/elf: Pass coredump_params into fill_note_info coredump: Remove the WARN_ON in dump_vma_snapshot coredump: Snapshot the vmas in do_coredump coredump: Move definition of struct coredump_params into coredump.h binfmt_elf: Introduce KUnit test ELF: Properly redefine PT_GNU_* in terms of PT_LOOS MAINTAINERS: Update execve entry with more details exec: cleanup comments fs/binfmt_elf: Refactor load_elf_binary function fs/binfmt_elf: Fix AT_PHDR for unusual ELF files binfmt: move more stuff undef CONFIG_COREDUMP selftests/exec: Test for empty string on NULL argv exec: Force single empty string when argv is empty coredump: Also dump first pages of non-executable ELF libraries ELF: fix overflow in total mapping size calculation
2022-03-18binfmt_elf: Don't write past end of notes for regset gapRick Edgecombe1-10/+14
In fill_thread_core_info() the ptrace accessible registers are collected to be written out as notes in a core file. The note array is allocated from a size calculated by iterating the user regset view, and counting the regsets that have a non-zero core_note_type. However, this only allows for there to be non-zero core_note_type at the end of the regset view. If there are any gaps in the middle, fill_thread_core_info() will overflow the note allocation, as it iterates over the size of the view and the allocation would be smaller than that. There doesn't appear to be any arch that has gaps such that they exceed the notes allocation, but the code is brittle and tries to support something it doesn't. It could be fixed by increasing the allocation size, but instead just have the note collecting code utilize the array better. This way the allocation can stay smaller. Even in the case of no arch's that have gaps in their regset views, this introduces a change in the resulting indicies of t->notes. It does not introduce any changes to the core file itself, because any blank notes are skipped in write_note_info(). In case, the allocation logic between fill_note_info() and fill_thread_core_info() ever diverges from the usage logic, warn and skip writing any notes that would overflow the array. This fix is derrived from an earlier one[0] by Yu-cheng Yu. [0] https://lore.kernel.org/lkml/20180717162502.32274-1-yu-cheng.yu@intel.com/ Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220317192013.13655-4-rick.p.edgecombe@intel.com
2022-03-08coredump: Use the vma snapshot in fill_files_noteEric W. Biederman1-12/+12
Matthew Wilcox reported that there is a missing mmap_lock in file_files_note that could possibly lead to a user after free. Solve this by using the existing vma snapshot for consistency and to avoid the need to take the mmap_lock anywhere in the coredump code except for dump_vma_snapshot. Update the dump_vma_snapshot to capture vm_pgoff and vm_file that are neeeded by fill_files_note. Add free_vma_snapshot to free the captured values of vm_file. Reported-by: Matthew Wilcox <willy@infradead.org> Link: https://lkml.kernel.org/r/20220131153740.2396974-1-willy@infradead.org Cc: stable@vger.kernel.org Fixes: a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot") Fixes: 2aa362c49c31 ("coredump: extend core dump note section to contain file names of mapped files") Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-08coredump/elf: Pass coredump_params into fill_note_infoEric W. Biederman1-11/+11
Instead of individually passing cprm->siginfo and cprm->regs into fill_note_info pass all of struct coredump_params. This is preparation to allow fill_files_note to use the existing vma snapshot. Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-08coredump: Snapshot the vmas in do_coredumpEric W. Biederman1-13/+7
Move the call of dump_vma_snapshot and kvfree(vma_meta) out of the individual coredump routines into do_coredump itself. This makes the code less error prone and easier to maintain. Make the vma snapshot available to the coredump routines in struct coredump_params. This makes it easier to change and update what is captures in the vma snapshot and will be needed for fixing fill_file_notes. Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-04binfmt_elf: Introduce KUnit testKees Cook1-0/+4
Adds simple KUnit test for some binfmt_elf internals: specifically a regression test for the problem fixed by commit 8904d9cd90ee ("ELF: fix overflow in total mapping size calculation"). $ ./tools/testing/kunit/kunit.py run --arch x86_64 \ --kconfig_add CONFIG_IA32_EMULATION=y '*binfmt_elf' ... [19:41:08] ================== binfmt_elf (1 subtest) ================== [19:41:08] [PASSED] total_mapping_size_test [19:41:08] =================== [PASSED] binfmt_elf ==================== [19:41:08] ============== compat_binfmt_elf (1 subtest) =============== [19:41:08] [PASSED] total_mapping_size_test [19:41:08] ================ [PASSED] compat_binfmt_elf ================ [19:41:08] ============================================================ [19:41:08] Testing complete. Passed: 2, Failed: 0, Crashed: 0, Skipped: 0, Errors: 0 Cc: Eric Biederman <ebiederm@xmission.com> Cc: David Gow <davidgow@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Magnus Groß" <magnus.gross@rwth-aachen.de> Cc: kunit-dev@googlegroups.com Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> --- v1: https://lore.kernel.org/lkml/20220224054332.1852813-1-keescook@chromium.org v2: - improve commit log - fix comment URL (Daniel) - drop redundant KUnit Kconfig help info (Daniel) - note in Kconfig help that COMPAT builds add a compat test (David)
2022-03-02fs/binfmt_elf: Refactor load_elf_binary functionAkira Kawata1-8/+6
I delete load_addr because it is not used anymore. And I rename load_addr_set to first_pt_load because it is used only to capture the first iteration of the loop. Signed-off-by: Akira Kawata <akirakawata1@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220127124014.338760-3-akirakawata1@gmail.com
2022-03-02fs/binfmt_elf: Fix AT_PHDR for unusual ELF filesAkira Kawata1-6/+18
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=197921 As pointed out in the discussion of buglink, we cannot calculate AT_PHDR as the sum of load_addr and exec->e_phoff. : The AT_PHDR of ELF auxiliary vectors should point to the memory address : of program header. But binfmt_elf.c calculates this address as follows: : : NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff); : : which is wrong since e_phoff is the file offset of program header and : load_addr is the memory base address from PT_LOAD entry. : : The ld.so uses AT_PHDR as the memory address of program header. In normal : case, since the e_phoff is usually 64 and in the first PT_LOAD region, it : is the correct program header address. : : But if the address of program header isn't equal to the first PT_LOAD : address + e_phoff (e.g. Put the program header in other non-consecutive : PT_LOAD region), ld.so will try to read program header from wrong address : then crash or use incorrect program header. This is because exec->e_phoff is the offset of PHDRs in the file and the address of PHDRs in the memory may differ from it. This patch fixes the bug by calculating the address of program headers from PT_LOADs directly. Signed-off-by: Akira Kawata <akirakawata1@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220127124014.338760-2-akirakawata1@gmail.com