From 24c776355f4097316a763005434ffff716aa21a8 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Sun, 14 Dec 2025 16:51:56 -0800 Subject: kernel.h: drop hex.h and update all hex.h users Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of hex.h interfaces to directly #include <linux/hex.h> as part of the process of putting kernel.h on a diet. Removing hex.h from kernel.h means that 36K C source files don't have to pay the price of parsing hex.h for the roughly 120 C source files that need it. This change has been build-tested with allmodconfig on most ARCHes. Also, all users/callers of <linux/hex.h> in the entire source tree have been updated if needed (if not already #included). Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap Reviewed-by: Andy Shevchenko Cc: Ingo Molnar Cc: Yury Norov (NVIDIA) Signed-off-by: Andrew Morton --- include/linux/kernel.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 5b46924fdff5..35b8f2a5aca5 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -21,7 +21,6 @@ #include #include #include -#include <linux/hex.h> #include #include #include -- cgit v1.2.3 From 436debc9cad892576d4f3287446b64474922764c Mon Sep 17 00:00:00 2001 From: Alejandro Colomar Date: Thu, 11 Dec 2025 11:43:49 +0100 Subject: array_size.h: add ARRAY_END() Patch series "Add ARRAY_END(), and use it to fix off-by-one bugs", v6. Add ARRAY_END(), and use it to fix off-by-one bugs ARRAY_END() is a macro to calculate a pointer to one past the last element of an array argument. This is a very common pointer, which is used to iterate over all elements of an array: for (T *p = a; p < ARRAY_END(a); p++) ... Of course, this pointer should never be dereferenced. 
A pointer one past the last element of an array should not be dereferenced; it's perfectly fine to hold such a pointer --and a good thing to do--, but the only thing it should be used for is comparing it with other pointers derived from the same array. Due to how special these pointers are, it would be good to use consistent naming. It's common to name such a pointer 'end' --in fact, we have many such cases in the kernel--. C++ even standardized this name with std::end(). Let's try naming such pointers 'end', and also try to avoid using 'end' for pointers that are not the result of ARRAY_END(). It has been incorrectly suggested that these pointers are dangerous, and that they should never be used, suggesting to use something like #define ARRAY_LAST(a) ((a) + ARRAY_SIZE(a) - 1) for (T *p = a; p <= ARRAY_LAST(a); p++) ... This is bogus, as it doesn't scale down to arrays of 0 elements. In the case of an array of 0 elements, ARRAY_LAST() would underflow the pointer, which not only can't be dereferenced but can't even be held (it produces Undefined Behavior). That would be a footgun. Such arrays don't exist per the ISO C standard; however, GCC supports them as an extension (with partial support, though; GCC has a few bugs which need to be fixed). This patch set fixes a few places where it was intended to use the array end (that is, one past the last element), but accidentally a pointer to the last element was used instead, thus wasting one byte. It also replaces other places where the array end was correctly calculated with ARRAY_SIZE(), by using the simpler ARRAY_END(). Also, there was one drivers/ file that already defined this macro. We remove that definition, to not conflict with this one. This patch (of 4): ARRAY_END() returns a pointer one past the end of the last element in the array argument. This pointer is useful for iterating over the elements of an array: for (T *p = a; p < ARRAY_END(a); p++) ... 
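The iteration idiom described above can be tried outside the kernel with a minimal userspace sketch. The two macros below are stand-ins written for this illustration: the in-tree ARRAY_SIZE() additionally applies __must_be_array(), which is omitted here, and sum_table() with its data is purely hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace stand-ins for the kernel macros (assumption: no
 * __must_be_array() check, which the in-tree ARRAY_SIZE() has).
 */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
#define ARRAY_END(arr)  (&(arr)[ARRAY_SIZE(arr)])

/* Hypothetical helper: sum a fixed table by walking it up to 'end'. */
static int sum_table(void)
{
	static const int table[] = { 1, 2, 3, 4 };
	int sum = 0;

	/* 'end' is only compared against; it is never dereferenced. */
	for (const int *p = table, *end = ARRAY_END(table); p < end; p++)
		sum += *p;

	return sum;
}
```

Naming the one-past-the-end pointer 'end' follows the convention the cover letter proposes; the loop condition uses '<' rather than '<=', so the same form keeps working if the table shrinks to a single element.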
Link: https://lkml.kernel.org/r/cover.1765449750.git.alx@kernel.org Link: https://lkml.kernel.org/r/5973cfb674192bc8e533485dbfb54e3062896be1.1765449750.git.alx@kernel.org Signed-off-by: Alejandro Colomar Cc: Kees Cook Cc: Christopher Bazley Cc: Rasmus Villemoes Cc: Marco Elver Cc: Michal Hocko Cc: Linus Torvalds Cc: Al Viro Cc: Alexander Potapenko Cc: Dmitriy Vyukov Cc: Jann Horn Cc: Maciej W. Rozycki Signed-off-by: Andrew Morton --- drivers/block/floppy.c | 2 -- include/linux/array_size.h | 6 ++++++ 2 files changed, 6 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c index c28786e0fe1c..92e446a64371 100644 --- a/drivers/block/floppy.c +++ b/drivers/block/floppy.c @@ -4802,8 +4802,6 @@ static void floppy_release_allocated_regions(int fdc, const struct io_region *p) } } -#define ARRAY_END(X) (&((X)[ARRAY_SIZE(X)])) - static int floppy_request_regions(int fdc) { const struct io_region *p; diff --git a/include/linux/array_size.h b/include/linux/array_size.h index 06d7d83196ca..0c4fec98822e 100644 --- a/include/linux/array_size.h +++ b/include/linux/array_size.h @@ -10,4 +10,10 @@ */ #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr)) +/** + * ARRAY_END - get a pointer to one past the last element in array @arr + * @arr: array + */ +#define ARRAY_END(arr) (&(arr)[ARRAY_SIZE(arr)]) + #endif /* _LINUX_ARRAY_SIZE_H */ -- cgit v1.2.3 From acfdbb4ab2910ff6f03becb569c23ac7b2223913 Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Fri, 28 Nov 2025 14:59:16 +0100 Subject: module: add helper function for reading module_buildid() Add a helper function for reading the optional "build_id" member of struct module. It is going to be used also in ftrace_mod_address_lookup(). Use "#ifdef" instead of "#if IS_ENABLED()" to match the declaration of the optional field in struct module. 
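As a rough illustration of why the accessor uses "#ifdef", the sketch below guards an optional struct member and its helper with the same preprocessor test, so callers never need their own conditional. All names here (DEMO_HAVE_BUILD_ID, struct demo_module, demo_buildid()) are hypothetical stand-ins, not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch standing in for CONFIG_STACKTRACE_BUILD_ID. */
#define DEMO_HAVE_BUILD_ID 1

struct demo_module {
	const char *name;
#ifdef DEMO_HAVE_BUILD_ID
	unsigned char build_id[20];	/* member exists only when enabled */
#endif
};

/*
 * The accessor is guarded by the same #ifdef as the member, so it
 * compiles in both configurations and returns NULL when the member
 * is absent.
 */
static inline const unsigned char *demo_buildid(struct demo_module *mod)
{
#ifdef DEMO_HAVE_BUILD_ID
	return mod->build_id;
#else
	return NULL;
#endif
}
```

With the helper in place, a caller can write "*out = demo_buildid(mod);" unconditionally, which is the shape the kallsyms.c hunk below takes.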
Link: https://lkml.kernel.org/r/20251128135920.217303-4-pmladek@suse.com Signed-off-by: Petr Mladek Reviewed-by: Daniel Gomez Reviewed-by: Petr Pavlu Cc: Aaron Tomlin Cc: Alexei Starovoitov Cc: Daniel Borkman Cc: John Fastabend Cc: Kees Cook Cc: Luis Chamberalin Cc: Marc Rutland Cc: "Masami Hiramatsu (Google)" Cc: Sami Tolvanen Cc: Steven Rostedt (Google) Signed-off-by: Andrew Morton --- include/linux/module.h | 9 +++++++++ kernel/module/kallsyms.c | 9 ++------- 2 files changed, 11 insertions(+), 7 deletions(-) (limited to 'include/linux') diff --git a/include/linux/module.h b/include/linux/module.h index d80c3ea57472..ac254525014c 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -748,6 +748,15 @@ static inline void __module_get(struct module *module) __mod ? __mod->name : "kernel"; \ }) +static inline const unsigned char *module_buildid(struct module *mod) +{ +#ifdef CONFIG_STACKTRACE_BUILD_ID + return mod->build_id; +#else + return NULL; +#endif +} + /* Dereference module function descriptor */ void *dereference_module_function_descriptor(struct module *mod, void *ptr); diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c index 00a60796327c..0fc11e45df9b 100644 --- a/kernel/module/kallsyms.c +++ b/kernel/module/kallsyms.c @@ -334,13 +334,8 @@ int module_address_lookup(unsigned long addr, if (mod) { if (modname) *modname = mod->name; - if (modbuildid) { -#if IS_ENABLED(CONFIG_STACKTRACE_BUILD_ID) - *modbuildid = mod->build_id; -#else - *modbuildid = NULL; -#endif - } + if (modbuildid) + *modbuildid = module_buildid(mod); sym = find_kallsyms_symbol(mod, addr, size, offset); -- cgit v1.2.3 From cd6735896d0343942cf3dafb48ce32eb79341990 Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Fri, 28 Nov 2025 14:59:18 +0100 Subject: kallsyms/bpf: rename __bpf_address_lookup() to bpf_address_lookup() bpf_address_lookup() has been used only in kallsyms_lookup_buildid(). 
It was supposed to set @modname and @modbuildid when the symbol was in a module. But it always just cleared @modname because BPF symbols were never in a module. And it did not clear @modbuildid because the pointer was not passed. The wrapper is no longer needed. Both @modname and @modbuildid are now always initialized to NULL in kallsyms_lookup_buildid(). Remove the wrapper and rename __bpf_address_lookup() to bpf_address_lookup() because this variant is used everywhere. [akpm@linux-foundation.org: fix loongarch] Link: https://lkml.kernel.org/r/20251128135920.217303-6-pmladek@suse.com Fixes: 9294523e3768 ("module: add printk formats to add module build ID to stacktraces") Signed-off-by: Petr Mladek Acked-by: Alexei Starovoitov Cc: Aaron Tomlin Cc: Daniel Borkman Cc: Daniel Gomez Cc: John Fastabend Cc: Kees Cook Cc: Luis Chamberalin Cc: Marc Rutland Cc: "Masami Hiramatsu (Google)" Cc: Petr Pavlu Cc: Sami Tolvanen Cc: Steven Rostedt (Google) Signed-off-by: Andrew Morton --- arch/arm64/net/bpf_jit_comp.c | 2 +- arch/loongarch/net/bpf_jit.c | 2 +- arch/powerpc/net/bpf_jit_comp.c | 2 +- include/linux/filter.h | 26 ++++---------------------- kernel/bpf/core.c | 4 ++-- kernel/kallsyms.c | 5 ++--- 6 files changed, 11 insertions(+), 30 deletions(-) (limited to 'include/linux') diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index b6eb7a465ad2..1d657bd3ce65 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -2951,7 +2951,7 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t, u64 plt_target = 0ULL; bool poking_bpf_entry; - if (!__bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf)) + if (!bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf)) /* Only poking bpf text is supported. Since kernel function * entry is set up by ftrace, we reply on ftrace to poke kernel * functions. 
diff --git a/arch/loongarch/net/bpf_jit.c b/arch/loongarch/net/bpf_jit.c index d1d5a65308b9..3b63bc5b99d9 100644 --- a/arch/loongarch/net/bpf_jit.c +++ b/arch/loongarch/net/bpf_jit.c @@ -1319,7 +1319,7 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t, /* Only poking bpf text is supported. Since kernel function entry * is set up by ftrace, we rely on ftrace to poke kernel functions. */ - if (!__bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf)) + if (!bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf)) return -ENOTSUPP; image = ip - offset; diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c index 5e976730b2f5..e199976e410a 100644 --- a/arch/powerpc/net/bpf_jit_comp.c +++ b/arch/powerpc/net/bpf_jit_comp.c @@ -1122,7 +1122,7 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t, bpf_func = (unsigned long)ip; /* We currently only support poking bpf programs */ - if (!__bpf_address_lookup(bpf_func, &size, &offset, name)) { + if (!bpf_address_lookup(bpf_func, &size, &offset, name)) { pr_err("%s (0x%lx): kernel/modules are not supported\n", __func__, bpf_func); return -EOPNOTSUPP; } diff --git a/include/linux/filter.h b/include/linux/filter.h index fd54fed8f95f..7452817d707d 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1375,24 +1375,13 @@ static inline bool bpf_jit_kallsyms_enabled(void) return false; } -int __bpf_address_lookup(unsigned long addr, unsigned long *size, - unsigned long *off, char *sym); +int bpf_address_lookup(unsigned long addr, unsigned long *size, + unsigned long *off, char *sym); bool is_bpf_text_address(unsigned long addr); int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type, char *sym); struct bpf_prog *bpf_prog_ksym_find(unsigned long addr); -static inline int -bpf_address_lookup(unsigned long addr, unsigned long *size, - unsigned long *off, char **modname, char *sym) -{ - int ret = __bpf_address_lookup(addr, size, off, 
sym); - - if (ret && modname) - *modname = NULL; - return ret; -} - void bpf_prog_kallsyms_add(struct bpf_prog *fp); void bpf_prog_kallsyms_del(struct bpf_prog *fp); @@ -1431,8 +1420,8 @@ static inline bool bpf_jit_kallsyms_enabled(void) } static inline int -__bpf_address_lookup(unsigned long addr, unsigned long *size, - unsigned long *off, char *sym) +bpf_address_lookup(unsigned long addr, unsigned long *size, + unsigned long *off, char *sym) { return 0; } @@ -1453,13 +1442,6 @@ static inline struct bpf_prog *bpf_prog_ksym_find(unsigned long addr) return NULL; } -static inline int -bpf_address_lookup(unsigned long addr, unsigned long *size, - unsigned long *off, char **modname, char *sym) -{ - return 0; -} - static inline void bpf_prog_kallsyms_add(struct bpf_prog *fp) { } diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index f1c5fc66ef01..8f6d8f1c4946 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -714,8 +714,8 @@ static struct bpf_ksym *bpf_ksym_find(unsigned long addr) return n ? 
container_of(n, struct bpf_ksym, tnode) : NULL; } -int __bpf_address_lookup(unsigned long addr, unsigned long *size, - unsigned long *off, char *sym) +int bpf_address_lookup(unsigned long addr, unsigned long *size, + unsigned long *off, char *sym) { struct bpf_ksym *ksym; int ret = 0; diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index c0898327836c..a37cafdf52ca 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -345,7 +345,7 @@ int kallsyms_lookup_size_offset(unsigned long addr, unsigned long *symbolsize, return 1; } return !!module_address_lookup(addr, symbolsize, offset, NULL, NULL, namebuf) || - !!__bpf_address_lookup(addr, symbolsize, offset, namebuf); + !!bpf_address_lookup(addr, symbolsize, offset, namebuf); } static int kallsyms_lookup_buildid(unsigned long addr, @@ -386,8 +386,7 @@ static int kallsyms_lookup_buildid(unsigned long addr, ret = module_address_lookup(addr, symbolsize, offset, modname, modbuildid, namebuf); if (!ret) - ret = bpf_address_lookup(addr, symbolsize, - offset, modname, namebuf); + ret = bpf_address_lookup(addr, symbolsize, offset, namebuf); if (!ret) ret = ftrace_mod_address_lookup(addr, symbolsize, -- cgit v1.2.3 From e8a1e7eaa19d0b757b06a2f913e3eeb4b1c002c6 Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Fri, 28 Nov 2025 14:59:19 +0100 Subject: kallsyms/ftrace: set module buildid in ftrace_mod_address_lookup() __sprint_symbol() might access an invalid pointer when kallsyms_lookup_buildid() returns a symbol found by ftrace_mod_address_lookup(). The ftrace lookup function must set both @modname and @modbuildid the same way as module_address_lookup(). 
Link: https://lkml.kernel.org/r/20251128135920.217303-7-pmladek@suse.com Fixes: 9294523e3768 ("module: add printk formats to add module build ID to stacktraces") Signed-off-by: Petr Mladek Reviewed-by: Aaron Tomlin Acked-by: Steven Rostedt (Google) Cc: Alexei Starovoitov Cc: Daniel Borkman Cc: Daniel Gomez Cc: John Fastabend Cc: Kees Cook Cc: Luis Chamberalin Cc: Marc Rutland Cc: "Masami Hiramatsu (Google)" Cc: Petr Pavlu Cc: Sami Tolvanen Signed-off-by: Andrew Morton --- include/linux/ftrace.h | 6 ++++-- kernel/kallsyms.c | 4 ++-- kernel/trace/ftrace.c | 5 ++++- 3 files changed, 10 insertions(+), 5 deletions(-) (limited to 'include/linux') diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index a3a8989e3268..dc844d7e693d 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -87,11 +87,13 @@ struct ftrace_hash; defined(CONFIG_DYNAMIC_FTRACE) int ftrace_mod_address_lookup(unsigned long addr, unsigned long *size, - unsigned long *off, char **modname, char *sym); + unsigned long *off, char **modname, + const unsigned char **modbuildid, char *sym); #else static inline int ftrace_mod_address_lookup(unsigned long addr, unsigned long *size, - unsigned long *off, char **modname, char *sym) + unsigned long *off, char **modname, + const unsigned char **modbuildid, char *sym) { return 0; } diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index a37cafdf52ca..0f639c907336 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -389,8 +389,8 @@ static int kallsyms_lookup_buildid(unsigned long addr, ret = bpf_address_lookup(addr, symbolsize, offset, namebuf); if (!ret) - ret = ftrace_mod_address_lookup(addr, symbolsize, - offset, modname, namebuf); + ret = ftrace_mod_address_lookup(addr, symbolsize, offset, + modname, modbuildid, namebuf); return ret; } diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index aa758efc3731..304505c11686 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -7753,7 +7753,8 @@ 
ftrace_func_address_lookup(struct ftrace_mod_map *mod_map, int ftrace_mod_address_lookup(unsigned long addr, unsigned long *size, - unsigned long *off, char **modname, char *sym) + unsigned long *off, char **modname, + const unsigned char **modbuildid, char *sym) { struct ftrace_mod_map *mod_map; int ret = 0; @@ -7765,6 +7766,8 @@ ftrace_mod_address_lookup(unsigned long addr, unsigned long *size, if (ret) { if (modname) *modname = mod_map->mod->name; + if (modbuildid) + *modbuildid = module_buildid(mod_map->mod); break; } } -- cgit v1.2.3 From ad533a740c7ccb801619ed962807605254fe7545 Mon Sep 17 00:00:00 2001 From: Christian Marangi Date: Sat, 13 Dec 2025 12:53:09 +0100 Subject: resource: provide 0args DEFINE_RES variant for unset resource desc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Provide a variant of DEFINE_RES that takes 0 arguments to initialize an "unset" resource descriptor. This should be used for the improper case of struct resource res = {}; where DEFINE_RES() should be used. With this new helper variant, it would result in: struct resource res = DEFINE_RES(); instead of having to define the full 3 arguments: struct resource res = DEFINE_RES(0, 0, IORESOURCE_UNSET); DEFINE_RES() with no args, will set the flags to IORESOURCE_UNSET signaling the resource descriptor is UNSET and doesn't reflect an actual resource currently. 
Link: https://lkml.kernel.org/r/20251213115314.16700-1-ansuelsmth@gmail.com Signed-off-by: Christian Marangi Suggested-by: Ilpo Järvinen Reviewed-by: Bjorn Helgaas Reviewed-by: Andy Shevchenko Signed-off-by: Andrew Morton --- include/linux/ioport.h | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/ioport.h b/include/linux/ioport.h index 9afa30f9346f..e974fc087059 100644 --- a/include/linux/ioport.h +++ b/include/linux/ioport.h @@ -10,6 +10,7 @@ #define _LINUX_IOPORT_H #ifndef __ASSEMBLY__ +#include #include #include #include @@ -165,8 +166,12 @@ enum { #define DEFINE_RES_NAMED(_start, _size, _name, _flags) \ DEFINE_RES_NAMED_DESC(_start, _size, _name, _flags, IORES_DESC_NONE) -#define DEFINE_RES(_start, _size, _flags) \ +#define __DEFINE_RES0() \ + DEFINE_RES_NAMED(0, 0, NULL, IORESOURCE_UNSET) +#define __DEFINE_RES3(_start, _size, _flags) \ DEFINE_RES_NAMED(_start, _size, NULL, _flags) +#define DEFINE_RES(...) \ + CONCATENATE(__DEFINE_RES, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__) #define DEFINE_RES_IO_NAMED(_start, _size, _name) \ DEFINE_RES_NAMED((_start), (_size), (_name), IORESOURCE_IO) -- cgit v1.2.3 From e896c44aecfb7b3470470b4e63495dfa2b359060 Mon Sep 17 00:00:00 2001 From: Thomas Weißschuh Date: Tue, 30 Dec 2025 08:13:15 +0100 Subject: types: drop definition of __EXPORTED_HEADERS__ MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This definition disarms the warning in uapi/linux/types.h about including kernel headers from user space. However the warning is already disarmed due to the fact that kernel code is built with -D__KERNEL__. Drop the pointless definition. 
Link: https://lkml.kernel.org/r/20251230-exported-headers-types-h-v1-1-947fc606f3d8@linutronix.de Signed-off-by: Thomas Weißschuh Cc: Arnd Bergmann Signed-off-by: Andrew Morton --- include/linux/types.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/types.h b/include/linux/types.h index d4437e9c452c..0cbb684eec5c 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -2,7 +2,6 @@ #ifndef _LINUX_TYPES_H #define _LINUX_TYPES_H -#define __EXPORTED_HEADERS__ #include #ifndef __ASSEMBLY__ -- cgit v1.2.3 From 10d1c75ed4382a8e79874379caa2ead8952734f9 Mon Sep 17 00:00:00 2001 From: Harshit Mogalapalli Date: Tue, 30 Dec 2025 22:16:07 -0800 Subject: ima: verify the previous kernel's IMA buffer lies in addressable RAM Patch series "Address page fault in ima_restore_measurement_list()", v3. When the second-stage kernel is booted via kexec with a limiting command line such as "mem=", we observe a page fault: BUG: unable to handle page fault for address: ffff97793ff47000 RIP: ima_restore_measurement_list+0xdc/0x45a #PF: error_code(0x0000) not-present page This happens on x86_64 only, as this is already fixed in aarch64 in commit: cbf9c4b9617b ("of: check previous kernel's ima-kexec-buffer against memory bounds") This patch (of 3): When the second-stage kernel is booted with a limiting command line (e.g. "mem="), the IMA measurement buffer handed over from the previous kernel may fall outside the addressable RAM of the new kernel. Accessing such a buffer can fault during early restore. Introduce a small generic helper, ima_validate_range(), which verifies that a physical [start, end] range for the previous-kernel IMA buffer lies within addressable memory: - On x86, use pfn_range_is_mapped(). - On OF based architectures, use page_is_ram(). 
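The shape of the range check can be sketched in userspace. This is only an approximation under stated assumptions: RAM_LIMIT and range_is_addressable() are hypothetical, a single upper bound stands in for pfn_range_is_mapped() / page_is_ram(), and the GCC/Clang builtin __builtin_add_overflow() stands in for the kernel's check_add_overflow().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: pretend addressable RAM is the range [0, RAM_LIMIT). */
#define RAM_LIMIT 0x10000000ULL

static bool range_is_addressable(uint64_t phys, size_t size)
{
	uint64_t end;

	/*
	 * Assumption of this sketch: an empty buffer is rejected up front,
	 * because size - 1 would wrap around. The patch itself does not
	 * special-case size == 0.
	 */
	if (size == 0)
		return false;

	/* Inclusive end address, phys + size - 1; reject on wraparound. */
	if (__builtin_add_overflow(phys, (uint64_t)size - 1, &end))
		return false;

	/* Stand-in for the per-architecture "is this RAM?" query. */
	return end < RAM_LIMIT;
}
```

Computing the inclusive end first means a buffer that ends exactly on the last addressable byte is accepted, while one that crosses the limit by a single byte is rejected.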
Link: https://lkml.kernel.org/r/20251231061609.907170-1-harshit.m.mogalapalli@oracle.com Link: https://lkml.kernel.org/r/20251231061609.907170-2-harshit.m.mogalapalli@oracle.com Signed-off-by: Harshit Mogalapalli Reviewed-by: Mimi Zohar Cc: Alexander Graf Cc: Ard Biesheuvel Cc: Borislav Betkov Cc: guoweikang Cc: Henry Willard Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jiri Bohac Cc: Joel Granados Cc: Jonathan McDowell Cc: Mike Rapoport Cc: Paul Webb Cc: Sohil Mehta Cc: Sourabh Jain Cc: Thomas Gleinxer Cc: Yifei Liu Cc: Baoquan He Cc: Signed-off-by: Andrew Morton --- include/linux/ima.h | 1 + security/integrity/ima/ima_kexec.c | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+) (limited to 'include/linux') diff --git a/include/linux/ima.h b/include/linux/ima.h index 8e29cb4e6a01..abf8923f8fc5 100644 --- a/include/linux/ima.h +++ b/include/linux/ima.h @@ -69,6 +69,7 @@ static inline int ima_measure_critical_data(const char *event_label, #ifdef CONFIG_HAVE_IMA_KEXEC int __init ima_free_kexec_buffer(void); int __init ima_get_kexec_buffer(void **addr, size_t *size); +int ima_validate_range(phys_addr_t phys, size_t size); #endif #ifdef CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c index 5beb69edd12f..36a34c54de58 100644 --- a/security/integrity/ima/ima_kexec.c +++ b/security/integrity/ima/ima_kexec.c @@ -12,6 +12,8 @@ #include #include #include +#include +#include #include #include #include "ima.h" @@ -294,3 +296,36 @@ void __init ima_load_kexec_buffer(void) pr_debug("Error restoring the measurement list: %d\n", rc); } } + +/* + * ima_validate_range - verify a physical buffer lies in addressable RAM + * @phys: physical start address of the buffer from previous kernel + * @size: size of the buffer + * + * On success return 0. On failure returns -EINVAL so callers can skip + * restoring. 
+ */ +int ima_validate_range(phys_addr_t phys, size_t size) +{ + unsigned long start_pfn, end_pfn; + phys_addr_t end_phys; + + if (check_add_overflow(phys, (phys_addr_t)size - 1, &end_phys)) + return -EINVAL; + + start_pfn = PHYS_PFN(phys); + end_pfn = PHYS_PFN(end_phys); + +#ifdef CONFIG_X86 + if (!pfn_range_is_mapped(start_pfn, end_pfn)) +#else + if (!page_is_ram(start_pfn) || !page_is_ram(end_pfn)) +#endif + { + pr_warn("IMA: previous kernel measurement buffer %pa (size 0x%zx) lies outside available memory\n", + &phys, size); + return -EINVAL; + } + + return 0; +} -- cgit v1.2.3 From a7e53bfb43667dd0eaf046c1725105e2cfe3be7c Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Mon, 5 Jan 2026 18:58:34 +0200 Subject: kho/abi: luo: make generated documentation more coherent Patch series "kho: ABI headers and Documentation updates". LUO started adding KHO ABI headers to include/linux/kho/abi, but the core parts of KHO and memblock are still using the old way for descriptions on their ABIs. Let's consolidate all things KHO in include/linux/kho/abi. And while on that, make some documentation updates to have more coherent KHO docs. This patch (of 6): LUO ABI description starts with "This header defines" which is fine in the header but reads weird in the generated html documentation. Update it to make the generated documentation coherent. 
Link: https://lkml.kernel.org/r/20260105165839.285270-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20260105165839.285270-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: Jonathan Corbet Cc: Pasha Tatashin Cc: Jason Miu Signed-off-by: Andrew Morton --- include/linux/kho/abi/luo.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index bb099c92e469..beb86847b544 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -8,10 +8,10 @@ /** * DOC: Live Update Orchestrator ABI * - * This header defines the stable Application Binary Interface used by the - * Live Update Orchestrator to pass state from a pre-update kernel to a - * post-update kernel. The ABI is built upon the Kexec HandOver framework - * and uses a Flattened Device Tree to describe the preserved data. + * Live Update Orchestrator uses the stable Application Binary Interface + * defined below to pass state from a pre-update kernel to a post-update + * kernel. The ABI is built upon the Kexec HandOver framework and uses a + * Flattened Device Tree to describe the preserved data. * * This interface is a contract. Any modification to the FDT structure, node * properties, compatible strings, or the layout of the `__packed` serialization -- cgit v1.2.3 From 32cb2729c956162e5ca96fe5509b38eb9561e8c0 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Mon, 5 Jan 2026 18:58:35 +0200 Subject: kho/abi: memfd: make generated documentation more coherent memfd preservation ABI description starts with "This header defines" which is fine in the header but reads weird in the generated html documentation. Update it to make the generated documentation coherent. 
Link: https://lkml.kernel.org/r/20260105165839.285270-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: Jason Miu Cc: Jonathan Corbet Cc: Pasha Tatashin Signed-off-by: Andrew Morton --- include/linux/kho/abi/memfd.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/kho/abi/memfd.h b/include/linux/kho/abi/memfd.h index da7d063474a1..c211c31334a3 100644 --- a/include/linux/kho/abi/memfd.h +++ b/include/linux/kho/abi/memfd.h @@ -17,8 +17,8 @@ /** * DOC: memfd Live Update ABI * - * This header defines the ABI for preserving the state of a memfd across a - * kexec reboot using the LUO. + * memfd uses the ABI defined below for preserving its state across a kexec + * reboot using the LUO. * * The state is serialized into a packed structure `struct memfd_luo_ser` * which is handed over to the next kernel via the KHO mechanism. -- cgit v1.2.3 From 5e1ea1e27b6ff237122ac6cb30e0b8ea4618f75f Mon Sep 17 00:00:00 2001 From: Jason Miu Date: Mon, 5 Jan 2026 18:58:37 +0200 Subject: kho: introduce KHO FDT ABI header Introduce the `include/linux/kho/abi/kexec_handover.h` header file, which defines the stable ABI for the KHO mechanism. This header specifies how preserved data is passed between kernels using an FDT. The ABI contract includes the FDT structure, node properties, and the "kho-v1" compatible string. By centralizing these definitions, this header serves as the foundational agreement for inter-kernel communication of preserved states, ensuring forward compatibility and preventing misinterpretation of data across kexec transitions. Since the ABI definitions are now centralized in the header files, the YAML files that previously described the FDT interfaces are redundant. These redundant files have therefore been removed. 
Link: https://lkml.kernel.org/r/20260105165839.285270-5-rppt@kernel.org Signed-off-by: Jason Miu Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: Jonathan Corbet Cc: Pasha Tatashin Signed-off-by: Andrew Morton --- Documentation/core-api/kho/abi.rst | 16 +++++ Documentation/core-api/kho/bindings/kho.yaml | 43 ------------ Documentation/core-api/kho/bindings/sub-fdt.yaml | 27 -------- Documentation/core-api/kho/index.rst | 9 +++ MAINTAINERS | 1 + include/linux/kho/abi/kexec_handover.h | 85 ++++++++++++++++++++++++ kernel/liveupdate/kexec_handover.c | 19 +++--- 7 files changed, 120 insertions(+), 80 deletions(-) create mode 100644 Documentation/core-api/kho/abi.rst delete mode 100644 Documentation/core-api/kho/bindings/kho.yaml delete mode 100644 Documentation/core-api/kho/bindings/sub-fdt.yaml create mode 100644 include/linux/kho/abi/kexec_handover.h (limited to 'include/linux') diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst new file mode 100644 index 000000000000..a1ee0f481727 --- /dev/null +++ b/Documentation/core-api/kho/abi.rst @@ -0,0 +1,16 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +================== +Kexec Handover ABI +================== + +Core Kexec Handover ABI +======================== + +.. kernel-doc:: include/linux/kho/abi/kexec_handover.h + :doc: Kexec Handover ABI + +See Also +======== + +- :doc:`/admin-guide/mm/kho` diff --git a/Documentation/core-api/kho/bindings/kho.yaml b/Documentation/core-api/kho/bindings/kho.yaml deleted file mode 100644 index 11e8ab7b219d..000000000000 --- a/Documentation/core-api/kho/bindings/kho.yaml +++ /dev/null @@ -1,43 +0,0 @@ -# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) -%YAML 1.2 ---- -title: Kexec HandOver (KHO) root tree - -maintainers: - - Mike Rapoport - - Changyuan Lyu - -description: | - System memory preserved by KHO across kexec. 
- -properties: - compatible: - enum: - - kho-v1 - - preserved-memory-map: - description: | - physical address (u64) of an in-memory structure describing all preserved - folios and memory ranges. - -patternProperties: - "$[0-9a-f_]+^": - $ref: sub-fdt.yaml# - description: physical address of a KHO user's own FDT. - -required: - - compatible - - preserved-memory-map - -additionalProperties: false - -examples: - - | - kho { - compatible = "kho-v1"; - preserved-memory-map = <0xf0be16 0x1000000>; - - memblock { - fdt = <0x80cc16 0x1000000>; - }; - }; diff --git a/Documentation/core-api/kho/bindings/sub-fdt.yaml b/Documentation/core-api/kho/bindings/sub-fdt.yaml deleted file mode 100644 index b9a3d2d24850..000000000000 --- a/Documentation/core-api/kho/bindings/sub-fdt.yaml +++ /dev/null @@ -1,27 +0,0 @@ -# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) -%YAML 1.2 ---- -title: KHO users' FDT address - -maintainers: - - Mike Rapoport - - Changyuan Lyu - -description: | - Physical address of an FDT blob registered by a KHO user. - -properties: - fdt: - description: | - physical address (u64) of an FDT blob. - -required: - - fdt - -additionalProperties: false - -examples: - - | - memblock { - fdt = <0x80cc16 0x1000000>; - }; diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index 1733b3c3e976..dcc6a36cc134 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -31,6 +31,15 @@ can retrieve and restore the preserved state from KHO FDT. Subsystems participating in KHO can define their own format for state serialization and preservation. +KHO FDT and structures defined by the subsystems form an ABI between pre-kexec +and post-kexec kernels. This ABI is defined by header files in +``include/linux/kho/abi`` directory. + +.. toctree:: + :maxdepth: 1 + + abi.rst + .. 
_kho_scratch: Scratch Regions diff --git a/MAINTAINERS b/MAINTAINERS index 4dcbcb5c14f0..9d724a7ade71 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13968,6 +13968,7 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: include/linux/kho/ +F: include/linux/kho/abi/ F: kernel/liveupdate/kexec_handover* F: lib/test_kho.c F: tools/testing/selftests/kho/ diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h new file mode 100644 index 000000000000..af9fa8c134c7 --- /dev/null +++ b/include/linux/kho/abi/kexec_handover.h @@ -0,0 +1,85 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +/* + * Copyright (C) 2023 Alexander Graf + * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport + * Copyright (C) 2025 Google LLC, Changyuan Lyu + * Copyright (C) 2025 Google LLC, Jason Miu + */ + +#ifndef _LINUX_KHO_ABI_KEXEC_HANDOVER_H +#define _LINUX_KHO_ABI_KEXEC_HANDOVER_H + +/** + * DOC: Kexec Handover ABI + * + * Kexec Handover uses the ABI defined below for passing preserved data from + * one kernel to the next. + * The ABI uses Flattened Device Tree (FDT) format. The first kernel creates an + * FDT which is then passed to the next kernel during a kexec handover. + * + * This interface is a contract. Any modification to the FDT structure, node + * properties, compatible string, or the layout of the data structures + * referenced here constitutes a breaking change. Such changes require + * incrementing the version number in KHO_FDT_COMPATIBLE to prevent a new kernel + * from misinterpreting data from an older kernel. Changes are allowed provided + * the compatibility version is incremented. However, backward/forward + * compatibility is only guaranteed for kernels supporting the same ABI version. + * + * FDT Structure Overview: + * The FDT serves as a central registry for physical + * addresses of preserved data structures and sub-FDTs. 
The first kernel + * populates this FDT with references to memory regions and other FDTs that + * need to persist across the kexec transition. The subsequent kernel then + * parses this FDT to locate and restore the preserved data.:: + * + * / { + * compatible = "kho-v1"; + * + * preserved-memory-map = <0x...>; + * + * <subnode-name> { + * fdt = <0x...>; + * }; + * + * <subnode-name> { + * fdt = <0x...>; + * }; + * ... ... + * <subnode-name> { + * fdt = <0x...>; + * }; + * }; + * + * Root KHO Node (/): + * - compatible: "kho-v1" + * + * Identifies the overall KHO ABI version. + * + * - preserved-memory-map: u64 + * + * Physical memory address pointing to the root of the + * preserved memory map data structure. + * + * Subnodes (<subnode-name>): + * Subnodes can also be added to the root node to + * describe other preserved data blobs. The <subnode-name> + * is provided by the subsystem that uses KHO for preserving its + * data. + * + * - fdt: u64 + * + * Physical address pointing to a subnode FDT blob that is also + * being preserved. + */ + +/* The compatible string for the KHO FDT root node. */ +#define KHO_FDT_COMPATIBLE "kho-v1" + +/* The FDT property for the preserved memory map. */ +#define KHO_FDT_MEMORY_MAP_PROP_NAME "preserved-memory-map" + +/* The FDT property for sub-FDTs. 
*/ +#define KHO_FDT_SUB_TREE_PROP_NAME "fdt" + +#endif /* _LINUX_KHO_ABI_KEXEC_HANDOVER_H */ diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index d4482b6e3cae..8f57d6e040af 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -33,10 +34,7 @@ #include "../kexec_internal.h" #include "kexec_handover_internal.h" -#define KHO_FDT_COMPATIBLE "kho-v1" -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map" -#define PROP_SUB_FDT "fdt" - +/* The magic token for preserved pages */ #define KHO_PAGE_MAGIC 0x4b484f50U /* ASCII for 'KHOP' */ /* @@ -378,7 +376,7 @@ static void kho_update_memory_map(struct khoser_mem_chunk *first_chunk) void *ptr; u64 phys; - ptr = fdt_getprop_w(kho_out.fdt, 0, PROP_PRESERVED_MEMORY_MAP, NULL); + ptr = fdt_getprop_w(kho_out.fdt, 0, KHO_FDT_MEMORY_MAP_PROP_NAME, NULL); /* Check and discard previous memory map */ phys = get_unaligned((u64 *)ptr); @@ -466,7 +464,7 @@ static phys_addr_t __init kho_get_mem_map_phys(const void *fdt) const void *mem_ptr; int len; - mem_ptr = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len); + mem_ptr = fdt_getprop(fdt, 0, KHO_FDT_MEMORY_MAP_PROP_NAME, &len); if (!mem_ptr || len != sizeof(u64)) { pr_err("failed to get preserved memory bitmaps\n"); return 0; @@ -727,7 +725,8 @@ int kho_add_subtree(const char *name, void *fdt) goto out_pack; } - err = fdt_setprop(root_fdt, off, PROP_SUB_FDT, &phys, sizeof(phys)); + err = fdt_setprop(root_fdt, off, KHO_FDT_SUB_TREE_PROP_NAME, + &phys, sizeof(phys)); if (err < 0) goto out_pack; @@ -758,7 +757,7 @@ void kho_remove_subtree(void *fdt) const u64 *val; int len; - val = fdt_getprop(root_fdt, off, PROP_SUB_FDT, &len); + val = fdt_getprop(root_fdt, off, KHO_FDT_SUB_TREE_PROP_NAME, &len); if (!val || len != sizeof(phys_addr_t)) continue; @@ -1305,7 +1304,7 @@ int kho_retrieve_subtree(const char *name, phys_addr_t *phys) if 
(offset < 0) return -ENOENT; - val = fdt_getprop(fdt, offset, PROP_SUB_FDT, &len); + val = fdt_getprop(fdt, offset, KHO_FDT_SUB_TREE_PROP_NAME, &len); if (!val || len != sizeof(*val)) return -EINVAL; @@ -1325,7 +1324,7 @@ static __init int kho_out_fdt_setup(void) err |= fdt_finish_reservemap(root); err |= fdt_begin_node(root, ""); err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE); - err |= fdt_property(root, PROP_PRESERVED_MEMORY_MAP, &empty_mem_map, + err |= fdt_property(root, KHO_FDT_MEMORY_MAP_PROP_NAME, &empty_mem_map, sizeof(empty_mem_map)); err |= fdt_end_node(root); err |= fdt_finish(root); -- cgit v1.2.3 From ac2d8102c4b88713a8fa371d5d802fcff131d6ac Mon Sep 17 00:00:00 2001 From: Jason Miu Date: Mon, 5 Jan 2026 18:58:38 +0200 Subject: kho: relocate vmalloc preservation structure to KHO ABI header The `struct kho_vmalloc` defines the in-memory layout for preserving vmalloc regions across kexec. This layout is a contract between kernels and part of the KHO ABI. To reflect this relationship, the related structs and helper macros are relocated to the ABI header, `include/linux/kho/abi/kexec_handover.h`. This move places the structure's definition under the protection of the KHO_FDT_COMPATIBLE version string. The structure and its components are now also documented within the ABI header to describe the contract and prevent ABI breaks. 
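A rough user-space sketch of the chunk layout that becomes part of the ABI may help when reasoning about it. It is an illustration under stated assumptions, not kernel code: the page size is assumed to be 4096 bytes, uint64_t stands in for u64, the serializable pointer is written out as a plain union, and every sketch_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel uses the architecture's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

struct sketch_vmalloc_chunk;

/*
 * Hypothetical stand-in for a header built with DECLARE_KHOSER_PTR():
 * the link to the next chunk is kept as a 64-bit physical address.
 */
struct sketch_vmalloc_hdr {
	union {
		uint64_t phys;
		struct sketch_vmalloc_chunk *ptr;
	} next;
};

/* Number of 64-bit address slots that fit in one page after the header. */
#define SKETCH_VMALLOC_SIZE \
	((SKETCH_PAGE_SIZE - sizeof(struct sketch_vmalloc_hdr)) / sizeof(uint64_t))

/* One chunk: the header followed by an array of physical page addresses. */
struct sketch_vmalloc_chunk {
	struct sketch_vmalloc_hdr hdr;
	uint64_t phys[SKETCH_VMALLOC_SIZE];
};

/* The layout only works as a contract if a chunk fills exactly one page. */
_Static_assert(sizeof(struct sketch_vmalloc_chunk) == SKETCH_PAGE_SIZE,
	       "chunk must occupy exactly one page");

/* Helper so the slot count can be inspected at run time. */
static size_t sketch_entries_per_chunk(void)
{
	return SKETCH_VMALLOC_SIZE;
}
```

Under these assumptions the header takes 8 bytes, leaving room for 511 address slots per chunk. Fixing the element type at 64 bits keeps that count the same regardless of how wide a physical address is on the kernel that wrote the data.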
[rppt@kernel.org: update comment, per Pratyush] Link: https://lkml.kernel.org/r/aW_Mqp6HcqLwQImS@kernel.org Link: https://lkml.kernel.org/r/20260105165839.285270-6-rppt@kernel.org Signed-off-by: Jason Miu Co-developed-by: Mike Rapoport (Microsoft) Signed-off-by: Mike Rapoport (Microsoft) Cc: Alexander Graf Cc: Jonathan Corbet Cc: Pasha Tatashin Cc: Pratyush Yadav Signed-off-by: Andrew Morton --- Documentation/core-api/kho/abi.rst | 6 +++ include/linux/kexec_handover.h | 27 +----------- include/linux/kho/abi/kexec_handover.h | 78 ++++++++++++++++++++++++++++++++++ include/linux/kho/abi/memfd.h | 2 +- kernel/liveupdate/kexec_handover.c | 15 ------- lib/test_kho.c | 1 + 6 files changed, 88 insertions(+), 41 deletions(-) (limited to 'include/linux') diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index a1ee0f481727..1d9916adee23 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -10,6 +10,12 @@ Core Kexec Handover ABI .. kernel-doc:: include/linux/kho/abi/kexec_handover.h :doc: Kexec Handover ABI +vmalloc preservation ABI +======================== + +.. kernel-doc:: include/linux/kho/abi/kexec_handover.h + :doc: Kexec Handover ABI for vmalloc Preservation + See Also ======== diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 5f7b9de97e8d..a56ff3ffaf17 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -11,34 +11,11 @@ struct kho_scratch { phys_addr_t size; }; +struct kho_vmalloc; + struct folio; struct page; -#define DECLARE_KHOSER_PTR(name, type) \ - union { \ - phys_addr_t phys; \ - type ptr; \ - } name -#define KHOSER_STORE_PTR(dest, val) \ - ({ \ - typeof(val) v = val; \ - typecheck(typeof((dest).ptr), v); \ - (dest).phys = virt_to_phys(v); \ - }) -#define KHOSER_LOAD_PTR(src) \ - ({ \ - typeof(src) s = src; \ - (typeof((s).ptr))((s).phys ? 
phys_to_virt((s).phys) : NULL); \ - }) - -struct kho_vmalloc_chunk; -struct kho_vmalloc { - DECLARE_KHOSER_PTR(first, struct kho_vmalloc_chunk *); - unsigned int total_pages; - unsigned short flags; - unsigned short order; -}; - #ifdef CONFIG_KEXEC_HANDOVER bool kho_is_enabled(void); bool is_kho_boot(void); diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h index af9fa8c134c7..2201a0d2c159 100644 --- a/include/linux/kho/abi/kexec_handover.h +++ b/include/linux/kho/abi/kexec_handover.h @@ -10,6 +10,8 @@ #ifndef _LINUX_KHO_ABI_KEXEC_HANDOVER_H #define _LINUX_KHO_ABI_KEXEC_HANDOVER_H +#include + /** * DOC: Kexec Handover ABI * @@ -82,4 +84,80 @@ /* The FDT property for sub-FDTs. */ #define KHO_FDT_SUB_TREE_PROP_NAME "fdt" +/** + * DOC: Kexec Handover ABI for vmalloc Preservation + * + * The Kexec Handover ABI for preserving vmalloc'ed memory is defined by + * a set of structures and helper macros. The layout of these structures is a + * stable contract between kernels and is versioned by the KHO_FDT_COMPATIBLE + * string. + * + * The preservation is managed through a main descriptor &struct kho_vmalloc, + * which points to a linked list of &struct kho_vmalloc_chunk structures. These + * chunks contain the physical addresses of the preserved pages, allowing the + * next kernel to reconstruct the vmalloc area with the same content and layout. + * Helper macros are also defined for storing and loading pointers within + * these structures. + */ + +/* Helper macro to define a union for a serializable pointer. */ +#define DECLARE_KHOSER_PTR(name, type) \ + union { \ + u64 phys; \ + type ptr; \ + } name + +/* Stores the physical address of a serializable pointer. */ +#define KHOSER_STORE_PTR(dest, val) \ + ({ \ + typeof(val) v = val; \ + typecheck(typeof((dest).ptr), v); \ + (dest).phys = virt_to_phys(v); \ + }) + +/* Loads the stored physical address back to a pointer. 
*/ +#define KHOSER_LOAD_PTR(src) \ + ({ \ + typeof(src) s = src; \ + (typeof((s).ptr))((s).phys ? phys_to_virt((s).phys) : NULL); \ + }) + +/* + * This header is embedded at the beginning of each `kho_vmalloc_chunk` + * and contains a pointer to the next chunk in the linked list, + * stored as a physical address for handover. + */ +struct kho_vmalloc_hdr { + DECLARE_KHOSER_PTR(next, struct kho_vmalloc_chunk *); +}; + +#define KHO_VMALLOC_SIZE \ + ((PAGE_SIZE - sizeof(struct kho_vmalloc_hdr)) / \ + sizeof(u64)) + +/* + * Each chunk is a single page and is part of a linked list that describes + * a preserved vmalloc area. It contains the header with the link to the next + * chunk and a zero terminated array of physical addresses of the pages that + * make up the preserved vmalloc area. + */ +struct kho_vmalloc_chunk { + struct kho_vmalloc_hdr hdr; + u64 phys[KHO_VMALLOC_SIZE]; +}; + +static_assert(sizeof(struct kho_vmalloc_chunk) == PAGE_SIZE); + +/* + * Describes a preserved vmalloc memory area, including the + * total number of pages, allocation flags, page order, and a pointer to the + * first chunk of physical page addresses. 
+ */ +struct kho_vmalloc { + DECLARE_KHOSER_PTR(first, struct kho_vmalloc_chunk *); + unsigned int total_pages; + unsigned short flags; + unsigned short order; +}; + #endif /* _LINUX_KHO_ABI_KEXEC_HANDOVER_H */ diff --git a/include/linux/kho/abi/memfd.h b/include/linux/kho/abi/memfd.h index c211c31334a3..68cb6303b846 100644 --- a/include/linux/kho/abi/memfd.h +++ b/include/linux/kho/abi/memfd.h @@ -12,7 +12,7 @@ #define _LINUX_KHO_ABI_MEMFD_H #include -#include +#include /** * DOC: memfd Live Update ABI diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 8f57d6e040af..66fcdda0ebdc 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -876,21 +876,6 @@ void kho_unpreserve_pages(struct page *page, unsigned int nr_pages) } EXPORT_SYMBOL_GPL(kho_unpreserve_pages); -struct kho_vmalloc_hdr { - DECLARE_KHOSER_PTR(next, struct kho_vmalloc_chunk *); -}; - -#define KHO_VMALLOC_SIZE \ - ((PAGE_SIZE - sizeof(struct kho_vmalloc_hdr)) / \ - sizeof(phys_addr_t)) - -struct kho_vmalloc_chunk { - struct kho_vmalloc_hdr hdr; - phys_addr_t phys[KHO_VMALLOC_SIZE]; -}; - -static_assert(sizeof(struct kho_vmalloc_chunk) == PAGE_SIZE); - /* vmalloc flags KHO supports */ #define KHO_VMALLOC_SUPPORTED_FLAGS (VM_ALLOC | VM_ALLOW_HUGE_VMAP) diff --git a/lib/test_kho.c b/lib/test_kho.c index 47de56280795..3431daca6968 100644 --- a/lib/test_kho.c +++ b/lib/test_kho.c @@ -19,6 +19,7 @@ #include #include #include +#include #include -- cgit v1.2.3 From dd1e79ef6ca188678ece81a77d0076ae7403116c Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Mon, 5 Jan 2026 18:58:39 +0200 Subject: kho/abi: add memblock ABI header Introduce KHO ABI header describing preservation ABI for memblock's reserve_mem regions and link the relevant documentation to KHO docs. 
[lukas.bulwahn@redhat.com: MAINTAINERS: adjust file entry in MEMBLOCK AND MEMORY MANAGEMENT INITIALIZATION] Link: https://lkml.kernel.org/r/20260107090438.22901-1-lukas.bulwahn@redhat.com [rppt@kernel.org: update reserved_mem node description, per Pratyush] Link: https://lkml.kernel.org/r/aW_M-HYZzx5SkbnZ@kernel.org Link: https://lkml.kernel.org/r/20260105165839.285270-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: Jason Miu Cc: Jonathan Corbet Cc: Pasha Tatashin Signed-off-by: Andrew Morton --- Documentation/core-api/kho/abi.rst | 6 ++ .../core-api/kho/bindings/memblock/memblock.yaml | 39 ------------ .../kho/bindings/memblock/reserve-mem.yaml | 40 ------------ MAINTAINERS | 2 +- include/linux/kho/abi/memblock.h | 73 ++++++++++++++++++++++ mm/memblock.c | 4 +- 6 files changed, 81 insertions(+), 83 deletions(-) delete mode 100644 Documentation/core-api/kho/bindings/memblock/memblock.yaml delete mode 100644 Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml create mode 100644 include/linux/kho/abi/memblock.h (limited to 'include/linux') diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index 1d9916adee23..2e63be3486cf 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -16,6 +16,12 @@ vmalloc preservation ABI .. kernel-doc:: include/linux/kho/abi/kexec_handover.h :doc: Kexec Handover ABI for vmalloc Preservation +memblock preservation ABI +========================= + +.. 
kernel-doc:: include/linux/kho/abi/memblock.h + :doc: memblock kexec handover ABI + See Also ======== diff --git a/Documentation/core-api/kho/bindings/memblock/memblock.yaml b/Documentation/core-api/kho/bindings/memblock/memblock.yaml deleted file mode 100644 index d388c28eb91d..000000000000 --- a/Documentation/core-api/kho/bindings/memblock/memblock.yaml +++ /dev/null @@ -1,39 +0,0 @@ -# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) -%YAML 1.2 ---- -title: Memblock reserved memory - -maintainers: - - Mike Rapoport - -description: | - Memblock can serialize its current memory reservations created with - reserve_mem command line option across kexec through KHO. - The post-KHO kernel can then consume these reservations and they are - guaranteed to have the same physical address. - -properties: - compatible: - enum: - - reserve-mem-v1 - -patternProperties: - "$[0-9a-f_]+^": - $ref: reserve-mem.yaml# - description: reserved memory regions - -required: - - compatible - -additionalProperties: false - -examples: - - | - memblock { - compatible = "memblock-v1"; - n1 { - compatible = "reserve-mem-v1"; - start = <0xc06b 0x4000000>; - size = <0x04 0x00>; - }; - }; diff --git a/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml deleted file mode 100644 index 10282d3d1bcd..000000000000 --- a/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml +++ /dev/null @@ -1,40 +0,0 @@ -# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) -%YAML 1.2 ---- -title: Memblock reserved memory regions - -maintainers: - - Mike Rapoport - -description: | - Memblock can serialize its current memory reservations created with - reserve_mem command line option across kexec through KHO. - This object describes each such region. - -properties: - compatible: - enum: - - reserve-mem-v1 - - start: - description: | - physical address (u64) of the reserved memory region. 
- - size: - description: | - size (u64) of the reserved memory region. - -required: - - compatible - - start - - size - -additionalProperties: false - -examples: - - | - n1 { - compatible = "reserve-mem-v1"; - start = <0xc06b 0x4000000>; - size = <0x04 0x00>; - }; diff --git a/MAINTAINERS b/MAINTAINERS index 9d724a7ade71..92b377cd131b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16396,7 +16396,7 @@ S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock.git for-next T: git git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock.git fixes F: Documentation/core-api/boot-time-mm.rst -F: Documentation/core-api/kho/bindings/memblock/* +F: include/linux/kho/abi/memblock.h F: include/linux/memblock.h F: mm/bootmem_info.c F: mm/memblock.c diff --git a/include/linux/kho/abi/memblock.h b/include/linux/kho/abi/memblock.h new file mode 100644 index 000000000000..27b042f470e1 --- /dev/null +++ b/include/linux/kho/abi/memblock.h @@ -0,0 +1,73 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _LINUX_KHO_ABI_MEMBLOCK_H +#define _LINUX_KHO_ABI_MEMBLOCK_H + +/** + * DOC: memblock kexec handover ABI + * + * Memblock can serialize its current memory reservations created with + * reserve_mem command line option across kexec through KHO. + * The post-KHO kernel can then consume these reservations and they are + * guaranteed to have the same physical address. + * + * The state is serialized using Flattened Device Tree (FDT) format. Any + * modification to the FDT structure, node properties, or the compatible + * strings constitutes a breaking change. Such changes require incrementing the + * version number in the relevant `_COMPATIBLE` string to prevent a new kernel + * from misinterpreting data from an old kernel. + * + * Changes are allowed provided the compatibility version is incremented. + * However, backward/forward compatibility is only guaranteed for kernels + * supporting the same ABI version. 
+ * + * FDT Structure Overview: + * The entire memblock state is encapsulated within a single KHO entry named + * "memblock". + * This entry contains an FDT with the following layout: + * + * .. code-block:: none + * + * / { + * compatible = "memblock-v1"; + * + * n1 { + * compatible = "reserve-mem-v1"; + * start = <0xc06b 0x4000000>; + * size = <0x04 0x00>; + * }; + * }; + * + * Main memblock node (/): + * + * - compatible: "memblock-v1" + + * Identifies the overall memblock ABI version. + * + * reserved_mem node: + * These nodes describe all reserve_mem regions. The node name is the name + * defined by the user for a reserve_mem region. + * + * - compatible: "reserve-mem-v1" + * + * Identifies the ABI version of reserve_mem descriptions + * + * - start: u64 + * + * Physical address of the reserved memory region. + * + * - size: u64 + * + * size in bytes of the reserved memory region. + */ + +/* Top level memblock FDT node name. */ +#define MEMBLOCK_KHO_FDT "memblock" + +/* The compatible string for the memblock FDT root node. */ +#define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1" + +/* The compatible string for the reserve_mem FDT nodes. */ +#define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1" + +#endif /* _LINUX_KHO_ABI_MEMBLOCK_H */ diff --git a/mm/memblock.c b/mm/memblock.c index 905d06b16348..6cff515d82f4 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -21,6 +21,7 @@ #ifdef CONFIG_KEXEC_HANDOVER #include #include +#include #endif /* CONFIG_KEXEC_HANDOVER */ #include @@ -2442,9 +2443,6 @@ int reserve_mem_release_by_name(const char *name) } #ifdef CONFIG_KEXEC_HANDOVER -#define MEMBLOCK_KHO_FDT "memblock" -#define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1" -#define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1" static int __init reserved_mem_preserve(void) { -- cgit v1.2.3 From 4cc67b048459bebb7a60b693044ec83fb853eba1 Mon Sep 17 00:00:00 2001 From: "Maciej W. 
Rozycki" Date: Sun, 11 Jan 2026 21:21:57 +0000 Subject: linux/log2.h: reduce instruction count for is_power_of_2() Follow an observation that (n ^ (n - 1)) will only ever retain the most significant bit set in the word operated on if that is the only bit set in the first place, and use it to determine whether a number is a whole power of 2, avoiding the need for an explicit check for nonzero. This reduces the sequence produced to 3 instructions only across Alpha, MIPS, and RISC-V targets, down from 4, 5, and 4 respectively, removing a branch in the two latter cases. And it's 5 instructions on POWER and x86-64 vs 8 and 9 respectively. There are no branches now emitted here for targets that have a suitable conditional set operation, although an inline expansion will often end with one, depending on what code a call to this function is used in. Credit goes to GCC authors for coming up with this optimisation used as the fallback for (__builtin_popcountl(n) == 1), equivalent to this code, for targets where the hardware population count operation is considered expensive. Link: https://lkml.kernel.org/r/alpine.DEB.2.21.2601111836250.30566@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki Cc: Jens Axboe Cc: John Garry Cc: "Martin K. 
Petersen" Cc: Su Hui Signed-off-by: Andrew Morton --- include/linux/log2.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/log2.h b/include/linux/log2.h index 2eac3fc9303d..e17ceb32e0c9 100644 --- a/include/linux/log2.h +++ b/include/linux/log2.h @@ -44,7 +44,7 @@ int __ilog2_u64(u64 n) static __always_inline __attribute__((const)) bool is_power_of_2(unsigned long n) { - return (n != 0 && ((n & (n - 1)) == 0)); + return n - 1 < (n ^ (n - 1)); } /** -- cgit v1.2.3 From e428b013d9dff30f7a65509e33047ba975cce8ba Mon Sep 17 00:00:00 2001 From: Finn Thain Date: Tue, 13 Jan 2026 16:22:28 +1100 Subject: atomic: specify alignment for atomic_t and atomic64_t Some recent commits incorrectly assumed 4-byte alignment of locks. That assumption fails on Linux/m68k (and, interestingly, would have failed on Linux/cris also). The jump label implementation makes a similar alignment assumption. The expectation that atomic_t and atomic64_t variables will be naturally aligned seems reasonable, as indeed they are on 64-bit architectures. But atomic64_t isn't naturally aligned on csky, m68k, microblaze, nios2, openrisc and sh. Neither atomic_t nor atomic64_t are naturally aligned on m68k. This patch brings a little uniformity by specifying natural alignment for atomic types. One benefit is that atomic64_t variables do not get split across a page boundary. The cost is that some structs grow which leads to cache misses and wasted memory. See also, commit bbf2a330d92c ("x86: atomic64: The atomic64_t data type should be 8 bytes aligned on 32-bit too"). 
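As a hedged illustration of what "natural alignment" means for these types, the following user-space sketch spells the alignment request with the GCC/Clang attribute syntax. The sketch_ names are hypothetical and int64_t stands in for s64; it is not the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space analogue of atomic_t with natural alignment. */
typedef struct {
	int __attribute__((aligned(sizeof(int)))) counter;
} sketch_atomic_t;

/* Hypothetical user-space analogue of atomic64_t with natural alignment. */
typedef struct {
	int64_t __attribute__((aligned(sizeof(int64_t)))) counter;
} sketch_atomic64_t;

/*
 * The member attribute raises the alignment of the whole struct, so these
 * hold even on ABIs where a bare 64-bit integer is only 2- or 4-byte aligned.
 */
_Static_assert(_Alignof(sketch_atomic_t) == sizeof(int),
	       "sketch_atomic_t is naturally aligned");
_Static_assert(_Alignof(sketch_atomic64_t) == sizeof(int64_t),
	       "sketch_atomic64_t is naturally aligned");

/*
 * Returns nonzero when addr is a multiple of size.
 * size must be a power of two for the mask test to be valid.
 */
static int sketch_is_naturally_aligned(const void *addr, size_t size)
{
	return ((uintptr_t)addr & (size - 1)) == 0;
}
```

An object aligned to its own size, where that size is a power of two that divides the page size, cannot straddle a page boundary. That is the benefit claimed above for atomic64_t.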
Link: https://lkml.kernel.org/r/a76bc24a4e7c1d8112d7d5fa8d14e4b694a0e90c.1768281748.git.fthain@linux-m68k.org Link: https://lore.kernel.org/lkml/CAFr9PX=MYUDGJS2kAvPMkkfvH+0-SwQB_kxE4ea0J_wZ_pk=7w@mail.gmail.com Link: https://lore.kernel.org/lkml/CAMuHMdW7Ab13DdGs2acMQcix5ObJK0O2dG_Fxzr8_g58Rc1_0g@mail.gmail.com/ Signed-off-by: Finn Thain Acked-by: Guo Ren Reviewed-by: Arnd Bergmann Cc: Guo Ren Cc: Geert Uytterhoeven Cc: Dinh Nguyen Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: Yoshinori Sato Cc: Rich Felker Cc: John Paul Adrian Glaubitz Cc: Alexei Starovoitov Cc: Andrii Nakryiko Cc: Ard Biesheuvel Cc: Boqun Feng Cc: "Borislav Petkov (AMD)" Cc: Daniel Borkman Cc: Dave Hansen Cc: Eduard Zingerman Cc: Gary Guo Cc: Hao Luo Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jiri Olsa Cc: John Fastabend Cc: KP Singh Cc: Marc Rutland Cc: Martin KaFai Lau Cc: Peter Zijlstra Cc: Sasha Levin (Microsoft) Cc: Song Liu Cc: Stanislav Fomichev Cc: Thomas Gleixner Cc: Will Deacon Cc: Yonghong Song Signed-off-by: Andrew Morton --- include/asm-generic/atomic64.h | 2 +- include/linux/types.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/asm-generic/atomic64.h b/include/asm-generic/atomic64.h index 100d24b02e52..f22ccfc0df98 100644 --- a/include/asm-generic/atomic64.h +++ b/include/asm-generic/atomic64.h @@ -10,7 +10,7 @@ #include typedef struct { - s64 counter; + s64 __aligned(sizeof(s64)) counter; } atomic64_t; #define ATOMIC64_INIT(i) { (i) } diff --git a/include/linux/types.h b/include/linux/types.h index 0cbb684eec5c..f69be881369f 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -179,7 +179,7 @@ typedef phys_addr_t resource_size_t; typedef unsigned long irq_hw_number_t; typedef struct { - int counter; + int __aligned(sizeof(int)) counter; } atomic_t; #define ATOMIC_INIT(i) { (i) } -- cgit v1.2.3 From 80047d84eed25e9c92cfb9169980a0dfec110246 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 
13 Jan 2026 16:22:28 +1100 Subject: atomic: add alignment check to instrumented atomic operations Add a Kconfig option for debug builds which logs a warning when an instrumented atomic operation takes place that's misaligned. Some platforms don't trap for this. [fthain@linux-m68k.org: added __DISABLE_EXPORTS conditional and refactored as helper function] Link: https://lkml.kernel.org/r/51ebf844e006ca0de408f5d3a831e7b39d7fc31c.1768281748.git.fthain@linux-m68k.org Link: https://lore.kernel.org/lkml/20250901093600.GF4067720@noisy.programming.kicks-ass.net/ Link: https://lore.kernel.org/linux-next/df9fbd22-a648-ada4-fee0-68fe4325ff82@linux-m68k.org/ Signed-off-by: Finn Thain Signed-off-by: Peter Zijlstra (Intel) Suggested-by: Geert Uytterhoeven Cc: Sasha Levin Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: Ard Biesheuvel Cc: "H. Peter Anvin" Cc: Alexei Starovoitov Cc: Andrii Nakryiko Cc: Arnd Bergmann Cc: Boqun Feng Cc: Daniel Borkman Cc: Dinh Nguyen Cc: Eduard Zingerman Cc: Gary Guo Cc: Guo Ren Cc: Hao Luo Cc: Jiri Olsa Cc: John Fastabend Cc: John Paul Adrian Glaubitz Cc: Jonas Bonn Cc: KP Singh Cc: Marc Rutland Cc: Martin KaFai Lau Cc: Rich Felker Cc: Song Liu Cc: Stafford Horne Cc: Stanislav Fomichev Cc: Stefan Kristiansson Cc: Will Deacon Cc: Yonghong Song Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- include/linux/instrumented.h | 11 +++++++++++ lib/Kconfig.debug | 10 ++++++++++ 2 files changed, 21 insertions(+) (limited to 'include/linux') diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h index 711a1f0d1a73..e34b6a557e0a 100644 --- a/include/linux/instrumented.h +++ b/include/linux/instrumented.h @@ -7,6 +7,7 @@ #ifndef _LINUX_INSTRUMENTED_H #define _LINUX_INSTRUMENTED_H +#include #include #include #include @@ -55,6 +56,13 @@ static __always_inline void instrument_read_write(const volatile void *v, size_t kcsan_check_read_write(v, size); } +static __always_inline void 
instrument_atomic_check_alignment(const volatile void *v, size_t size) +{ +#ifndef __DISABLE_EXPORTS + WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ATOMIC) && ((unsigned long)v & (size - 1))); +#endif +} + /** * instrument_atomic_read - instrument atomic read access * @v: address of access @@ -67,6 +75,7 @@ static __always_inline void instrument_atomic_read(const volatile void *v, size_ { kasan_check_read(v, size); kcsan_check_atomic_read(v, size); + instrument_atomic_check_alignment(v, size); } /** @@ -81,6 +90,7 @@ static __always_inline void instrument_atomic_write(const volatile void *v, size { kasan_check_write(v, size); kcsan_check_atomic_write(v, size); + instrument_atomic_check_alignment(v, size); } /** @@ -95,6 +105,7 @@ static __always_inline void instrument_atomic_read_write(const volatile void *v, { kasan_check_write(v, size); kcsan_check_atomic_read_write(v, size); + instrument_atomic_check_alignment(v, size); } /** diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 17d759a04021..9eb685d1ec44 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1360,6 +1360,16 @@ config DEBUG_PREEMPT depending on workload as it triggers debugging routines for each this_cpu operation. It should only be used for debugging purposes. +config DEBUG_ATOMIC + bool "Debug atomic variables" + depends on DEBUG_KERNEL + help + If you say Y here then the kernel will add a runtime alignment check + to atomic accesses. Useful for architectures that do not have trap on + mis-aligned access. + + This option has potentially significant overhead. + menu "Lock Debugging (spinlocks, mutexes, etc...)" config LOCK_DEBUGGING_SUPPORT -- cgit v1.2.3 From 9a229ae249e0a24276901ad6807f31b32124f5c5 Mon Sep 17 00:00:00 2001 From: Finn Thain Date: Tue, 13 Jan 2026 16:22:28 +1100 Subject: atomic: add option for weaker alignment check Add a new Kconfig symbol to make CONFIG_DEBUG_ATOMIC more useful on those architectures which do not align dynamic allocations to 8-byte boundaries. 
Without this, CONFIG_DEBUG_ATOMIC produces excessive WARN splats. Link: https://lkml.kernel.org/r/6d25a12934fe9199332f4d65d17c17de450139a8.1768281748.git.fthain@linux-m68k.org Signed-off-by: Finn Thain Cc: Alexei Starovoitov Cc: Andrii Nakryiko Cc: Ard Biesheuvel Cc: Arnd Bergmann Cc: Boqun Feng Cc: "Borislav Petkov (AMD)" Cc: Daniel Borkman Cc: Dave Hansen Cc: Dinh Nguyen Cc: Eduard Zingerman Cc: Gary Guo Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Hao Luo Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jiri Olsa Cc: John Fastabend Cc: John Paul Adrian Glaubitz Cc: Jonas Bonn Cc: KP Singh Cc: Marc Rutland Cc: Martin KaFai Lau Cc: Peter Zijlstra Cc: Rich Felker Cc: Sasha Levin (Microsoft) Cc: Song Liu Cc: Stafford Horne Cc: Stanislav Fomichev Cc: Stefan Kristiansson Cc: Thomas Gleixner Cc: Will Deacon Cc: Yonghong Song Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- include/linux/instrumented.h | 8 +++++++- lib/Kconfig.debug | 8 ++++++++ 2 files changed, 15 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h index e34b6a557e0a..a1b4cf81adc2 100644 --- a/include/linux/instrumented.h +++ b/include/linux/instrumented.h @@ -59,7 +59,13 @@ static __always_inline void instrument_read_write(const volatile void *v, size_t static __always_inline void instrument_atomic_check_alignment(const volatile void *v, size_t size) { #ifndef __DISABLE_EXPORTS - WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ATOMIC) && ((unsigned long)v & (size - 1))); + if (IS_ENABLED(CONFIG_DEBUG_ATOMIC)) { + unsigned int mask = size - 1; + + if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_LARGEST_ALIGN)) + mask &= sizeof(struct { long x; } __aligned_largest) - 1; + WARN_ON_ONCE((unsigned long)v & mask); + } #endif } diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 9eb685d1ec44..7eed3b197ca9 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1370,6 +1370,14 @@ config DEBUG_ATOMIC This option has potentially significant overhead. 
+config DEBUG_ATOMIC_LARGEST_ALIGN + bool "Check alignment only up to __aligned_largest" + depends on DEBUG_ATOMIC + help + If you say Y here then the check for natural alignment of + atomic accesses will be constrained to the compiler's largest + alignment for scalar types. + menu "Lock Debugging (spinlocks, mutexes, etc...)" config LOCK_DEBUGGING_SUPPORT -- cgit v1.2.3 From 840fe43d371fc59ef2da6b6bb88a4d480eed9a38 Mon Sep 17 00:00:00 2001 From: Pratyush Yadav Date: Fri, 16 Jan 2026 11:22:14 +0000 Subject: kho: use unsigned long for nr_pages Patch series "kho: clean up page initialization logic", v2. This series simplifies the page initialization logic in kho_restore_page(). It was originally only a single patch [0], but on Pasha's suggestion, I added another patch to use unsigned long for nr_pages. Technically speaking, the patches aren't related and can be applied independently, but bundling them together since patch 2 relies on 1 and it is easier to manage them this way. This patch (of 2): With 4k pages, a 32-bit nr_pages can span up to 16 TiB. While it is a lot, there exist systems with terabytes of RAM. gup is also moving to using long for nr_pages. Use unsigned long and make KHO future-proof. 
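The 16 TiB figure follows from simple arithmetic, sketched here in user-space C. The sketch_ names are hypothetical and the 4 KiB page size is the assumption stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, i.e. a page shift of 12. */
#define SKETCH_PAGE_SHIFT 12

/* Bytes covered by nr_pages pages, computed in 64 bits to avoid overflow. */
static uint64_t sketch_pages_to_bytes(uint64_t nr_pages)
{
	return nr_pages << SKETCH_PAGE_SHIFT;
}

/* Number of distinct values a 32-bit counter can represent: 2^32. */
static uint64_t sketch_u32_range(void)
{
	return (uint64_t)UINT32_MAX + 1;
}
```

With these assumptions, 2^32 pages of 2^12 bytes each come to 2^44 bytes, which is 16 TiB. A preserved range larger than that cannot be described by a 32-bit page count.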
Link: https://lkml.kernel.org/r/20260116112217.915803-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20260116112217.915803-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav Suggested-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pasha Tatashin Cc: Alexander Graf Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/kexec_handover.h | 6 +++--- kernel/liveupdate/kexec_handover.c | 11 ++++++----- 2 files changed, 9 insertions(+), 8 deletions(-) (limited to 'include/linux') diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index a56ff3ffaf17..ac4129d1d741 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -22,15 +22,15 @@ bool is_kho_boot(void); int kho_preserve_folio(struct folio *folio); void kho_unpreserve_folio(struct folio *folio); -int kho_preserve_pages(struct page *page, unsigned int nr_pages); -void kho_unpreserve_pages(struct page *page, unsigned int nr_pages); +int kho_preserve_pages(struct page *page, unsigned long nr_pages); +void kho_unpreserve_pages(struct page *page, unsigned long nr_pages); int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); void *kho_alloc_preserve(size_t size); void kho_unpreserve_free(void *mem); void kho_restore_free(void *mem); struct folio *kho_restore_folio(phys_addr_t phys); -struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages); +struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages); void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); int kho_add_subtree(const char *name, void *fdt); void kho_remove_subtree(void *fdt); diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index e44fd7ceff2e..56cc1aad5c5c 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -219,7 +219,8 @@ static int __kho_preserve_order(struct 
kho_mem_track *track, unsigned long pfn,
 static struct page *kho_restore_page(phys_addr_t phys, bool is_folio)
 {
 	struct page *page = pfn_to_online_page(PHYS_PFN(phys));
-	unsigned int nr_pages, ref_cnt;
+	unsigned long nr_pages;
+	unsigned int ref_cnt;
 	union kho_page_info info;

 	if (!page)
@@ -246,7 +247,7 @@ static struct page *kho_restore_page(phys_addr_t phys, bool is_folio)
 	 * count of 1
 	 */
 	ref_cnt = is_folio ? 0 : 1;
-	for (unsigned int i = 1; i < nr_pages; i++)
+	for (unsigned long i = 1; i < nr_pages; i++)
 		set_page_count(page + i, ref_cnt);

 	if (is_folio && info.order)
@@ -288,7 +289,7 @@ EXPORT_SYMBOL_GPL(kho_restore_folio);
  *
  * Return: 0 on success, error code on failure
  */
-struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages)
+struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages)
 {
 	const unsigned long start_pfn = PHYS_PFN(phys);
 	const unsigned long end_pfn = start_pfn + nr_pages;
@@ -837,7 +838,7 @@ EXPORT_SYMBOL_GPL(kho_unpreserve_folio);
  *
  * Return: 0 on success, error code on failure
  */
-int kho_preserve_pages(struct page *page, unsigned int nr_pages)
+int kho_preserve_pages(struct page *page, unsigned long nr_pages)
 {
 	struct kho_mem_track *track = &kho_out.track;
 	const unsigned long start_pfn = page_to_pfn(page);
@@ -881,7 +882,7 @@ EXPORT_SYMBOL_GPL(kho_preserve_pages);
  * kho_preserve_pages() call. Unpreserving arbitrary sub-ranges of larger
  * preserved blocks is not supported.
  */
-void kho_unpreserve_pages(struct page *page, unsigned int nr_pages)
+void kho_unpreserve_pages(struct page *page, unsigned long nr_pages)
 {
 	struct kho_mem_track *track = &kho_out.track;
 	const unsigned long start_pfn = page_to_pfn(page);
-- 
cgit v1.2.3

From e8d899d301346a5591c9d1af06c3c9b3501cf84b Mon Sep 17 00:00:00 2001
From: Nathan Chancellor
Date: Fri, 16 Jan 2026 16:26:27 -0700
Subject: compiler-clang.h: require LLVM 19.1.0 or higher for __typeof_unqual__

When building the kernel using a version of LLVM between llvmorg-19-init
(the first commit of the LLVM 19 development cycle) and the change in
LLVM that actually added __typeof_unqual__ for all C modes [1], which
might happen during a bisect of LLVM, there is a build failure:

  In file included from arch/x86/kernel/asm-offsets.c:9:
  In file included from include/linux/crypto.h:15:
  In file included from include/linux/completion.h:12:
  In file included from include/linux/swait.h:7:
  In file included from include/linux/spinlock.h:56:
  In file included from include/linux/preempt.h:79:
  arch/x86/include/asm/preempt.h:61:2: error: call to undeclared function '__typeof_unqual__'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     61 |         raw_cpu_and_4(__preempt_count, ~PREEMPT_NEED_RESCHED);
        |         ^
  arch/x86/include/asm/percpu.h:478:36: note: expanded from macro 'raw_cpu_and_4'
    478 | #define raw_cpu_and_4(pcp, val)         percpu_binary_op(4, , "and", (pcp), val)
        |                                         ^
  arch/x86/include/asm/percpu.h:210:3: note: expanded from macro 'percpu_binary_op'
    210 |         TYPEOF_UNQUAL(_var) pto_tmp__;                                  \
        |         ^
  include/linux/compiler.h:248:29: note: expanded from macro 'TYPEOF_UNQUAL'
    248 | # define TYPEOF_UNQUAL(exp) __typeof_unqual__(exp)
        |                             ^

The current logic of CC_HAS_TYPEOF_UNQUAL just checks for a major
version of 19 but half of the 19 development cycle did not have support
for __typeof_unqual__.
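
The gap can be made concrete with a small sketch. The two predicates
below are hypothetical stand-ins that take the version as plain
integers; the kernel itself evaluates the check in the preprocessor on
__clang_major__ and __clang_minor__. The sketch assumes that LLVM 19
development snapshots report 19.0.x and that releases start at 19.1.0.

```c
#include <assert.h>
#include <stdbool.h>

/* Old logic: accept any compiler that reports major version 19 or newer. */
static bool typeof_unqual_old(int major, int minor)
{
	(void)minor;	/* the old check never looked at the minor version */
	return major >= 19;
}

/* Hardened logic: require a released LLVM 19, that is 19.1.0 or newer. */
static bool typeof_unqual_new(int major, int minor)
{
	assert(major >= 0 && minor >= 0);
	return major > 19 || (major == 19 && minor > 0);
}
```

For a 19.0 development snapshot the old predicate returns true while the
hardened one returns false; both agree again from 19.1 onwards.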

Harden the logic of CC_HAS_TYPEOF_UNQUAL to avoid this error by only
using __typeof_unqual__ with a released version of LLVM 19, which is
greater than or equal to 19.1.0 with LLVM's versioning scheme that
matches GCC's [2].

Link: https://github.com/llvm/llvm-project/commit/cc308f60d41744b5920ec2e2e5b25e1273c8704b [1]
Link: https://github.com/llvm/llvm-project/commit/4532617ae420056bf32f6403dde07fb99d276a49 [2]
Link: https://lkml.kernel.org/r/20260116-require-llvm-19-1-for-typeof_unqual-v1-1-3b9a4a4b212b@kernel.org
Fixes: ac053946f5c4 ("compiler.h: introduce TYPEOF_UNQUAL() macro")
Signed-off-by: Nathan Chancellor
Cc: Bill Wendling
Cc: Justin Stitt
Cc: Uros Bizjak
Cc:
Signed-off-by: Andrew Morton
---
 include/linux/compiler-clang.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'include/linux')

diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
index 7edf1a07b535..e1123dd28486 100644
--- a/include/linux/compiler-clang.h
+++ b/include/linux/compiler-clang.h
@@ -153,4 +153,4 @@
  * Bindgen uses LLVM even if our C compiler is GCC, so we cannot
  * rely on the auto-detected CONFIG_CC_HAS_TYPEOF_UNQUAL.
  */
-#define CC_HAS_TYPEOF_UNQUAL (__clang_major__ >= 19)
+#define CC_HAS_TYPEOF_UNQUAL (__clang_major__ > 19 || (__clang_major__ == 19 && __clang_minor__ > 0))
-- 
cgit v1.2.3

From f2e0abdc88ce68cdba0a66ccc05a3e96b688a2c7 Mon Sep 17 00:00:00 2001
From: Yury Norov
Date: Thu, 15 Jan 2026 23:25:04 -0500
Subject: kernel.h: drop STACK_MAGIC macro

Patch series "Unload linux/kernel.h", v5.

kernel.h hosts declarations that can be placed better. This series
decouples kernel.h from some explicit and implicit dependencies; it also
moves tracing functionality to a new independent header.

This patch (of 6):

The macro was introduced in 1994, v1.0.4, for stack protection. Since
then, people have found better ways to protect stacks, and now the macro
is only used by i915 selftests. Move it to a local header and drop it
from kernel.h.

Link: https://lkml.kernel.org/r/20260116042510.241009-1-ynorov@nvidia.com
Link: https://lkml.kernel.org/r/20260116042510.241009-2-ynorov@nvidia.com
Signed-off-by: Yury Norov
Reviewed-by: Andy Shevchenko
Acked-by: Randy Dunlap
Acked-by: Jani Nikula
Reviewed-by: Christophe Leroy (CS GROUP)
Reviewed-by: Aaron Tomlin
Reviewed-by: Andi Shyti
Reviewed-by: Joel Fernandes
Cc: Greg Kroah-Hartman
Cc: Petr Pavlu
Cc: Steven Rostedt (Google)
Signed-off-by: Andrew Morton
---
 drivers/gpu/drm/i915/gt/selftest_ring_submission.c | 1 +
 drivers/gpu/drm/i915/i915_selftest.h               | 2 ++
 include/linux/kernel.h                             | 2 --
 3 files changed, 3 insertions(+), 2 deletions(-)

(limited to 'include/linux')

diff --git a/drivers/gpu/drm/i915/gt/selftest_ring_submission.c b/drivers/gpu/drm/i915/gt/selftest_ring_submission.c
index 87ceb0f374b6..600333ae6c8c 100644
--- a/drivers/gpu/drm/i915/gt/selftest_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/selftest_ring_submission.c
@@ -3,6 +3,7 @@
  * Copyright © 2020 Intel Corporation
  */

+#include "i915_selftest.h"
 #include "intel_engine_pm.h"
 #include "selftests/igt_flush_test.h"

diff --git a/drivers/gpu/drm/i915/i915_selftest.h b/drivers/gpu/drm/i915/i915_selftest.h
index bdf3e22c0a34..72922028f4ba 100644
--- a/drivers/gpu/drm/i915/i915_selftest.h
+++ b/drivers/gpu/drm/i915/i915_selftest.h
@@ -26,6 +26,8 @@

 #include

+#define STACK_MAGIC	0xdeadbeef
+
 struct pci_dev;
 struct drm_i915_private;

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 35b8f2a5aca5..cefe733a0c10 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -39,8 +39,6 @@

 #include

-#define STACK_MAGIC	0xdeadbeef
-
 struct completion;
 struct user;

-- 
cgit v1.2.3

From 25b66674b1036c1eb3069bf62329a9c60850d782 Mon Sep 17 00:00:00 2001
From: Yury Norov
Date: Thu, 15 Jan 2026 23:25:05 -0500
Subject: moduleparam: include required headers explicitly

The following patch drops the moduleparam.h dependency on kernel.h. In
preparation for that, list all the required headers explicitly.

Link: https://lkml.kernel.org/r/20260116042510.241009-3-ynorov@nvidia.com
Signed-off-by: Yury Norov
Suggested-by: Petr Pavlu
Reviewed-by: Petr Pavlu
Reviewed-by: Andy Shevchenko
Reviewed-by: Joel Fernandes
Cc: Aaron Tomlin
Cc: Andi Shyti
Cc: Christophe Leroy (CS GROUP)
Cc: Greg Kroah-Hartman
Cc: Jani Nikula
Cc: Randy Dunlap
Cc: Steven Rostedt (Google)
Signed-off-by: Andrew Morton
---
 include/linux/moduleparam.h | 5 +++++
 1 file changed, 5 insertions(+)

(limited to 'include/linux')

diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
index 915f32f7d888..03a977168c52 100644
--- a/include/linux/moduleparam.h
+++ b/include/linux/moduleparam.h
@@ -2,9 +2,14 @@
 #ifndef _LINUX_MODULE_PARAMS_H
 #define _LINUX_MODULE_PARAMS_H
 /* (C) Copyright 2001, 2002 Rusty Russell IBM Corporation */
+
+#include
+#include
+#include
 #include
 #include
 #include
+#include

 /*
  * The maximum module name length, including the NUL byte.
-- 
cgit v1.2.3

From 90ddd39b881df74b14918cee031154f6ddb7af33 Mon Sep 17 00:00:00 2001
From: Yury Norov
Date: Thu, 15 Jan 2026 23:25:06 -0500
Subject: kernel.h: move VERIFY_OCTAL_PERMISSIONS() to sysfs.h

The macro is related to sysfs, but is defined in kernel.h. Move it to
the proper header, and unload the generic kernel.h.

Now that the macro is removed from kernel.h, linux/moduleparam.h is
decoupled, and kernel.h inclusion can be removed.
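
For readers unfamiliar with the macro being moved, the rules it enforces
at build time can be restated as an ordinary function. This is only an
illustrative sketch: octal_perms_ok() and octal_perms_examples() are
hypothetical names, and the real macro rejects bad values at compile
time through BUILD_BUG_ON_ZERO() rather than returning a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical runtime mirror of the compile-time checks; not a kernel API. */
static bool octal_perms_ok(int perms)
{
	if (perms < 0 || perms > 0777)
		return false;
	/* USER_READABLE >= GROUP_READABLE >= OTHER_READABLE */
	if (((perms >> 6) & 4) < ((perms >> 3) & 4))
		return false;
	if (((perms >> 3) & 4) < (perms & 4))
		return false;
	/* USER_WRITABLE >= GROUP_WRITABLE */
	if (((perms >> 6) & 2) < ((perms >> 3) & 2))
		return false;
	/* OTHER_WRITABLE is rejected outright */
	if (perms & 2)
		return false;
	return true;
}

/* A few sample modes, checked against the rules above. */
static void octal_perms_examples(void)
{
	assert(octal_perms_ok(0644));	/* rw-r--r-- is accepted */
	assert(!octal_perms_ok(0666));	/* writable by others */
	assert(!octal_perms_ok(0464));	/* group writable, user not */
}
```

A mode written without the leading 0, such as decimal 644, is octal
01204 and fails the range check, which is the mistake the macro's
comment warns about.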

Link: https://lkml.kernel.org/r/20260116042510.241009-4-ynorov@nvidia.com
Signed-off-by: Yury Norov
Acked-by: Randy Dunlap
Tested-by: Randy Dunlap
Reviewed-by: Andy Shevchenko
Reviewed-by: Petr Pavlu
Acked-by: Greg Kroah-Hartman
Reviewed-by: Joel Fernandes
Cc: Aaron Tomlin
Cc: Andi Shyti
Cc: Christophe Leroy (CS GROUP)
Cc: Jani Nikula
Cc: Steven Rostedt (Google)
Signed-off-by: Andrew Morton
---
 Documentation/filesystems/sysfs.rst |  2 +-
 include/linux/kernel.h              | 12 ------------
 include/linux/moduleparam.h         |  2 +-
 include/linux/sysfs.h               | 13 +++++++++++++
 4 files changed, 15 insertions(+), 14 deletions(-)

(limited to 'include/linux')

diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst
index 2703c04af7d0..ffcef4d6bc8d 100644
--- a/Documentation/filesystems/sysfs.rst
+++ b/Documentation/filesystems/sysfs.rst
@@ -120,7 +120,7 @@ is equivalent to doing::
 	    .store = store_foo,
     };

-Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally
+Note as stated in include/linux/sysfs.h "OTHER_WRITABLE? Generally
 considered a bad idea." so trying to set a sysfs file writable for
 everyone will fail reverting to RO mode for "Others".

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index cefe733a0c10..09850b26061c 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -388,16 +388,4 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
 # define REBUILD_DUE_TO_DYNAMIC_FTRACE
 #endif

-/* Permissions on a sysfs file: you didn't miss the 0 prefix did you? */
-#define VERIFY_OCTAL_PERMISSIONS(perms)	\
-	(BUILD_BUG_ON_ZERO((perms) < 0) +	\
-	 BUILD_BUG_ON_ZERO((perms) > 0777) +	\
-	 /* USER_READABLE >= GROUP_READABLE >= OTHER_READABLE */	\
-	 BUILD_BUG_ON_ZERO((((perms) >> 6) & 4) < (((perms) >> 3) & 4)) +	\
-	 BUILD_BUG_ON_ZERO((((perms) >> 3) & 4) < ((perms) & 4)) +	\
-	 /* USER_WRITABLE >= GROUP_WRITABLE */	\
-	 BUILD_BUG_ON_ZERO((((perms) >> 6) & 2) < (((perms) >> 3) & 2)) +	\
-	 /* OTHER_WRITABLE?
Generally considered a bad idea. */	\
-	 BUILD_BUG_ON_ZERO((perms) & 2) +	\
-	 (perms))
 #endif
diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
index 03a977168c52..281a006dc284 100644
--- a/include/linux/moduleparam.h
+++ b/include/linux/moduleparam.h
@@ -8,7 +8,7 @@
 #include
 #include
 #include
-#include
+#include
 #include

 /*
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index c33a96b7391a..99b775f3ff46 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -808,4 +808,17 @@ static inline void sysfs_put(struct kernfs_node *kn)
 	kernfs_put(kn);
 }

+/* Permissions on a sysfs file: you didn't miss the 0 prefix did you? */
+#define VERIFY_OCTAL_PERMISSIONS(perms)	\
+	(BUILD_BUG_ON_ZERO((perms) < 0) +	\
+	 BUILD_BUG_ON_ZERO((perms) > 0777) +	\
+	 /* USER_READABLE >= GROUP_READABLE >= OTHER_READABLE */	\
+	 BUILD_BUG_ON_ZERO((((perms) >> 6) & 4) < (((perms) >> 3) & 4)) +	\
+	 BUILD_BUG_ON_ZERO((((perms) >> 3) & 4) < ((perms) & 4)) +	\
+	 /* USER_WRITABLE >= GROUP_WRITABLE */	\
+	 BUILD_BUG_ON_ZERO((((perms) >> 6) & 2) < (((perms) >> 3) & 2)) +	\
+	 /* OTHER_WRITABLE? Generally considered a bad idea. */	\
+	 BUILD_BUG_ON_ZERO((perms) & 2) +	\
+	 (perms))
+
 #endif /* _SYSFS_H_ */
-- 
cgit v1.2.3

From 269586d68994ca307ded058255e243692e3bf753 Mon Sep 17 00:00:00 2001
From: Yury Norov
Date: Thu, 15 Jan 2026 23:25:07 -0500
Subject: kernel.h: include linux/instruction_pointer.h explicitly

In preparation for decoupling linux/instruction_pointer.h and
linux/kernel.h, include instruction_pointer.h explicitly where needed.

Link: https://lkml.kernel.org/r/20260116042510.241009-5-ynorov@nvidia.com
Signed-off-by: Yury Norov
Reviewed-by: Andy Shevchenko
Reviewed-by: Joel Fernandes
Cc: Aaron Tomlin
Cc: Andi Shyti
Cc: Christophe Leroy (CS GROUP)
Cc: Greg Kroah-Hartman
Cc: Jani Nikula
Cc: Petr Pavlu
Cc: Randy Dunlap
Cc: Steven Rostedt (Google)
Signed-off-by: Andrew Morton
---
 arch/s390/include/asm/processor.h | 1 +
 include/linux/ww_mutex.h          | 1 +
 2 files changed, 2 insertions(+)

(limited to 'include/linux')

diff --git a/arch/s390/include/asm/processor.h b/arch/s390/include/asm/processor.h
index 3affba95845b..cc187afa07b3 100644
--- a/arch/s390/include/asm/processor.h
+++ b/arch/s390/include/asm/processor.h
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h
index 45ff6f7a872b..9b30fa2ec508 100644
--- a/include/linux/ww_mutex.h
+++ b/include/linux/ww_mutex.h
@@ -17,6 +17,7 @@
 #ifndef __LINUX_WW_MUTEX_H
 #define __LINUX_WW_MUTEX_H

+#include
 #include
 #include

-- 
cgit v1.2.3

From 86e685ff364394b477cd1c476029480a2a1960c5 Mon Sep 17 00:00:00 2001
From: Steven Rostedt
Date: Thu, 15 Jan 2026 23:25:08 -0500
Subject: tracing: remove size parameter in __trace_puts()

The __trace_puts() function takes a string pointer and the size of the
string itself. All users currently simply pass in the strlen() of the
string they are also passing in. There's no reason to pass in the size.
Instead have the __trace_puts() function do the strlen() within the
function itself.

This fixes a header recursion issue where using strlen() in the macro
calling __trace_puts() requires adding #include in order to use
strlen(). Removing the use of strlen() from the header fixes the
recursion issue.
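
The shape of the change can be sketched outside the kernel. All demo_*
names below are hypothetical stand-ins, not the tracing API; the point
is only that once the callee computes the length, the macro that wraps
it no longer mentions strlen() and so does not need a string header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the buffer write; reports how many bytes it took. */
static int demo_array_puts(const char *str, size_t size)
{
	assert(str != NULL);
	(void)str;	/* a real implementation would copy size bytes */
	return (int)size;
}

/* After the change: no size parameter, strlen() lives with the implementation. */
static int demo_trace_puts(const char *str)
{
	return demo_array_puts(str, strlen(str));
}

/* The header-side macro now forwards the string only. */
#define demo_puts(str) demo_trace_puts(str)
```

Calling demo_puts("hello") returns 5 here, the same value a caller would
previously have had to compute and pass in itself.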

Link: https://lore.kernel.org/all/aUN8Hm377C5A0ILX@yury/
Link: https://lkml.kernel.org/r/20260116042510.241009-6-ynorov@nvidia.com
Signed-off-by: Steven Rostedt (Google)
Signed-off-by: Yury Norov
Reviewed-by: Andy Shevchenko
Reviewed-by: Joel Fernandes
Cc: Aaron Tomlin
Cc: Andi Shyti
Cc: Christophe Leroy (CS GROUP)
Cc: Greg Kroah-Hartman
Cc: Jani Nikula
Cc: Petr Pavlu
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
---
 include/linux/kernel.h | 4 ++--
 kernel/trace/trace.c   | 7 +++----
 kernel/trace/trace.h   | 2 +-
 3 files changed, 6 insertions(+), 7 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 09850b26061c..5838c419ed37 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -328,10 +328,10 @@ int __trace_printk(unsigned long ip, const char *fmt, ...);
 	if (__builtin_constant_p(str))	\
 		__trace_bputs(_THIS_IP_, trace_printk_fmt);	\
 	else	\
-		__trace_puts(_THIS_IP_, str, strlen(str));	\
+		__trace_puts(_THIS_IP_, str);	\
 })
 extern int __trace_bputs(unsigned long ip, const char *str);
-extern int __trace_puts(unsigned long ip, const char *str, int size);
+extern int __trace_puts(unsigned long ip, const char *str);

 extern void trace_dump_stack(int skip);

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index baec63134ab6..e18005807395 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1178,11 +1178,10 @@ EXPORT_SYMBOL_GPL(__trace_array_puts);
  * __trace_puts - write a constant string into the trace buffer.
  * @ip:	   The address of the caller
  * @str:   The constant string to write
- * @size:  The size of the string.
  */
-int __trace_puts(unsigned long ip, const char *str, int size)
+int __trace_puts(unsigned long ip, const char *str)
 {
-	return __trace_array_puts(printk_trace, ip, str, size);
+	return __trace_array_puts(printk_trace, ip, str, strlen(str));
 }
 EXPORT_SYMBOL_GPL(__trace_puts);

@@ -1201,7 +1200,7 @@ int __trace_bputs(unsigned long ip, const char *str)
 	int size = sizeof(struct bputs_entry);

 	if (!printk_binsafe(tr))
-		return __trace_puts(ip, str, strlen(str));
+		return __trace_puts(ip, str);

 	if (!(tr->trace_flags & TRACE_ITER(PRINTK)))
 		return 0;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index b6d42fe06115..de4e6713b84e 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -2116,7 +2116,7 @@ extern void tracing_log_err(struct trace_array *tr,
  * about performance). The internal_trace_puts() is for such
  * a purpose.
  */
-#define internal_trace_puts(str) __trace_puts(_THIS_IP_, str, strlen(str))
+#define internal_trace_puts(str) __trace_puts(_THIS_IP_, str)

 #undef FTRACE_ENTRY
 #define FTRACE_ENTRY(call, struct_name, id, tstruct, print)	\
-- 
cgit v1.2.3

From bec261fec6d41318e414c4064f2b67c6db628acd Mon Sep 17 00:00:00 2001
From: Yury Norov
Date: Thu, 15 Jan 2026 23:25:09 -0500
Subject: tracing: move tracing declarations from kernel.h to a dedicated header

Tracing makes up half of kernel.h in terms of LOC, although it is a
self-contained part. It is intended for quick debugging purposes and
isn't used by the normal tracing utilities.

Move it to a separate header. If someone needs to just throw a
trace_printk() in their driver, they will not have to pull in all the
heavy tracing machinery.

This is a pure move.

Link: https://lkml.kernel.org/r/20260116042510.241009-7-ynorov@nvidia.com
Signed-off-by: Yury Norov
Acked-by: Steven Rostedt
Reviewed-by: Andy Shevchenko
Reviewed-by: Joel Fernandes
Cc: Aaron Tomlin
Cc: Andi Shyti
Cc: Christophe Leroy (CS GROUP)
Cc: Greg Kroah-Hartman
Cc: Jani Nikula
Cc: Petr Pavlu
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
---
 include/linux/kernel.h       | 196 +----------------------------------------
 include/linux/trace_printk.h | 204 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 205 insertions(+), 195 deletions(-)
 create mode 100644 include/linux/trace_printk.h

(limited to 'include/linux')

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5838c419ed37..e5570a16cbb1 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -31,7 +31,7 @@
 #include
 #include
 #include
-#include
+#include
 #include
 #include

@@ -189,200 +189,6 @@ enum system_states {
 };
 extern enum system_states system_state;

-/*
- * General tracing related utility functions - trace_printk(),
- * tracing_on/tracing_off and tracing_start()/tracing_stop
- *
- * Use tracing_on/tracing_off when you want to quickly turn on or off
- * tracing. It simply enables or disables the recording of the trace events.
- * This also corresponds to the user space /sys/kernel/tracing/tracing_on
- * file, which gives a means for the kernel and userspace to interact.
- * Place a tracing_off() in the kernel where you want tracing to end.
- * From user space, examine the trace, and then echo 1 > tracing_on
- * to continue tracing.
- *
- * tracing_stop/tracing_start has slightly more overhead. It is used
- * by things like suspend to ram where disabling the recording of the
- * trace is not enough, but tracing must actually stop because things
- * like calling smp_processor_id() may crash the system.
- *
- * Most likely, you want to use tracing_on/tracing_off.
- */
-
-enum ftrace_dump_mode {
-	DUMP_NONE,
-	DUMP_ALL,
-	DUMP_ORIG,
-	DUMP_PARAM,
-};
-
-#ifdef CONFIG_TRACING
-void tracing_on(void);
-void tracing_off(void);
-int tracing_is_on(void);
-void tracing_snapshot(void);
-void tracing_snapshot_alloc(void);
-
-extern void tracing_start(void);
-extern void tracing_stop(void);
-
-static inline __printf(1, 2)
-void ____trace_printk_check_format(const char *fmt, ...)
-{
-}
-#define __trace_printk_check_format(fmt, args...)	\
-do {	\
-	if (0)	\
-		____trace_printk_check_format(fmt, ##args);	\
-} while (0)
-
-/**
- * trace_printk - printf formatting in the ftrace buffer
- * @fmt: the printf format for printing
- *
- * Note: __trace_printk is an internal function for trace_printk() and
- *       the @ip is passed in via the trace_printk() macro.
- *
- * This function allows a kernel developer to debug fast path sections
- * that printk is not appropriate for. By scattering in various
- * printk like tracing in the code, a developer can quickly see
- * where problems are occurring.
- *
- * This is intended as a debugging tool for the developer only.
- * Please refrain from leaving trace_printks scattered around in
- * your code. (Extra memory is used for special buffers that are
- * allocated when trace_printk() is used.)
- *
- * A little optimization trick is done here. If there's only one
- * argument, there's no need to scan the string for printf formats.
- * The trace_puts() will suffice. But how can we take advantage of
- * using trace_puts() when trace_printk() has only one argument?
- * By stringifying the args and checking the size we can tell
- * whether or not there are args. __stringify((__VA_ARGS__)) will
- * turn into "()\0" with a size of 3 when there are no args, anything
- * else will be bigger. All we need to do is define a string to this,
- * and then take its size and compare to 3. If it's bigger, use
- * do_trace_printk() otherwise, optimize it to trace_puts(). Then just
- * let gcc optimize the rest.
- */
-
-#define trace_printk(fmt, ...)	\
-do {	\
-	char _______STR[] = __stringify((__VA_ARGS__));	\
-	if (sizeof(_______STR) > 3)	\
-		do_trace_printk(fmt, ##__VA_ARGS__);	\
-	else	\
-		trace_puts(fmt);	\
-} while (0)
-
-#define do_trace_printk(fmt, args...)	\
-do {	\
-	static const char *trace_printk_fmt __used	\
-		__section("__trace_printk_fmt") =	\
-		__builtin_constant_p(fmt) ? fmt : NULL;	\
-	\
-	__trace_printk_check_format(fmt, ##args);	\
-	\
-	if (__builtin_constant_p(fmt))	\
-		__trace_bprintk(_THIS_IP_, trace_printk_fmt, ##args);	\
-	else	\
-		__trace_printk(_THIS_IP_, fmt, ##args);	\
-} while (0)
-
-extern __printf(2, 3)
-int __trace_bprintk(unsigned long ip, const char *fmt, ...);
-
-extern __printf(2, 3)
-int __trace_printk(unsigned long ip, const char *fmt, ...);
-
-/**
- * trace_puts - write a string into the ftrace buffer
- * @str: the string to record
- *
- * Note: __trace_bputs is an internal function for trace_puts and
- *       the @ip is passed in via the trace_puts macro.
- *
- * This is similar to trace_printk() but is made for those really fast
- * paths that a developer wants the least amount of "Heisenbug" effects,
- * where the processing of the print format is still too much.
- *
- * This function allows a kernel developer to debug fast path sections
- * that printk is not appropriate for. By scattering in various
- * printk like tracing in the code, a developer can quickly see
- * where problems are occurring.
- *
- * This is intended as a debugging tool for the developer only.
- * Please refrain from leaving trace_puts scattered around in
- * your code. (Extra memory is used for special buffers that are
- * allocated when trace_puts() is used.)
- *
- * Returns: 0 if nothing was written, positive # if string was.
- *  (1 when __trace_bputs is used, strlen(str) when __trace_puts is used)
- */
-
-#define trace_puts(str) ({	\
-	static const char *trace_printk_fmt __used	\
-		__section("__trace_printk_fmt") =	\
-		__builtin_constant_p(str) ?
str : NULL;	\
-	\
-	if (__builtin_constant_p(str))	\
-		__trace_bputs(_THIS_IP_, trace_printk_fmt);	\
-	else	\
-		__trace_puts(_THIS_IP_, str);	\
-})
-extern int __trace_bputs(unsigned long ip, const char *str);
-extern int __trace_puts(unsigned long ip, const char *str);
-
-extern void trace_dump_stack(int skip);
-
-/*
- * The double __builtin_constant_p is because gcc will give us an error
- * if we try to allocate the static variable to fmt if it is not a
- * constant. Even with the outer if statement.
- */
-#define ftrace_vprintk(fmt, vargs)	\
-do {	\
-	if (__builtin_constant_p(fmt)) {	\
-		static const char *trace_printk_fmt __used	\
-		  __section("__trace_printk_fmt") =	\
-			__builtin_constant_p(fmt) ? fmt : NULL;	\
-	\
-		__ftrace_vbprintk(_THIS_IP_, trace_printk_fmt, vargs);	\
-	} else	\
-		__ftrace_vprintk(_THIS_IP_, fmt, vargs);	\
-} while (0)
-
-extern __printf(2, 0) int
-__ftrace_vbprintk(unsigned long ip, const char *fmt, va_list ap);
-
-extern __printf(2, 0) int
-__ftrace_vprintk(unsigned long ip, const char *fmt, va_list ap);
-
-extern void ftrace_dump(enum ftrace_dump_mode oops_dump_mode);
-#else
-static inline void tracing_start(void) { }
-static inline void tracing_stop(void) { }
-static inline void trace_dump_stack(int skip) { }
-
-static inline void tracing_on(void) { }
-static inline void tracing_off(void) { }
-static inline int tracing_is_on(void) { return 0; }
-static inline void tracing_snapshot(void) { }
-static inline void tracing_snapshot_alloc(void) { }
-
-static inline __printf(1, 2)
-int trace_printk(const char *fmt, ...)
-{
-	return 0;
-}
-static __printf(1, 0) inline int
-ftrace_vprintk(const char *fmt, va_list ap)
-{
-	return 0;
-}
-static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
-#endif /* CONFIG_TRACING */
-
 /* Rebuild everything on CONFIG_DYNAMIC_FTRACE */
 #ifdef CONFIG_DYNAMIC_FTRACE
 # define REBUILD_DUE_TO_DYNAMIC_FTRACE
diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
new file mode 100644
index 000000000000..bb5874097f24
--- /dev/null
+++ b/include/linux/trace_printk.h
@@ -0,0 +1,204 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TRACE_PRINTK_H
+#define _LINUX_TRACE_PRINTK_H
+
+#include
+#include
+#include
+#include
+
+/*
+ * General tracing related utility functions - trace_printk(),
+ * tracing_on/tracing_off and tracing_start()/tracing_stop
+ *
+ * Use tracing_on/tracing_off when you want to quickly turn on or off
+ * tracing. It simply enables or disables the recording of the trace events.
+ * This also corresponds to the user space /sys/kernel/tracing/tracing_on
+ * file, which gives a means for the kernel and userspace to interact.
+ * Place a tracing_off() in the kernel where you want tracing to end.
+ * From user space, examine the trace, and then echo 1 > tracing_on
+ * to continue tracing.
+ *
+ * tracing_stop/tracing_start has slightly more overhead. It is used
+ * by things like suspend to ram where disabling the recording of the
+ * trace is not enough, but tracing must actually stop because things
+ * like calling smp_processor_id() may crash the system.
+ *
+ * Most likely, you want to use tracing_on/tracing_off.
+ */
+
+enum ftrace_dump_mode {
+	DUMP_NONE,
+	DUMP_ALL,
+	DUMP_ORIG,
+	DUMP_PARAM,
+};
+
+#ifdef CONFIG_TRACING
+void tracing_on(void);
+void tracing_off(void);
+int tracing_is_on(void);
+void tracing_snapshot(void);
+void tracing_snapshot_alloc(void);
+
+extern void tracing_start(void);
+extern void tracing_stop(void);
+
+static inline __printf(1, 2)
+void ____trace_printk_check_format(const char *fmt, ...)
+{
+}
+#define __trace_printk_check_format(fmt, args...)	\
+do {	\
+	if (0)	\
+		____trace_printk_check_format(fmt, ##args);	\
+} while (0)
+
+/**
+ * trace_printk - printf formatting in the ftrace buffer
+ * @fmt: the printf format for printing
+ *
+ * Note: __trace_printk is an internal function for trace_printk() and
+ *       the @ip is passed in via the trace_printk() macro.
+ *
+ * This function allows a kernel developer to debug fast path sections
+ * that printk is not appropriate for. By scattering in various
+ * printk like tracing in the code, a developer can quickly see
+ * where problems are occurring.
+ *
+ * This is intended as a debugging tool for the developer only.
+ * Please refrain from leaving trace_printks scattered around in
+ * your code. (Extra memory is used for special buffers that are
+ * allocated when trace_printk() is used.)
+ *
+ * A little optimization trick is done here. If there's only one
+ * argument, there's no need to scan the string for printf formats.
+ * The trace_puts() will suffice. But how can we take advantage of
+ * using trace_puts() when trace_printk() has only one argument?
+ * By stringifying the args and checking the size we can tell
+ * whether or not there are args. __stringify((__VA_ARGS__)) will
+ * turn into "()\0" with a size of 3 when there are no args, anything
+ * else will be bigger. All we need to do is define a string to this,
+ * and then take its size and compare to 3. If it's bigger, use
+ * do_trace_printk() otherwise, optimize it to trace_puts(). Then just
+ * let gcc optimize the rest.
+ */
+
+#define trace_printk(fmt, ...)	\
+do {	\
+	char _______STR[] = __stringify((__VA_ARGS__));	\
+	if (sizeof(_______STR) > 3)	\
+		do_trace_printk(fmt, ##__VA_ARGS__);	\
+	else	\
+		trace_puts(fmt);	\
+} while (0)
+
+#define do_trace_printk(fmt, args...)	\
+do {	\
+	static const char *trace_printk_fmt __used	\
+		__section("__trace_printk_fmt") =	\
+		__builtin_constant_p(fmt) ? fmt : NULL;	\
+	\
+	__trace_printk_check_format(fmt, ##args);	\
+	\
+	if (__builtin_constant_p(fmt))	\
+		__trace_bprintk(_THIS_IP_, trace_printk_fmt, ##args);	\
+	else	\
+		__trace_printk(_THIS_IP_, fmt, ##args);	\
+} while (0)
+
+extern __printf(2, 3)
+int __trace_bprintk(unsigned long ip, const char *fmt, ...);
+
+extern __printf(2, 3)
+int __trace_printk(unsigned long ip, const char *fmt, ...);
+
+/**
+ * trace_puts - write a string into the ftrace buffer
+ * @str: the string to record
+ *
+ * Note: __trace_bputs is an internal function for trace_puts and
+ *       the @ip is passed in via the trace_puts macro.
+ *
+ * This is similar to trace_printk() but is made for those really fast
+ * paths that a developer wants the least amount of "Heisenbug" effects,
+ * where the processing of the print format is still too much.
+ *
+ * This function allows a kernel developer to debug fast path sections
+ * that printk is not appropriate for. By scattering in various
+ * printk like tracing in the code, a developer can quickly see
+ * where problems are occurring.
+ *
+ * This is intended as a debugging tool for the developer only.
+ * Please refrain from leaving trace_puts scattered around in
+ * your code. (Extra memory is used for special buffers that are
+ * allocated when trace_puts() is used.)
+ *
+ * Returns: 0 if nothing was written, positive # if string was.
+ *  (1 when __trace_bputs is used, strlen(str) when __trace_puts is used)
+ */
+
+#define trace_puts(str) ({	\
+	static const char *trace_printk_fmt __used	\
+		__section("__trace_printk_fmt") =	\
+		__builtin_constant_p(str) ?
str : NULL;	\
+	\
+	if (__builtin_constant_p(str))	\
+		__trace_bputs(_THIS_IP_, trace_printk_fmt);	\
+	else	\
+		__trace_puts(_THIS_IP_, str);	\
+})
+extern int __trace_bputs(unsigned long ip, const char *str);
+extern int __trace_puts(unsigned long ip, const char *str);
+
+extern void trace_dump_stack(int skip);
+
+/*
+ * The double __builtin_constant_p is because gcc will give us an error
+ * if we try to allocate the static variable to fmt if it is not a
+ * constant. Even with the outer if statement.
+ */
+#define ftrace_vprintk(fmt, vargs)	\
+do {	\
+	if (__builtin_constant_p(fmt)) {	\
+		static const char *trace_printk_fmt __used	\
+		  __section("__trace_printk_fmt") =	\
+			__builtin_constant_p(fmt) ? fmt : NULL;	\
+	\
+		__ftrace_vbprintk(_THIS_IP_, trace_printk_fmt, vargs);	\
+	} else	\
+		__ftrace_vprintk(_THIS_IP_, fmt, vargs);	\
+} while (0)
+
+extern __printf(2, 0) int
+__ftrace_vbprintk(unsigned long ip, const char *fmt, va_list ap);
+
+extern __printf(2, 0) int
+__ftrace_vprintk(unsigned long ip, const char *fmt, va_list ap);
+
+extern void ftrace_dump(enum ftrace_dump_mode oops_dump_mode);
+#else
+static inline void tracing_start(void) { }
+static inline void tracing_stop(void) { }
+static inline void trace_dump_stack(int skip) { }
+
+static inline void tracing_on(void) { }
+static inline void tracing_off(void) { }
+static inline int tracing_is_on(void) { return 0; }
+static inline void tracing_snapshot(void) { }
+static inline void tracing_snapshot_alloc(void) { }
+
+static inline __printf(1, 2)
+int trace_printk(const char *fmt, ...)
+{
+	return 0;
+}
+static __printf(1, 0) inline int
+ftrace_vprintk(const char *fmt, va_list ap)
+{
+	return 0;
+}
+static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
+#endif /* CONFIG_TRACING */
+
+#endif
-- 
cgit v1.2.3

From 503efe850c7463a1e59df133b84461ef53c0361f Mon Sep 17 00:00:00 2001
From: Wang Yaxin
Date: Mon, 19 Jan 2026 10:02:41 +0800
Subject: delayacct: add timestamp of delay max

Problem
=======

Commit 658eb5ab916d ("delayacct: add delay max to record delay peak")
introduced the delay max for getdelays, which records abnormal latency
peaks and helps us understand the magnitude of such delays. However, the
peak latency value alone is insufficient for effective root cause
analysis. Without the precise timestamp of when the peak occurred, we
still lack the critical context needed to correlate it with other system
events.

Solution
========

To address this, we need to additionally record a precise timestamp when
the maximum latency occurs. By correlating this timestamp with system
logs and monitoring metrics, we can identify processes with abnormal
resource usage at the same moment, which can help us to pinpoint root
causes.
Use Case
========
bash-4.4# ./getdelays -d -t 227
print delayacct stats ON
TGID 227
CPU         count  real total  virtual total  delay total  delay average   delay max   delay min  delay max timestamp
               46   188000000      192348334      4098012        0.089ms  0.429260ms  0.051205ms  2026-01-15T15:06:58
IO          count  delay total  delay average   delay max   delay min  delay max timestamp
                0            0        0.000ms  0.000000ms  0.000000ms                  N/A
SWAP        count  delay total  delay average   delay max   delay min  delay max timestamp
                0            0        0.000ms  0.000000ms  0.000000ms                  N/A
RECLAIM     count  delay total  delay average   delay max   delay min  delay max timestamp
                0            0        0.000ms  0.000000ms  0.000000ms                  N/A
THRASHING   count  delay total  delay average   delay max   delay min  delay max timestamp
                0            0        0.000ms  0.000000ms  0.000000ms                  N/A
COMPACT     count  delay total  delay average   delay max   delay min  delay max timestamp
                0            0        0.000ms  0.000000ms  0.000000ms                  N/A
WPCOPY      count  delay total  delay average   delay max   delay min  delay max timestamp
              182     19413338        0.107ms  0.547353ms  0.022462ms  2026-01-15T15:05:24
IRQ         count  delay total  delay average   delay max   delay min  delay max timestamp
                0            0        0.000ms  0.000000ms  0.000000ms                  N/A

Link: https://lkml.kernel.org/r/20260119100241520gWubW8-5QfhSf9gjqcc_E@zte.com.cn
Signed-off-by: Wang Yaxin
Cc: Fan Yu
Cc: Jonathan Corbet
Cc: xu xin
Cc: Yang Yang
Signed-off-by: Andrew Morton
---
 Documentation/accounting/delay-accounting.rst |  32 ++---
 include/linux/delayacct.h                     |   8 ++
 include/linux/sched.h                         |   5 +
 include/uapi/linux/taskstats.h                |  22 +++-
 kernel/delayacct.c                            |  31 +++--
 kernel/sched/stats.h                          |   8 +-
 tools/accounting/getdelays.c                  | 172 ++++++++++++++++++++++----
 7 files changed, 223 insertions(+), 55 deletions(-)

(limited to 'include/linux')

diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst
index 86d7902a657f..e209c46241b0 100644
--- a/Documentation/accounting/delay-accounting.rst
+++ b/Documentation/accounting/delay-accounting.rst
@@ -107,22 +107,22 @@
 Get sum and peak of delays, since system boot, for all pids
with tgid 242::

 TGID 242
- CPU         count  real total  virtual total  delay total  delay average   delay max   delay min
-                39   156000000      156576579      2111069        0.054ms  0.212296ms  0.031307ms
- IO          count  delay total  delay average   delay max   delay min
-                 0            0        0.000ms  0.000000ms  0.000000ms
- SWAP        count  delay total  delay average   delay max   delay min
-                 0            0        0.000ms  0.000000ms  0.000000ms
- RECLAIM     count  delay total  delay average   delay max   delay min
-                 0            0        0.000ms  0.000000ms  0.000000ms
- THRASHING   count  delay total  delay average   delay max   delay min
-                 0            0        0.000ms  0.000000ms  0.000000ms
- COMPACT     count  delay total  delay average   delay max   delay min
-                 0            0        0.000ms  0.000000ms  0.000000ms
- WPCOPY      count  delay total  delay average   delay max   delay min
-               156     11215873        0.072ms  0.207403ms  0.033913ms
- IRQ         count  delay total  delay average   delay max   delay min
-                 0            0        0.000ms  0.000000ms  0.000000ms
+ CPU         count  real total  virtual total  delay total  delay average   delay max   delay min  delay max timestamp
+                46   188000000      192348334      4098012        0.089ms  0.429260ms  0.051205ms  2026-01-15T15:06:58
+ IO          count  delay total  delay average   delay max   delay min  delay max timestamp
+                 0            0        0.000ms  0.000000ms  0.000000ms                  N/A
+ SWAP        count  delay total  delay average   delay max   delay min  delay max timestamp
+                 0            0        0.000ms  0.000000ms  0.000000ms                  N/A
+ RECLAIM     count  delay total  delay average   delay max   delay min  delay max timestamp
+                 0            0        0.000ms  0.000000ms  0.000000ms                  N/A
+ THRASHING   count  delay total  delay average   delay max   delay min  delay max timestamp
+                 0            0        0.000ms  0.000000ms  0.000000ms                  N/A
+ COMPACT     count  delay total  delay average   delay max   delay min  delay max timestamp
+                 0            0        0.000ms  0.000000ms  0.000000ms                  N/A
+ WPCOPY      count  delay total  delay average   delay max   delay min  delay max timestamp
+               182     19413338        0.107ms  0.547353ms  0.022462ms  2026-01-15T15:05:24
+ IRQ         count  delay total  delay average   delay max   delay min  delay max timestamp
+                 0            0        0.000ms  0.000000ms  0.000000ms                  N/A

 Get IO accounting for pid 1, it works only with -p::

diff --git
a/include/linux/delayacct.h b/include/linux/delayacct.h index 800dcc360db2..ecb06f16d22c 100644 --- a/include/linux/delayacct.h +++ b/include/linux/delayacct.h @@ -69,6 +69,14 @@ struct task_delay_info { u32 compact_count; /* total count of memory compact */ u32 wpcopy_count; /* total count of write-protect copy */ u32 irq_count; /* total count of IRQ/SOFTIRQ */ + + struct timespec64 blkio_delay_max_ts; + struct timespec64 swapin_delay_max_ts; + struct timespec64 freepages_delay_max_ts; + struct timespec64 thrashing_delay_max_ts; + struct timespec64 compact_delay_max_ts; + struct timespec64 wpcopy_delay_max_ts; + struct timespec64 irq_delay_max_ts; }; #endif diff --git a/include/linux/sched.h b/include/linux/sched.h index da0133524d08..1d22b6229b95 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -49,6 +49,7 @@ #include #include #include +#include #ifndef COMPILE_OFFSETS #include #endif @@ -86,6 +87,7 @@ struct signal_struct; struct task_delay_info; struct task_group; struct task_struct; +struct timespec64; struct user_event_mm; #include @@ -435,6 +437,9 @@ struct sched_info { /* When were we last queued to run? 
*/ unsigned long long last_queued; + /* Timestamp of max time spent waiting on a runqueue: */ + struct timespec64 max_run_delay_ts; + #endif /* CONFIG_SCHED_INFO */ }; diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h index 5929030d4e8b..1b31e8e14d2f 100644 --- a/include/uapi/linux/taskstats.h +++ b/include/uapi/linux/taskstats.h @@ -18,6 +18,16 @@ #define _LINUX_TASKSTATS_H #include +#ifdef __KERNEL__ +#include +#else +#ifndef _LINUX_TIME64_H +struct timespec64 { + __s64 tv_sec; /* seconds */ + long tv_nsec; /* nanoseconds */ +}; +#endif +#endif /* Format for per-task data returned to userland when * - a task exits @@ -34,7 +44,7 @@ */ -#define TASKSTATS_VERSION 16 +#define TASKSTATS_VERSION 17 #define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN * in linux/sched.h */ @@ -230,6 +240,16 @@ struct taskstats { __u64 irq_delay_max; __u64 irq_delay_min; + + /*v17: delay max timestamp record*/ + struct timespec64 cpu_delay_max_ts; + struct timespec64 blkio_delay_max_ts; + struct timespec64 swapin_delay_max_ts; + struct timespec64 freepages_delay_max_ts; + struct timespec64 thrashing_delay_max_ts; + struct timespec64 compact_delay_max_ts; + struct timespec64 wpcopy_delay_max_ts; + struct timespec64 irq_delay_max_ts; }; diff --git a/kernel/delayacct.c b/kernel/delayacct.c index 30e7912ebb0d..d58ffc63bcba 100644 --- a/kernel/delayacct.c +++ b/kernel/delayacct.c @@ -18,6 +18,7 @@ do { \ d->type##_delay_max = tsk->delays->type##_delay_max; \ d->type##_delay_min = tsk->delays->type##_delay_min; \ + d->type##_delay_max_ts = tsk->delays->type##_delay_max_ts; \ tmp = d->type##_delay_total + tsk->delays->type##_delay; \ d->type##_delay_total = (tmp < d->type##_delay_total) ? 
0 : tmp; \ d->type##_count += tsk->delays->type##_count; \ @@ -104,7 +105,8 @@ void __delayacct_tsk_init(struct task_struct *tsk) * Finish delay accounting for a statistic using its timestamps (@start), * accumulator (@total) and @count */ -static void delayacct_end(raw_spinlock_t *lock, u64 *start, u64 *total, u32 *count, u64 *max, u64 *min) +static void delayacct_end(raw_spinlock_t *lock, u64 *start, u64 *total, u32 *count, + u64 *max, u64 *min, struct timespec64 *ts) { s64 ns = local_clock() - *start; unsigned long flags; @@ -113,8 +115,10 @@ static void delayacct_end(raw_spinlock_t *lock, u64 *start, u64 *total, u32 *cou raw_spin_lock_irqsave(lock, flags); *total += ns; (*count)++; - if (ns > *max) + if (ns > *max) { *max = ns; + ktime_get_real_ts64(ts); + } if (*min == 0 || ns < *min) *min = ns; raw_spin_unlock_irqrestore(lock, flags); @@ -137,7 +141,8 @@ void __delayacct_blkio_end(struct task_struct *p) &p->delays->blkio_delay, &p->delays->blkio_count, &p->delays->blkio_delay_max, - &p->delays->blkio_delay_min); + &p->delays->blkio_delay_min, + &p->delays->blkio_delay_max_ts); } int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) @@ -170,6 +175,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) d->cpu_delay_max = tsk->sched_info.max_run_delay; d->cpu_delay_min = tsk->sched_info.min_run_delay; + d->cpu_delay_max_ts = tsk->sched_info.max_run_delay_ts; tmp = (s64)d->cpu_delay_total + t2; d->cpu_delay_total = (tmp < (s64)d->cpu_delay_total) ? 
0 : tmp; tmp = (s64)d->cpu_run_virtual_total + t3; @@ -217,7 +223,8 @@ void __delayacct_freepages_end(void) &current->delays->freepages_delay, &current->delays->freepages_count, &current->delays->freepages_delay_max, - &current->delays->freepages_delay_min); + &current->delays->freepages_delay_min, + &current->delays->freepages_delay_max_ts); } void __delayacct_thrashing_start(bool *in_thrashing) @@ -241,7 +248,8 @@ void __delayacct_thrashing_end(bool *in_thrashing) &current->delays->thrashing_delay, &current->delays->thrashing_count, &current->delays->thrashing_delay_max, - &current->delays->thrashing_delay_min); + &current->delays->thrashing_delay_min, + &current->delays->thrashing_delay_max_ts); } void __delayacct_swapin_start(void) @@ -256,7 +264,8 @@ void __delayacct_swapin_end(void) &current->delays->swapin_delay, &current->delays->swapin_count, &current->delays->swapin_delay_max, - &current->delays->swapin_delay_min); + &current->delays->swapin_delay_min, + &current->delays->swapin_delay_max_ts); } void __delayacct_compact_start(void) @@ -271,7 +280,8 @@ void __delayacct_compact_end(void) &current->delays->compact_delay, &current->delays->compact_count, &current->delays->compact_delay_max, - &current->delays->compact_delay_min); + &current->delays->compact_delay_min, + &current->delays->compact_delay_max_ts); } void __delayacct_wpcopy_start(void) @@ -286,7 +296,8 @@ void __delayacct_wpcopy_end(void) &current->delays->wpcopy_delay, &current->delays->wpcopy_count, &current->delays->wpcopy_delay_max, - &current->delays->wpcopy_delay_min); + &current->delays->wpcopy_delay_min, + &current->delays->wpcopy_delay_max_ts); } void __delayacct_irq(struct task_struct *task, u32 delta) @@ -296,8 +307,10 @@ void __delayacct_irq(struct task_struct *task, u32 delta) raw_spin_lock_irqsave(&task->delays->lock, flags); task->delays->irq_delay += delta; task->delays->irq_count++; - if (delta > task->delays->irq_delay_max) + if (delta > task->delays->irq_delay_max) { task->delays->irq_delay_max = delta; + ktime_get_real_ts64(&task->delays->irq_delay_max_ts); + } if (delta && (!task->delays->irq_delay_min || delta < task->delays->irq_delay_min))
task->delays->irq_delay_min = delta; raw_spin_unlock_irqrestore(&task->delays->lock, flags); diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index c903f1a42891..a612cf253c87 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -253,8 +253,10 @@ static inline void sched_info_dequeue(struct rq *rq, struct task_struct *t) delta = rq_clock(rq) - t->sched_info.last_queued; t->sched_info.last_queued = 0; t->sched_info.run_delay += delta; - if (delta > t->sched_info.max_run_delay) + if (delta > t->sched_info.max_run_delay) { t->sched_info.max_run_delay = delta; + ktime_get_real_ts64(&t->sched_info.max_run_delay_ts); + } if (delta && (!t->sched_info.min_run_delay || delta < t->sched_info.min_run_delay)) t->sched_info.min_run_delay = delta; rq_sched_info_dequeue(rq, delta); @@ -278,8 +280,10 @@ static void sched_info_arrive(struct rq *rq, struct task_struct *t) t->sched_info.run_delay += delta; t->sched_info.last_arrival = now; t->sched_info.pcount++; - if (delta > t->sched_info.max_run_delay) + if (delta > t->sched_info.max_run_delay) { t->sched_info.max_run_delay = delta; + ktime_get_real_ts64(&t->sched_info.max_run_delay_ts); + } if (delta && (!t->sched_info.min_run_delay || delta < t->sched_info.min_run_delay)) t->sched_info.min_run_delay = delta; diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c index 21cb3c3d1331..64796c0223be 100644 --- a/tools/accounting/getdelays.c +++ b/tools/accounting/getdelays.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include @@ -194,6 +195,37 @@ static int get_family_id(int sd) #define average_ms(t, c) (t / 1000000ULL / (c ? 
c : 1)) #define delay_ms(t) (t / 1000000ULL) +/* + * Format timespec64 to human readable string (YYYY-MM-DD HH:MM:SS) + * Returns formatted string or "N/A" if timestamp is zero + */ +static const char *format_timespec64(struct timespec64 *ts) +{ + static char buffer[32]; + struct tm tm_info; + time_t time_sec; + + /* Check if timestamp is zero (not set) */ + if (ts->tv_sec == 0 && ts->tv_nsec == 0) + return "N/A"; + + time_sec = (time_t)ts->tv_sec; + + /* Use thread-safe localtime_r */ + if (localtime_r(&time_sec, &tm_info) == NULL) + return "N/A"; + + snprintf(buffer, sizeof(buffer), "%04d-%02d-%02dT%02d:%02d:%02d", + tm_info.tm_year + 1900, + tm_info.tm_mon + 1, + tm_info.tm_mday, + tm_info.tm_hour, + tm_info.tm_min, + tm_info.tm_sec); + + return buffer; +} + /* * Version compatibility note: * Field availability depends on taskstats version (t->version), @@ -205,13 +237,28 @@ static int get_family_id(int sd) * version >= 13 - supports WPCOPY statistics * version >= 14 - supports IRQ statistics * version >= 16 - supports *_max and *_min delay statistics + * version >= 17 - supports delay max timestamp statistics * * Always verify version before accessing version-dependent fields * to maintain backward compatibility. 
*/ #define PRINT_CPU_DELAY(version, t) \ do { \ - if (version >= 16) { \ + if (version >= 17) { \ + printf("%-10s%15s%15s%15s%15s%15s%15s%15s%25s\n", \ + "CPU", "count", "real total", "virtual total", \ + "delay total", "delay average", "delay max", \ + "delay min", "delay max timestamp"); \ + printf(" %15llu%15llu%15llu%15llu%15.3fms%13.6fms%13.6fms%23s\n", \ + (unsigned long long)(t)->cpu_count, \ + (unsigned long long)(t)->cpu_run_real_total, \ + (unsigned long long)(t)->cpu_run_virtual_total, \ + (unsigned long long)(t)->cpu_delay_total, \ + average_ms((double)(t)->cpu_delay_total, (t)->cpu_count), \ + delay_ms((double)(t)->cpu_delay_max), \ + delay_ms((double)(t)->cpu_delay_min), \ + format_timespec64(&(t)->cpu_delay_max_ts)); \ + } else if (version >= 16) { \ printf("%-10s%15s%15s%15s%15s%15s%15s%15s\n", \ "CPU", "count", "real total", "virtual total", \ "delay total", "delay average", "delay max", "delay min"); \ @@ -257,44 +304,115 @@ static int get_family_id(int sd) } \ } while (0) +#define PRINT_FILED_DELAY_WITH_TS(name, version, t, count, total, max, min, max_ts) \ + do { \ + if (version >= 17) { \ + printf("%-10s%15s%15s%15s%15s%15s%25s\n", \ + name, "count", "delay total", "delay average", \ + "delay max", "delay min", "delay max timestamp"); \ + printf(" %15llu%15llu%15.3fms%13.6fms%13.6fms%23s\n", \ + (unsigned long long)(t)->count, \ + (unsigned long long)(t)->total, \ + average_ms((double)(t)->total, (t)->count), \ + delay_ms((double)(t)->max), \ + delay_ms((double)(t)->min), \ + format_timespec64(&(t)->max_ts)); \ + } else if (version >= 16) { \ + printf("%-10s%15s%15s%15s%15s%15s\n", \ + name, "count", "delay total", "delay average", \ + "delay max", "delay min"); \ + printf(" %15llu%15llu%15.3fms%13.6fms%13.6fms\n", \ + (unsigned long long)(t)->count, \ + (unsigned long long)(t)->total, \ + average_ms((double)(t)->total, (t)->count), \ + delay_ms((double)(t)->max), \ + delay_ms((double)(t)->min)); \ + } else { \ + printf("%-10s%15s%15s%15s\n", \ 
+ name, "count", "delay total", "delay average"); \ + printf(" %15llu%15llu%15.3fms\n", \ + (unsigned long long)(t)->count, \ + (unsigned long long)(t)->total, \ + average_ms((double)(t)->total, (t)->count)); \ + } \ + } while (0) + static void print_delayacct(struct taskstats *t) { printf("\n\n"); PRINT_CPU_DELAY(t->version, t); - PRINT_FILED_DELAY("IO", t->version, t, - blkio_count, blkio_delay_total, - blkio_delay_max, blkio_delay_min); + /* Use new macro with timestamp support for version >= 17 */ + if (t->version >= 17) { + PRINT_FILED_DELAY_WITH_TS("IO", t->version, t, + blkio_count, blkio_delay_total, + blkio_delay_max, blkio_delay_min, blkio_delay_max_ts); - PRINT_FILED_DELAY("SWAP", t->version, t, - swapin_count, swapin_delay_total, - swapin_delay_max, swapin_delay_min); + PRINT_FILED_DELAY_WITH_TS("SWAP", t->version, t, + swapin_count, swapin_delay_total, + swapin_delay_max, swapin_delay_min, swapin_delay_max_ts); - PRINT_FILED_DELAY("RECLAIM", t->version, t, - freepages_count, freepages_delay_total, - freepages_delay_max, freepages_delay_min); + PRINT_FILED_DELAY_WITH_TS("RECLAIM", t->version, t, + freepages_count, freepages_delay_total, + freepages_delay_max, freepages_delay_min, freepages_delay_max_ts); - PRINT_FILED_DELAY("THRASHING", t->version, t, - thrashing_count, thrashing_delay_total, - thrashing_delay_max, thrashing_delay_min); + PRINT_FILED_DELAY_WITH_TS("THRASHING", t->version, t, + thrashing_count, thrashing_delay_total, + thrashing_delay_max, thrashing_delay_min, thrashing_delay_max_ts); - if (t->version >= 11) { - PRINT_FILED_DELAY("COMPACT", t->version, t, - compact_count, compact_delay_total, - compact_delay_max, compact_delay_min); - } + if (t->version >= 11) { + PRINT_FILED_DELAY_WITH_TS("COMPACT", t->version, t, + compact_count, compact_delay_total, + compact_delay_max, compact_delay_min, compact_delay_max_ts); + } - if (t->version >= 13) { - PRINT_FILED_DELAY("WPCOPY", t->version, t, - wpcopy_count, wpcopy_delay_total, - 
wpcopy_delay_max, wpcopy_delay_min); - } + if (t->version >= 13) { + PRINT_FILED_DELAY_WITH_TS("WPCOPY", t->version, t, + wpcopy_count, wpcopy_delay_total, + wpcopy_delay_max, wpcopy_delay_min, wpcopy_delay_max_ts); + } - if (t->version >= 14) { - PRINT_FILED_DELAY("IRQ", t->version, t, - irq_count, irq_delay_total, - irq_delay_max, irq_delay_min); + if (t->version >= 14) { + PRINT_FILED_DELAY_WITH_TS("IRQ", t->version, t, + irq_count, irq_delay_total, + irq_delay_max, irq_delay_min, irq_delay_max_ts); + } + } else { + /* Use original macro for older versions */ + PRINT_FILED_DELAY("IO", t->version, t, + blkio_count, blkio_delay_total, + blkio_delay_max, blkio_delay_min); + + PRINT_FILED_DELAY("SWAP", t->version, t, + swapin_count, swapin_delay_total, + swapin_delay_max, swapin_delay_min); + + PRINT_FILED_DELAY("RECLAIM", t->version, t, + freepages_count, freepages_delay_total, + freepages_delay_max, freepages_delay_min); + + PRINT_FILED_DELAY("THRASHING", t->version, t, + thrashing_count, thrashing_delay_total, + thrashing_delay_max, thrashing_delay_min); + + if (t->version >= 11) { + PRINT_FILED_DELAY("COMPACT", t->version, t, + compact_count, compact_delay_total, + compact_delay_max, compact_delay_min); + } + + if (t->version >= 13) { + PRINT_FILED_DELAY("WPCOPY", t->version, t, + wpcopy_count, wpcopy_delay_total, + wpcopy_delay_max, wpcopy_delay_min); + } + + if (t->version >= 14) { + PRINT_FILED_DELAY("IRQ", t->version, t, + irq_count, irq_delay_total, + irq_delay_max, irq_delay_min); + } } } -- cgit v1.2.3 From 8924336531e21b187d724b5fdf5277269c9ec22c Mon Sep 17 00:00:00 2001 From: Ondrej Mosnacek Date: Thu, 22 Jan 2026 15:13:03 +0100 Subject: ipc: don't audit capability check in ipc_permissions() The IPC sysctls implement the ctl_table_root::permissions hook and they override the file access mode based on the CAP_CHECKPOINT_RESTORE capability, which is being checked regardless of whether any access is actually denied or not, so if an LSM denies the 
capability, an audit record may be logged even when access is in fact granted. It wouldn't be viable to restructure the sysctl permission logic to only check the capability when the access would be actually denied if it's not granted. Thus, do the same as in net_ctl_permissions() (net/sysctl_net.c) - switch from ns_capable() to ns_capable_noaudit(), so that the check never emits an audit record. Link: https://lkml.kernel.org/r/20260122141303.241133-1-omosnace@redhat.com Fixes: 0889f44e2810 ("ipc: Check permissions for checkpoint_restart sysctls at open time") Signed-off-by: Ondrej Mosnacek Acked-by: Alexey Gladkov Acked-by: Serge Hallyn Cc: Eric Biederman Cc: Paul Moore Signed-off-by: Andrew Morton --- include/linux/capability.h | 6 ++++++ ipc/ipc_sysctl.c | 2 +- 2 files changed, 7 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/capability.h b/include/linux/capability.h index 1fb08922552c..37db92b3d6f8 100644 --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -203,6 +203,12 @@ static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns) ns_capable(ns, CAP_SYS_ADMIN); } +static inline bool checkpoint_restore_ns_capable_noaudit(struct user_namespace *ns) +{ + return ns_capable_noaudit(ns, CAP_CHECKPOINT_RESTORE) || + ns_capable_noaudit(ns, CAP_SYS_ADMIN); +} + /* audit system wants to get cap info from files as well */ int get_vfs_caps_from_disk(struct mnt_idmap *idmap, const struct dentry *dentry, diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c index 15b17e86e198..9b087ebeb643 100644 --- a/ipc/ipc_sysctl.c +++ b/ipc/ipc_sysctl.c @@ -214,7 +214,7 @@ static int ipc_permissions(struct ctl_table_header *head, const struct ctl_table if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) || (table->data == &ns->ids[IPC_MSG_IDS].next_id) || (table->data == &ns->ids[IPC_SHM_IDS].next_id)) && - checkpoint_restore_ns_capable(ns->user_ns)) + checkpoint_restore_ns_capable_noaudit(ns->user_ns)) mode = 0666; else 
#endif -- cgit v1.2.3 From 2e171ab29f916455a49274a2042bac4a4b35570e Mon Sep 17 00:00:00 2001 From: Pnina Feder Date: Thu, 22 Jan 2026 12:24:57 +0200 Subject: panic: add panic_force_cpu= parameter to redirect panic to a specific CPU Some platforms require panic handling to execute on a specific CPU for crash dump to work reliably. This can be due to firmware limitations, interrupt routing constraints, or platform-specific requirements where only a single CPU is able to safely enter the crash kernel. Add the panic_force_cpu= kernel command-line parameter to redirect panic execution to a designated CPU. When the parameter is provided, the CPU that initially triggers panic forwards the panic context to the target CPU via IPI, which then proceeds with the normal panic and kexec flow. The IPI delivery is implemented as a weak function (panic_smp_redirect_cpu) so architectures with NMI support can override it for more reliable delivery. If the specified CPU is invalid, offline, or a panic is already in progress on another CPU, the redirection is skipped and panic continues on the current CPU. 
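The skip conditions listed above can be modelled with a small userspace C sketch. It is a simplified illustration, not the kernel code: `should_redirect()`, `redirect_owner` and `CPU_INVALID` are hypothetical names, the online map and the panic-in-progress flag are passed in as plain arguments, and C11 `atomic_compare_exchange_strong()` stands in for the kernel's atomic compare-and-exchange so that only the first CPU to try can claim the redirect.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CPU_INVALID (-1)

/* Hypothetical model state: the CPU that claimed the redirect, if any. */
static atomic_int redirect_owner = CPU_INVALID;

/*
 * Decide whether the CPU that hit the panic should hand it to target_cpu.
 * Returns true only when redirection is configured, the target is usable,
 * no other panic is running, and this CPU is the first one to claim it.
 */
static bool should_redirect(int this_cpu, int target_cpu, int nr_cpus,
			    const bool *online, bool panic_in_progress)
{
	int expected = CPU_INVALID;

	assert(online != NULL);

	if (target_cpu < 0 || target_cpu >= nr_cpus)
		return false;	/* feature disabled or CPU number invalid */
	if (this_cpu == target_cpu)
		return false;	/* already on the target CPU */
	if (!online[target_cpu])
		return false;	/* target offline: stay on the current CPU */
	if (panic_in_progress)
		return false;	/* another CPU is already handling a panic */

	/* Claim the redirect; a second caller loses and returns false. */
	return atomic_compare_exchange_strong(&redirect_owner, &expected,
					      this_cpu);
}
```

In every false case the caller would simply continue the panic on the current CPU, which matches the fallback behaviour described in the paragraph above.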
[pnina.feder@mobileye.com: fix unused variable warning] Link: https://lkml.kernel.org/r/20260126122618.2967950-1-pnina.feder@mobileye.com Link: https://lkml.kernel.org/r/20260122102457.1154599-1-pnina.feder@mobileye.com Signed-off-by: Pnina Feder Reviewed-by: Petr Mladek Cc: Baoquan He Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Mel Gorman Cc: Peter Zijlstra Cc: Sergey Senozhatsky Cc: Steven Rostedt Cc: Thomas Gleixner Signed-off-by: Andrew Morton --- Documentation/admin-guide/kernel-parameters.txt | 15 +++ include/linux/panic.h | 8 ++ include/linux/smp.h | 1 + kernel/panic.c | 164 +++++++++++++++++++++++- 4 files changed, 186 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 73d846211144..97161861781c 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4788,6 +4788,21 @@ Kernel parameters panic_on_warn=1 panic() instead of WARN(). Useful to cause kdump on a WARN(). + panic_force_cpu= + [KNL,SMP] Force panic handling to execute on a specific CPU. + Format: + Some platforms require panic handling to occur on a + specific CPU for the crash kernel to function correctly. + This can be due to firmware limitations, interrupt routing + constraints, or platform-specific requirements where only + a particular CPU can safely enter the crash kernel. + When set, panic() will redirect execution to the specified + CPU before proceeding with the normal panic and kexec flow. + If the target CPU is offline or unavailable, panic proceeds + on the current CPU. + This option should only be used for systems with the above + constraints as it might cause the panic operation to be less reliable. + panic_print= Bitmask for printing system info when panic happens. 
User can chose combination of the following bits: bit 0: print all tasks info diff --git a/include/linux/panic.h b/include/linux/panic.h index a00bc0937698..f1dd417e54b2 100644 --- a/include/linux/panic.h +++ b/include/linux/panic.h @@ -41,6 +41,14 @@ void abort(void); * PANIC_CPU_INVALID means no CPU has entered panic() or crash_kexec(). */ extern atomic_t panic_cpu; + +/* + * panic_redirect_cpu is used when panic is redirected to a specific CPU via + * the panic_force_cpu= boot parameter. It holds the CPU number that originally + * triggered the panic before redirection. A value of PANIC_CPU_INVALID means + * no redirection has occurred. + */ +extern atomic_t panic_redirect_cpu; #define PANIC_CPU_INVALID -1 bool panic_try_start(void); diff --git a/include/linux/smp.h b/include/linux/smp.h index 91d0ecf3b8d3..1ebd88026119 100644 --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -62,6 +62,7 @@ int smp_call_function_single_async(int cpu, call_single_data_t *csd); void __noreturn panic_smp_self_stop(void); void __noreturn nmi_panic_self_stop(struct pt_regs *regs); void crash_smp_send_stop(void); +int panic_smp_redirect_cpu(int target_cpu, void *msg); /* * Call a function on all processors diff --git a/kernel/panic.c b/kernel/panic.c index 0c20fcaae98a..c78600212b6c 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -42,6 +42,7 @@ #define PANIC_TIMER_STEP 100 #define PANIC_BLINK_SPD 18 +#define PANIC_MSG_BUFSZ 1024 #ifdef CONFIG_SMP /* @@ -74,6 +75,8 @@ EXPORT_SYMBOL_GPL(panic_timeout); unsigned long panic_print; +static int panic_force_cpu = -1; + ATOMIC_NOTIFIER_HEAD(panic_notifier_list); EXPORT_SYMBOL(panic_notifier_list); @@ -300,6 +303,150 @@ void __weak crash_smp_send_stop(void) } atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID); +atomic_t panic_redirect_cpu = ATOMIC_INIT(PANIC_CPU_INVALID); + +#if defined(CONFIG_SMP) && defined(CONFIG_CRASH_DUMP) +static char *panic_force_buf; + +static int __init panic_force_cpu_setup(char *str) +{ + int cpu; + + if 
(!str) + return -EINVAL; + + if (kstrtoint(str, 0, &cpu) || cpu < 0 || cpu >= nr_cpu_ids) { + pr_warn("panic_force_cpu: invalid value '%s'\n", str); + return -EINVAL; + } + + panic_force_cpu = cpu; + return 0; +} +early_param("panic_force_cpu", panic_force_cpu_setup); + +static int __init panic_force_cpu_late_init(void) +{ + if (panic_force_cpu < 0) + return 0; + + panic_force_buf = kmalloc(PANIC_MSG_BUFSZ, GFP_KERNEL); + + return 0; +} +late_initcall(panic_force_cpu_late_init); + +static void do_panic_on_target_cpu(void *info) +{ + panic("%s", (char *)info); +} + +/** + * panic_smp_redirect_cpu - Redirect panic to target CPU + * @target_cpu: CPU that should handle the panic + * @msg: formatted panic message + * + * Default implementation uses IPI. Architectures with NMI support + * can override this for more reliable delivery. + * + * Return: 0 on success, negative errno on failure + */ +int __weak panic_smp_redirect_cpu(int target_cpu, void *msg) +{ + static call_single_data_t panic_csd; + + panic_csd.func = do_panic_on_target_cpu; + panic_csd.info = msg; + + return smp_call_function_single_async(target_cpu, &panic_csd); +} + +/** + * panic_try_force_cpu - Redirect panic to a specific CPU for crash kernel + * @fmt: panic message format string + * @args: arguments for format string + * + * Some platforms require panic handling to occur on a specific CPU + * for the crash kernel to function correctly. This function redirects + * panic handling to the CPU specified via the panic_force_cpu= boot parameter. + * + * Returns false if panic should proceed on current CPU. + * Returns true if panic was redirected. 
+ */ +__printf(1, 0) +static bool panic_try_force_cpu(const char *fmt, va_list args) +{ + int this_cpu = raw_smp_processor_id(); + int old_cpu = PANIC_CPU_INVALID; + const char *msg; + + /* Feature not enabled via boot parameter */ + if (panic_force_cpu < 0) + return false; + + /* Already on target CPU - proceed normally */ + if (this_cpu == panic_force_cpu) + return false; + + /* Target CPU is offline, can't redirect */ + if (!cpu_online(panic_force_cpu)) { + pr_warn("panic: target CPU %d is offline, continuing on CPU %d\n", + panic_force_cpu, this_cpu); + return false; + } + + /* Another panic already in progress */ + if (panic_in_progress()) + return false; + + /* + * Only one CPU can do the redirect. Use atomic cmpxchg to ensure + * we don't race with another CPU also trying to redirect. + */ + if (!atomic_try_cmpxchg(&panic_redirect_cpu, &old_cpu, this_cpu)) + return false; + + /* + * Use dynamically allocated buffer if available, otherwise + * fall back to static message for early boot panics or allocation failure. 
+ */ + if (panic_force_buf) { + vsnprintf(panic_force_buf, PANIC_MSG_BUFSZ, fmt, args); + msg = panic_force_buf; + } else { + msg = "Redirected panic (buffer unavailable)"; + } + + console_verbose(); + bust_spinlocks(1); + + pr_emerg("panic: Redirecting from CPU %d to CPU %d for crash kernel.\n", + this_cpu, panic_force_cpu); + + /* Dump original CPU before redirecting */ + if (!test_taint(TAINT_DIE) && + oops_in_progress <= 1 && + IS_ENABLED(CONFIG_DEBUG_BUGVERBOSE)) { + dump_stack(); + } + + if (panic_smp_redirect_cpu(panic_force_cpu, (void *)msg) != 0) { + atomic_set(&panic_redirect_cpu, PANIC_CPU_INVALID); + pr_warn("panic: failed to redirect to CPU %d, continuing on CPU %d\n", + panic_force_cpu, this_cpu); + return false; + } + + /* IPI/NMI sent, this CPU should stop */ + return true; +} +#else +__printf(1, 0) +static inline bool panic_try_force_cpu(const char *fmt, va_list args) +{ + return false; +} +#endif /* CONFIG_SMP && CONFIG_CRASH_DUMP */ bool panic_try_start(void) { @@ -428,7 +575,7 @@ static void panic_other_cpus_shutdown(bool crash_kexec) */ void vpanic(const char *fmt, va_list args) { - static char buf[1024]; + static char buf[PANIC_MSG_BUFSZ]; long i, i_next = 0, len; int state = 0; bool _crash_kexec_post_notifiers = crash_kexec_post_notifiers; @@ -452,6 +599,15 @@ void vpanic(const char *fmt, va_list args) local_irq_disable(); preempt_disable_notrace(); + /* Redirect panic to target CPU if configured via panic_force_cpu=. */ + if (panic_try_force_cpu(fmt, args)) { + /* + * Mark ourselves offline so panic_other_cpus_shutdown() won't wait + * for us on architectures that check num_online_cpus(). + */ + set_cpu_online(smp_processor_id(), false); + panic_smp_self_stop(); + } /* * It's possible to come here directly from a panic-assertion and * not have preempt disabled. 
Some functions called from here want @@ -484,7 +640,11 @@ void vpanic(const char *fmt, va_list args) /* * Avoid nested stack-dumping if a panic occurs during oops processing */ - if (test_taint(TAINT_DIE) || oops_in_progress > 1) { + if (atomic_read(&panic_redirect_cpu) != PANIC_CPU_INVALID && + panic_force_cpu == raw_smp_processor_id()) { + pr_emerg("panic: Redirected from CPU %d, skipping stack dump.\n", + atomic_read(&panic_redirect_cpu)); + } else if (test_taint(TAINT_DIE) || oops_in_progress > 1) { panic_this_cpu_backtrace_printed = true; } else if (IS_ENABLED(CONFIG_DEBUG_BUGVERBOSE)) { dump_stack(); -- cgit v1.2.3 From 989b3c5af63ecb1cbaf1598fe3f79865538bc1ea Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Thu, 18 Dec 2025 10:57:48 -0500 Subject: list: add primitives for private list manipulations Patch series "list private v2 & luo flb", v9. This series introduces two connected infrastructure improvements: a new API for handling private linked lists, and the "File-Lifecycle-Bound" (FLB) mechanism for the Live Update Orchestrator. 1. Private List Primitives (patches 1-3) Recently, Linux introduced the ability to mark structure members as __private and access them via ACCESS_PRIVATE(). This enforces better encapsulation by ensuring internal details are only accessible by the owning subsystem. However, struct list_head is frequently used as an internal linkage mechanism within these private sections. The standard macros in do not support ACCESS_PRIVATE() natively. Consequently, subsystems using private lists are forced to implement ad-hoc workarounds or local iterator macros. This series adds , providing a set of primitives identical to those in but designed for private list heads. It also includes a KUnit test suite to verify that the macros correctly handle pointer offsets and qualifiers. 2. This series adds FLB (patches 4-5) support to Live Update that also internally uses private lists. 
FLB allows global kernel state (such as IOMMU domains or HugeTLB state) to
be preserved once, shared across multiple file descriptors, and restored
when needed. This is necessary for subsystems where multiple preserved
file descriptors depend on a single, shared underlying resource.
Preserving this state for each individual file would be redundant and
incorrect.

FLB uses reference counting tied to the lifecycle of preserved files. The
state is preserved when the first file depending on it is preserved, and
restored or cleaned up only when the last file is handled.

This patch (of 5):

Linux recently added an ability to add private members to structs (i.e.
__private) and access them via ACCESS_PRIVATE(). This ensures that those
members are only accessible by the subsystem which owns the struct type,
and not to the object owner.

However, struct list_head often needs to be placed into the private
section to be manipulated privately by the subsystem.

Add macros to support private list manipulations in .

[akpm@linux-foundation.org: fix kerneldoc]
Link: https://lkml.kernel.org/r/20251218155752.3045808-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20251218155752.3045808-2-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin
Cc: Alexander Graf
Cc: David Gow
Cc: David Matlack
Cc: David Rientjes
Cc: Jonathan Corbet
Cc: Kees Cook
Cc: Mike Rapoport
Cc: Petr Mladek
Cc: Pratyush Yadav
Cc: Samiullah Khawaja
Cc: Tamir Duberstein
Signed-off-by: Andrew Morton
---
 Documentation/core-api/list.rst |   9 ++
 include/linux/list_private.h    | 256 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 265 insertions(+)
 create mode 100644 include/linux/list_private.h

(limited to 'include/linux')

diff --git a/Documentation/core-api/list.rst b/Documentation/core-api/list.rst
index 86873ce9adbf..241464ca0549 100644
--- a/Documentation/core-api/list.rst
+++ b/Documentation/core-api/list.rst
@@ -774,3 +774,12 @@ Full List API
 
 ..
kernel-doc:: include/linux/list.h
    :internal:
+
+Private List API
+================
+
+.. kernel-doc:: include/linux/list_private.h
+   :doc: Private List Primitives
+
+.. kernel-doc:: include/linux/list_private.h
+   :internal:
diff --git a/include/linux/list_private.h b/include/linux/list_private.h
new file mode 100644
index 000000000000..19b01d16beda
--- /dev/null
+++ b/include/linux/list_private.h
@@ -0,0 +1,256 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin
+ */
+#ifndef _LINUX_LIST_PRIVATE_H
+#define _LINUX_LIST_PRIVATE_H
+
+/**
+ * DOC: Private List Primitives
+ *
+ * Provides a set of list primitives identical in function to those in
+ * ````, but designed for cases where the embedded
+ * ``&struct list_head`` is private member.
+ */
+
+#include
+#include
+
+#define __list_private_offset(type, member) \
+	((size_t)(&ACCESS_PRIVATE(((type *)0), member)))
+
+/**
+ * list_private_entry - get the struct for this entry
+ * @ptr: the &struct list_head pointer.
+ * @type: the type of the struct this is embedded in.
+ * @member: the identifier passed to ACCESS_PRIVATE.
+ */
+#define list_private_entry(ptr, type, member) ({ \
+	const struct list_head *__mptr = (ptr); \
+	(type *)((char *)__mptr - __list_private_offset(type, member)); \
+})
+
+/**
+ * list_private_first_entry - get the first element from a list
+ * @ptr: the list head to take the element from.
+ * @type: the type of the struct this is embedded in.
+ * @member: the identifier passed to ACCESS_PRIVATE.
+ */
+#define list_private_first_entry(ptr, type, member) \
+	list_private_entry((ptr)->next, type, member)
+
+/**
+ * list_private_last_entry - get the last element from a list
+ * @ptr: the list head to take the element from.
+ * @type: the type of the struct this is embedded in.
+ * @member: the identifier passed to ACCESS_PRIVATE.
+ */
+#define list_private_last_entry(ptr, type, member) \
+	list_private_entry((ptr)->prev, type, member)
+
+/**
+ * list_private_next_entry - get the next element in list
+ * @pos: the type * to cursor
+ * @member: the name of the list_head within the struct.
+ */
+#define list_private_next_entry(pos, member) \
+	list_private_entry(ACCESS_PRIVATE(pos, member).next, typeof(*(pos)), member)
+
+/**
+ * list_private_next_entry_circular - get the next element in list
+ * @pos: the type * to cursor.
+ * @head: the list head to take the element from.
+ * @member: the name of the list_head within the struct.
+ *
+ * Wraparound if pos is the last element (return the first element).
+ * Note, that list is expected to be not empty.
+ */
+#define list_private_next_entry_circular(pos, head, member) \
+	(list_is_last(&ACCESS_PRIVATE(pos, member), head) ? \
+	list_private_first_entry(head, typeof(*(pos)), member) : \
+	list_private_next_entry(pos, member))
+
+/**
+ * list_private_prev_entry - get the prev element in list
+ * @pos: the type * to cursor
+ * @member: the name of the list_head within the struct.
+ */
+#define list_private_prev_entry(pos, member) \
+	list_private_entry(ACCESS_PRIVATE(pos, member).prev, typeof(*(pos)), member)
+
+/**
+ * list_private_prev_entry_circular - get the prev element in list
+ * @pos: the type * to cursor.
+ * @head: the list head to take the element from.
+ * @member: the name of the list_head within the struct.
+ *
+ * Wraparound if pos is the first element (return the last element).
+ * Note, that list is expected to be not empty.
+ */
+#define list_private_prev_entry_circular(pos, head, member) \
+	(list_is_first(&ACCESS_PRIVATE(pos, member), head) ? \
+	list_private_last_entry(head, typeof(*(pos)), member) : \
+	list_private_prev_entry(pos, member))
+
+/**
+ * list_private_entry_is_head - test if the entry points to the head of the list
+ * @pos: the type * to cursor
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ */
+#define list_private_entry_is_head(pos, head, member) \
+	list_is_head(&ACCESS_PRIVATE(pos, member), (head))
+
+/**
+ * list_private_for_each_entry - iterate over list of given type
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ */
+#define list_private_for_each_entry(pos, head, member) \
+	for (pos = list_private_first_entry(head, typeof(*pos), member); \
+	     !list_private_entry_is_head(pos, head, member); \
+	     pos = list_private_next_entry(pos, member))
+
+/**
+ * list_private_for_each_entry_reverse - iterate backwards over list of given type.
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ */
+#define list_private_for_each_entry_reverse(pos, head, member) \
+	for (pos = list_private_last_entry(head, typeof(*pos), member); \
+	     !list_private_entry_is_head(pos, head, member); \
+	     pos = list_private_prev_entry(pos, member))
+
+/**
+ * list_private_for_each_entry_continue - continue iteration over list of given type
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Continue to iterate over list of given type, continuing after
+ * the current position.
+ */
+#define list_private_for_each_entry_continue(pos, head, member) \
+	for (pos = list_private_next_entry(pos, member); \
+	     !list_private_entry_is_head(pos, head, member); \
+	     pos = list_private_next_entry(pos, member))
+
+/**
+ * list_private_for_each_entry_continue_reverse - iterate backwards from the given point
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Start to iterate over list of given type backwards, continuing after
+ * the current position.
+ */
+#define list_private_for_each_entry_continue_reverse(pos, head, member) \
+	for (pos = list_private_prev_entry(pos, member); \
+	     !list_private_entry_is_head(pos, head, member); \
+	     pos = list_private_prev_entry(pos, member))
+
+/**
+ * list_private_for_each_entry_from - iterate over list of given type from the current point
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Iterate over list of given type, continuing from current position.
+ */
+#define list_private_for_each_entry_from(pos, head, member) \
+	for (; !list_private_entry_is_head(pos, head, member); \
+	     pos = list_private_next_entry(pos, member))
+
+/**
+ * list_private_for_each_entry_from_reverse - iterate backwards over list of given type
+ *                                            from the current point
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Iterate backwards over list of given type, continuing from current position.
+ */
+#define list_private_for_each_entry_from_reverse(pos, head, member) \
+	for (; !list_private_entry_is_head(pos, head, member); \
+	     pos = list_private_prev_entry(pos, member))
+
+/**
+ * list_private_for_each_entry_safe - iterate over list of given type safe against removal of list entry
+ * @pos: the type * to use as a loop cursor.
+ * @n: another type * to use as temporary storage
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ */
+#define list_private_for_each_entry_safe(pos, n, head, member) \
+	for (pos = list_private_first_entry(head, typeof(*pos), member), \
+	     n = list_private_next_entry(pos, member); \
+	     !list_private_entry_is_head(pos, head, member); \
+	     pos = n, n = list_private_next_entry(n, member))
+
+/**
+ * list_private_for_each_entry_safe_continue - continue list iteration safe against removal
+ * @pos: the type * to use as a loop cursor.
+ * @n: another type * to use as temporary storage
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Iterate over list of given type, continuing after current point,
+ * safe against removal of list entry.
+ */
+#define list_private_for_each_entry_safe_continue(pos, n, head, member) \
+	for (pos = list_private_next_entry(pos, member), \
+	     n = list_private_next_entry(pos, member); \
+	     !list_private_entry_is_head(pos, head, member); \
+	     pos = n, n = list_private_next_entry(n, member))
+
+/**
+ * list_private_for_each_entry_safe_from - iterate over list from current point safe against removal
+ * @pos: the type * to use as a loop cursor.
+ * @n: another type * to use as temporary storage
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Iterate over list of given type from current point, safe against
+ * removal of list entry.
+ */
+#define list_private_for_each_entry_safe_from(pos, n, head, member) \
+	for (n = list_private_next_entry(pos, member); \
+	     !list_private_entry_is_head(pos, head, member); \
+	     pos = n, n = list_private_next_entry(n, member))
+
+/**
+ * list_private_for_each_entry_safe_reverse - iterate backwards over list safe against removal
+ * @pos: the type * to use as a loop cursor.
+ * @n: another type * to use as temporary storage
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Iterate backwards over list of given type, safe against removal
+ * of list entry.
+ */
+#define list_private_for_each_entry_safe_reverse(pos, n, head, member) \
+	for (pos = list_private_last_entry(head, typeof(*pos), member), \
+	     n = list_private_prev_entry(pos, member); \
+	     !list_private_entry_is_head(pos, head, member); \
+	     pos = n, n = list_private_prev_entry(n, member))
+
+/**
+ * list_private_safe_reset_next - reset a stale list_for_each_entry_safe loop
+ * @pos: the loop cursor used in the list_for_each_entry_safe loop
+ * @n: temporary storage used in list_for_each_entry_safe
+ * @member: the name of the list_head within the struct.
+ *
+ * list_safe_reset_next is not safe to use in general if the list may be
+ * modified concurrently (eg. the lock is dropped in the loop body). An
+ * exception to this is if the cursor element (pos) is pinned in the list,
+ * and list_safe_reset_next is called after re-taking the lock and before
+ * completing the current iteration of the loop body.
+ */
+#define list_private_safe_reset_next(pos, n, member) \
+	n = list_private_next_entry(pos, member)
+
+#endif /* _LINUX_LIST_PRIVATE_H */
-- 
cgit v1.2.3

From cab056f2aae7250af50e503b81a80dfc567a1acd Mon Sep 17 00:00:00 2001
From: Pasha Tatashin
Date: Thu, 18 Dec 2025 10:57:51 -0500
Subject: liveupdate: luo_flb: introduce File-Lifecycle-Bound global state

Introduce a mechanism for managing global kernel state whose lifecycle is
tied to the preservation of one or more files. This is necessary for
subsystems where multiple preserved file descriptors depend on a single,
shared underlying resource.

An example is HugeTLB, where multiple file descriptors such as memfd and
guest_memfd may rely on the state of a single HugeTLB subsystem.
Preserving this state for each individual file would be redundant and
incorrect. The state should be preserved only once when the first file is
preserved, and restored/finished only once the last file is handled.

This patch introduces File-Lifecycle-Bound (FLB) objects to solve this
problem.
An FLB is a global, reference-counted object with a defined set of
operations:

- A file handler (struct liveupdate_file_handler) declares a dependency
  on one or more FLBs via a new registration function,
  liveupdate_register_flb().

- When the first file depending on an FLB is preserved, the FLB's
  .preserve() callback is invoked to save the shared global state. The
  reference count is then incremented for each subsequent file.

- Conversely, when the last file is unpreserved (before reboot) or
  finished (after reboot), the FLB's .unpreserve() or .finish() callback
  is invoked to clean up the global resource.

The implementation includes:

- A new set of ABI definitions (luo_flb_ser, luo_flb_header_ser) and a
  corresponding FDT node (luo-flb) to serialize the state of all active
  FLBs and pass them via Kexec Handover.

- Core logic in luo_flb.c to manage FLB registration, reference
  counting, and the invocation of lifecycle callbacks.

- An API (liveupdate_flb_get_incoming() and liveupdate_flb_get_outgoing())
  for other kernel subsystems to safely access the live object managed by
  an FLB, both before and after the live update.

This framework provides the necessary infrastructure for more complex
subsystems like IOMMU, VFIO, and KVM to integrate with the Live Update
Orchestrator.
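
The first-file/last-file rule described above can be modelled in plain
userspace C. What follows is a minimal sketch under stated assumptions,
not the luo_flb.c implementation: every demo_* name is hypothetical,
there is no locking, and the callbacks only count how often they run.

```c
#include <assert.h>

/* Userspace model only: every demo_* name is hypothetical, not kernel API. */
struct demo_flb {
	long count;		/* files currently depending on this FLB */
	int preserve_calls;	/* times the preserve callback ran */
	int unpreserve_calls;	/* times the unpreserve callback ran */
	int (*preserve)(struct demo_flb *flb);
	void (*unpreserve)(struct demo_flb *flb);	/* optional */
};

/* Example callbacks: a real FLB would save or drop shared state here. */
int demo_preserve_cb(struct demo_flb *flb)
{
	flb->preserve_calls++;
	return 0;
}

void demo_unpreserve_cb(struct demo_flb *flb)
{
	flb->unpreserve_calls++;
}

/* Called once per preserved file; only the first file triggers preserve. */
int demo_file_preserve(struct demo_flb *flb)
{
	if (flb->count == 0) {
		int err = flb->preserve(flb);

		if (err)
			return err;	/* count stays at 0 on failure */
	}
	flb->count++;
	return 0;
}

/* Called once per unpreserved file; only the last file triggers unpreserve. */
void demo_file_unpreserve(struct demo_flb *flb)
{
	assert(flb->count > 0);
	flb->count--;
	if (flb->count == 0 && flb->unpreserve)
		flb->unpreserve(flb);
}
```

Preserving two files against one demo_flb runs the preserve callback
once, and the unpreserve callback runs only when the second file is
dropped. The patch itself additionally serializes each transition with a
per-direction mutex and keeps separate incoming and outgoing state.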
Link: https://lkml.kernel.org/r/20251218155752.3045808-5-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin
Cc: Alexander Graf
Cc: David Gow
Cc: David Matlack
Cc: David Rientjes
Cc: Jonathan Corbet
Cc: Kees Cook
Cc: Mike Rapoport
Cc: Petr Mladek
Cc: Pratyush Yadav
Cc: Samiullah Khawaja
Cc: Tamir Duberstein
Signed-off-by: Andrew Morton
---
 Documentation/core-api/liveupdate.rst |  11 +
 include/linux/kho/abi/luo.h           |  76 ++++
 include/linux/liveupdate.h            | 147 ++++++++
 kernel/liveupdate/Makefile            |   1 +
 kernel/liveupdate/luo_core.c          |   7 +-
 kernel/liveupdate/luo_file.c          |  24 +-
 kernel/liveupdate/luo_flb.c           | 654 ++++++++++++++++++++++++++++++++++
 kernel/liveupdate/luo_internal.h      |   7 +
 8 files changed, 924 insertions(+), 3 deletions(-)
 create mode 100644 kernel/liveupdate/luo_flb.c

(limited to 'include/linux')

diff --git a/Documentation/core-api/liveupdate.rst b/Documentation/core-api/liveupdate.rst
index e2aba13494cf..5a292d0f3706 100644
--- a/Documentation/core-api/liveupdate.rst
+++ b/Documentation/core-api/liveupdate.rst
@@ -18,6 +18,11 @@ LUO Preserving File Descriptors
 .. kernel-doc:: kernel/liveupdate/luo_file.c
    :doc: LUO File Descriptors
 
+LUO File Lifecycle Bound Global Data
+====================================
+.. kernel-doc:: kernel/liveupdate/luo_flb.c
+   :doc: LUO File Lifecycle Bound Global Data
+
 Live Update Orchestrator ABI
 ============================
 .. kernel-doc:: include/linux/kho/abi/luo.h
@@ -40,6 +45,9 @@ Public API
 .. kernel-doc:: kernel/liveupdate/luo_core.c
    :export:
 
+.. kernel-doc:: kernel/liveupdate/luo_flb.c
+   :export:
+
 .. kernel-doc:: kernel/liveupdate/luo_file.c
    :export:
 
@@ -48,6 +56,9 @@ Internal API
 .. kernel-doc:: kernel/liveupdate/luo_core.c
    :internal:
 
+.. kernel-doc:: kernel/liveupdate/luo_flb.c
+   :internal:
+
 ..
kernel-doc:: kernel/liveupdate/luo_session.c
    :internal:
 
diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h
index beb86847b544..a44010aafb5e 100644
--- a/include/linux/kho/abi/luo.h
+++ b/include/linux/kho/abi/luo.h
@@ -37,6 +37,11 @@
  *           compatible = "luo-session-v1";
  *           luo-session-header = ;
  *       };
+ *
+ *       luo-flb {
+ *           compatible = "luo-flb-v1";
+ *           luo-flb-header = ;
+ *       };
  *   };
  *
  * Main LUO Node (/):
@@ -56,6 +61,17 @@
  *     is the header for a contiguous block of memory containing an array of
  *     `struct luo_session_ser`, one for each preserved session.
  *
+ * File-Lifecycle-Bound Node (luo-flb):
+ *   This node describes all preserved global objects whose lifecycle is bound
+ *   to that of the preserved files (e.g., shared IOMMU state).
+ *
+ *   - compatible: "luo-flb-v1"
+ *     Identifies the FLB ABI version.
+ *   - luo-flb-header: u64
+ *     The physical address of a `struct luo_flb_header_ser`. This structure is
+ *     the header for a contiguous block of memory containing an array of
+ *     `struct luo_flb_ser`, one for each preserved global object.
+ *
  * Serialization Structures:
  *   The FDT properties point to memory regions containing arrays of simple,
  *   `__packed` structures. These structures contain the actual preserved state.
@@ -74,6 +90,16 @@
  *     Metadata for a single preserved file. Contains the `compatible` string to
  *     find the correct handler in the new kernel, a user-provided `token` for
  *     identification, and an opaque `data` handle for the handler to use.
+ *
+ *   - struct luo_flb_header_ser:
+ *     Header for the FLB array. Contains the total page count of the
+ *     preserved memory block and the number of `struct luo_flb_ser` entries
+ *     that follow.
+ *
+ *   - struct luo_flb_ser:
+ *     Metadata for a single preserved global object. Contains its `name`
+ *     (compatible string), an opaque `data` handle, and the `count`
+ *     number of files depending on it.
  */
 
 #ifndef _LINUX_KHO_ABI_LUO_H
@@ -163,4 +189,54 @@ struct luo_session_ser {
 	struct luo_file_set_ser file_set_ser;
 } __packed;
 
+/* The max size is set so it can be reliably used during in serialization */
+#define LIVEUPDATE_FLB_COMPAT_LENGTH	48
+
+#define LUO_FDT_FLB_NODE_NAME	"luo-flb"
+#define LUO_FDT_FLB_COMPATIBLE	"luo-flb-v1"
+#define LUO_FDT_FLB_HEADER	"luo-flb-header"
+
+/**
+ * struct luo_flb_header_ser - Header for the serialized FLB data block.
+ * @pgcnt: The total number of pages occupied by the entire preserved memory
+ *         region, including this header and the subsequent array of
+ *         &struct luo_flb_ser entries.
+ * @count: The number of &struct luo_flb_ser entries that follow this header
+ *         in the memory block.
+ *
+ * This structure is located at the physical address specified by the
+ * `LUO_FDT_FLB_HEADER` FDT property. It provides the new kernel with the
+ * necessary information to find and iterate over the array of preserved
+ * File-Lifecycle-Bound objects and to manage the underlying memory.
+ *
+ * If this structure is modified, LUO_FDT_FLB_COMPATIBLE must be updated.
+ */
+struct luo_flb_header_ser {
+	u64 pgcnt;
+	u64 count;
+} __packed;
+
+/**
+ * struct luo_flb_ser - Represents the serialized state of a single FLB object.
+ * @name:  The unique compatibility string of the FLB object, used to find the
+ *         corresponding &struct liveupdate_flb handler in the new kernel.
+ * @data:  The opaque u64 handle returned by the FLB's .preserve() operation
+ *         in the old kernel. This handle encapsulates the entire state needed
+ *         for restoration.
+ * @count: The reference count at the time of serialization; i.e., the number
+ *         of preserved files that depended on this FLB. This is used by the
+ *         new kernel to correctly manage the FLB's lifecycle.
+ *
+ * An array of these structures is created in a preserved memory region and
+ * passed to the new kernel. Each entry allows the LUO core to restore one
+ * global, shared object.
+ *
+ * If this structure is modified, LUO_FDT_FLB_COMPATIBLE must be updated.
+ */
+struct luo_flb_ser {
+	char name[LIVEUPDATE_FLB_COMPAT_LENGTH];
+	u64 data;
+	u64 count;
+} __packed;
+
 #endif /* _LINUX_KHO_ABI_LUO_H */
diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
index a7f6ee5b6771..fe82a6c3005f 100644
--- a/include/linux/liveupdate.h
+++ b/include/linux/liveupdate.h
@@ -11,10 +11,13 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
 struct liveupdate_file_handler;
+struct liveupdate_flb;
+struct liveupdate_session;
 struct file;
 
 /**
@@ -99,6 +102,118 @@ struct liveupdate_file_handler {
 	 * registered file handlers.
 	 */
 	struct list_head __private list;
+	/* A list of FLB dependencies. */
+	struct list_head __private flb_list;
+};
+
+/**
+ * struct liveupdate_flb_op_args - Arguments for FLB operation callbacks.
+ * @flb:  The global FLB instance for which this call is performed.
+ * @data: For .preserve(): [OUT] The callback sets this field.
+ *        For .unpreserve(): [IN] The handle from .preserve().
+ *        For .retrieve(): [IN] The handle from .preserve().
+ * @obj:  For .preserve(): [OUT] Sets this to the live object.
+ *        For .retrieve(): [OUT] Sets this to the live object.
+ *        For .finish(): [IN] The live object from .retrieve().
+ *
+ * This structure bundles all parameters for the FLB operation callbacks.
+ */
+struct liveupdate_flb_op_args {
+	struct liveupdate_flb *flb;
+	u64 data;
+	void *obj;
+};
+
+/**
+ * struct liveupdate_flb_ops - Callbacks for global File-Lifecycle-Bound data.
+ * @preserve:   Called when the first file using this FLB is preserved.
+ *              The callback must save its state and return a single,
+ *              self-contained u64 handle by setting the 'argp->data'
+ *              field and 'argp->obj'.
+ * @unpreserve: Called when the last file using this FLB is unpreserved
+ *              (aborted before reboot). Receives the handle via
+ *              'argp->data' and live object via 'argp->obj'.
+ * @retrieve:   Called on-demand in the new kernel, the first time a
+ *              component requests access to the shared object. It receives
+ *              the preserved handle via 'argp->data' and must reconstruct
+ *              the live object, returning it by setting the 'argp->obj'
+ *              field.
+ * @finish:     Called in the new kernel when the last file using this FLB
+ *              is finished. Receives the live object via 'argp->obj' for
+ *              cleanup.
+ * @owner:      Module reference
+ *
+ * Operations that manage global shared data with file bound lifecycle,
+ * triggered by the first file that uses it and concluded by the last file that
+ * uses it, across all sessions.
+ */
+struct liveupdate_flb_ops {
+	int (*preserve)(struct liveupdate_flb_op_args *argp);
+	void (*unpreserve)(struct liveupdate_flb_op_args *argp);
+	int (*retrieve)(struct liveupdate_flb_op_args *argp);
+	void (*finish)(struct liveupdate_flb_op_args *argp);
+	struct module *owner;
+};
+
+/*
+ * struct luo_flb_private_state - Private FLB state structures.
+ * @count: The number of preserved files currently depending on this FLB.
+ *         This is used to trigger the preserve/unpreserve/finish ops on the
+ *         first/last file.
+ * @data:  The opaque u64 handle returned by .preserve() or passed to
+ *         .retrieve().
+ * @obj:   The live kernel object returned by .preserve() or .retrieve().
+ * @lock:  A mutex that protects all fields within this structure, providing
+ *         the synchronization service for the FLB's ops.
+ * @finished: True once the FLB's finish() callback has run.
+ * @retrieved: True once the FLB's retrieve() callback has run.
+ */
+struct luo_flb_private_state {
+	long count;
+	u64 data;
+	void *obj;
+	struct mutex lock;
+	bool finished;
+	bool retrieved;
+};
+
+/*
+ * struct luo_flb_private - Keep separate incoming and outgoing states.
+ * @list:        A global list of registered FLBs.
+ * @outgoing:    The runtime state for the pre-reboot
+ *               (preserve/unpreserve) lifecycle.
+ * @incoming:    The runtime state for the post-reboot (retrieve/finish)
+ *               lifecycle.
+ * @users:       With how many File-Handlers this FLB is registered.
+ * @initialized: true when private fields have been initialized.
+ */
+struct luo_flb_private {
+	struct list_head list;
+	struct luo_flb_private_state outgoing;
+	struct luo_flb_private_state incoming;
+	int users;
+	bool initialized;
+};
+
+/**
+ * struct liveupdate_flb - A global definition for a shared data object.
+ * @ops:        Callback functions
+ * @compatible: The compatibility string (e.g., "iommu-core-v1"
+ *              that uniquely identifies the FLB type this handler
+ *              supports. This is matched against the compatible string
+ *              associated with individual &struct liveupdate_flb
+ *              instances.
+ *
+ * This struct is the "template" that a driver registers to define a shared,
+ * file-lifecycle-bound object. The actual runtime state (the live object,
+ * refcount, etc.) is managed privately by the LUO core.
+ */
+struct liveupdate_flb {
+	const struct liveupdate_flb_ops *ops;
+	const char compatible[LIVEUPDATE_FLB_COMPAT_LENGTH];
+
+	/* private: */
+	struct luo_flb_private __private private;
 };
 
 #ifdef CONFIG_LIVEUPDATE
@@ -112,6 +227,14 @@ int liveupdate_reboot(void);
 int liveupdate_register_file_handler(struct liveupdate_file_handler *fh);
 int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh);
 
+int liveupdate_register_flb(struct liveupdate_file_handler *fh,
+			    struct liveupdate_flb *flb);
+int liveupdate_unregister_flb(struct liveupdate_file_handler *fh,
+			      struct liveupdate_flb *flb);
+
+int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp);
+int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp);
+
 #else /* CONFIG_LIVEUPDATE */
 
 static inline bool liveupdate_enabled(void)
@@ -134,5 +257,29 @@ static inline int liveupdate_unregister_file_handler(struct liveupdate_file_hand
 	return -EOPNOTSUPP;
 }
 
+static inline int liveupdate_register_flb(struct
liveupdate_file_handler *fh,
+					  struct liveupdate_flb *flb)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int liveupdate_unregister_flb(struct liveupdate_file_handler *fh,
+					    struct liveupdate_flb *flb)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int liveupdate_flb_get_incoming(struct liveupdate_flb *flb,
+					      void **objp)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb,
+					      void **objp)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_LIVEUPDATE */
 #endif /* _LINUX_LIVEUPDATE_H */
diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile
index 7cad2eece32d..d2f779cbe279 100644
--- a/kernel/liveupdate/Makefile
+++ b/kernel/liveupdate/Makefile
@@ -3,6 +3,7 @@
 luo-y := \
 		luo_core.o \
 		luo_file.o \
+		luo_flb.o \
 		luo_session.o
 
 obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o
diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c
index a26c093eb8eb..dda7bb57d421 100644
--- a/kernel/liveupdate/luo_core.c
+++ b/kernel/liveupdate/luo_core.c
@@ -127,7 +127,9 @@ static int __init luo_early_startup(void)
 	if (err)
 		return err;
 
-	return 0;
+	err = luo_flb_setup_incoming(luo_global.fdt_in);
+
+	return err;
 }
 
 static int __init liveupdate_early_init(void)
@@ -164,6 +166,7 @@ static int __init luo_fdt_setup(void)
 	err |= fdt_property_string(fdt_out, "compatible", LUO_FDT_COMPATIBLE);
 	err |= fdt_property(fdt_out, LUO_FDT_LIVEUPDATE_NUM, &ln, sizeof(ln));
 	err |= luo_session_setup_outgoing(fdt_out);
+	err |= luo_flb_setup_outgoing(fdt_out);
 	err |= fdt_end_node(fdt_out);
 	err |= fdt_finish(fdt_out);
 	if (err)
@@ -225,6 +228,8 @@ int liveupdate_reboot(void)
 	if (err)
 		return err;
 
+	luo_flb_serialize();
+
 	err = kho_finalize();
 	if (err) {
 		pr_err("kho_finalize failed %d\n", err);
diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c
index 1a8a1bb73a58..cade273c50c9 100644
--- a/kernel/liveupdate/luo_file.c
+++ b/kernel/liveupdate/luo_file.c
@@ -285,10 +285,14 @@ int luo_preserve_file(struct luo_file_set
*file_set, u64 token, int fd)
 	if (err)
 		goto err_free_files_mem;
 
+	err = luo_flb_file_preserve(fh);
+	if (err)
+		goto err_free_files_mem;
+
 	luo_file = kzalloc(sizeof(*luo_file), GFP_KERNEL);
 	if (!luo_file) {
 		err = -ENOMEM;
-		goto err_free_files_mem;
+		goto err_flb_unpreserve;
 	}
 
 	luo_file->file = file;
@@ -312,6 +316,8 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd)
 
 err_kfree:
 	kfree(luo_file);
+err_flb_unpreserve:
+	luo_flb_file_unpreserve(fh);
 err_free_files_mem:
 	luo_free_files_mem(file_set);
 err_fput:
@@ -353,6 +359,7 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set)
 		args.serialized_data = luo_file->serialized_data;
 		args.private_data = luo_file->private_data;
 		luo_file->fh->ops->unpreserve(&args);
+		luo_flb_file_unpreserve(luo_file->fh);
 
 		list_del(&luo_file->list);
 		file_set->count--;
@@ -630,6 +637,7 @@ static void luo_file_finish_one(struct luo_file_set *file_set,
 	args.retrieved = luo_file->retrieved;
 
 	luo_file->fh->ops->finish(&args);
+	luo_flb_file_finish(luo_file->fh);
 }
 
 /**
@@ -851,6 +859,7 @@ int liveupdate_register_file_handler(struct liveupdate_file_handler *fh)
 		goto err_resume;
 	}
 
+	INIT_LIST_HEAD(&ACCESS_PRIVATE(fh, flb_list));
 	INIT_LIST_HEAD(&ACCESS_PRIVATE(fh, list));
 	list_add_tail(&ACCESS_PRIVATE(fh, list), &luo_file_handler_list);
 	luo_session_resume();
@@ -871,23 +880,34 @@ err_resume:
  *
  * It ensures safe removal by checking that:
  * No live update session is currently in progress.
+ * No FLB registered with this file handler.
  *
  * If the unregistration fails, the internal test state is reverted.
  *
  * Return: 0 Success. -EOPNOTSUPP when live update is not enabled. -EBUSY A live
- * update is in progress, can't quiesce live update.
+ * update is in progress, can't quiesce live update or FLB is registred with
+ * this file handler.
  */
 int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh)
 {
+	int err = -EBUSY;
+
 	if (!liveupdate_enabled())
 		return -EOPNOTSUPP;
 
 	if (!luo_session_quiesce())
 		return -EBUSY;
 
+	if (!list_empty(&ACCESS_PRIVATE(fh, flb_list)))
+		goto err_resume;
+
 	list_del(&ACCESS_PRIVATE(fh, list));
 	module_put(fh->ops->owner);
 	luo_session_resume();
 
 	return 0;
+
+err_resume:
+	luo_session_resume();
+	return err;
 }
diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c
new file mode 100644
index 000000000000..4c437de5c0b0
--- /dev/null
+++ b/kernel/liveupdate/luo_flb.c
@@ -0,0 +1,654 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin
+ */
+
+/**
+ * DOC: LUO File Lifecycle Bound Global Data
+ *
+ * File-Lifecycle-Bound (FLB) objects provide a mechanism for managing global
+ * state that is shared across multiple live-updatable files. The lifecycle of
+ * this shared state is tied to the preservation of the files that depend on it.
+ *
+ * An FLB represents a global resource, such as the IOMMU core state, that is
+ * required by multiple file descriptors (e.g., all VFIO fds).
+ *
+ * The preservation of the FLB's state is triggered when the *first* file
+ * depending on it is preserved. The cleanup of this state (unpreserve or
+ * finish) is triggered when the *last* file depending on it is unpreserved or
+ * finished.
+ *
+ * Handler Dependency: A file handler declares its dependency on one or more
+ * FLBs by registering them via liveupdate_register_flb().
+ *
+ * Callback Model: Each FLB is defined by a set of operations
+ * (&struct liveupdate_flb_ops) that LUO invokes at key points:
+ *
+ * - .preserve(): Called for the first file. Saves global state.
+ * - .unpreserve(): Called for the last file (if aborted pre-reboot).
+ * - .retrieve(): Called on-demand in the new kernel to restore the state.
+ * - .finish(): Called for the last file in the new kernel for cleanup.
+ *
+ * This reference-counted approach ensures that shared state is saved exactly
+ * once and restored exactly once, regardless of how many files depend on it,
+ * and that its lifecycle is correctly managed across the kexec transition.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "luo_internal.h"
+
+#define LUO_FLB_PGCNT		1ul
+#define LUO_FLB_MAX		(((LUO_FLB_PGCNT << PAGE_SHIFT) - \
+		sizeof(struct luo_flb_header_ser)) / sizeof(struct luo_flb_ser))
+
+struct luo_flb_header {
+	struct luo_flb_header_ser *header_ser;
+	struct luo_flb_ser *ser;
+	bool active;
+};
+
+struct luo_flb_global {
+	struct luo_flb_header incoming;
+	struct luo_flb_header outgoing;
+	struct list_head list;
+	long count;
+};
+
+static struct luo_flb_global luo_flb_global = {
+	.list = LIST_HEAD_INIT(luo_flb_global.list),
+};
+
+/*
+ * struct luo_flb_link - Links an FLB definition to a file handler's internal
+ * list of dependencies.
+ * @flb: A pointer to the registered &struct liveupdate_flb definition.
+ * @list: The list_head for linking.
+ */
+struct luo_flb_link {
+	struct liveupdate_flb *flb;
+	struct list_head list;
+};
+
+/* luo_flb_get_private - Access private field, and if needed initialize it.
*/ +static struct luo_flb_private *luo_flb_get_private(struct liveupdate_flb *flb) +{ + struct luo_flb_private *private = &ACCESS_PRIVATE(flb, private); + + if (!private->initialized) { + mutex_init(&private->incoming.lock); + mutex_init(&private->outgoing.lock); + INIT_LIST_HEAD(&private->list); + private->users = 0; + private->initialized = true; + } + + return private; +} + +static int luo_flb_file_preserve_one(struct liveupdate_flb *flb) +{ + struct luo_flb_private *private = luo_flb_get_private(flb); + + scoped_guard(mutex, &private->outgoing.lock) { + if (!private->outgoing.count) { + struct liveupdate_flb_op_args args = {0}; + int err; + + args.flb = flb; + err = flb->ops->preserve(&args); + if (err) + return err; + private->outgoing.data = args.data; + private->outgoing.obj = args.obj; + } + private->outgoing.count++; + } + + return 0; +} + +static void luo_flb_file_unpreserve_one(struct liveupdate_flb *flb) +{ + struct luo_flb_private *private = luo_flb_get_private(flb); + + scoped_guard(mutex, &private->outgoing.lock) { + private->outgoing.count--; + if (!private->outgoing.count) { + struct liveupdate_flb_op_args args = {0}; + + args.flb = flb; + args.data = private->outgoing.data; + args.obj = private->outgoing.obj; + + if (flb->ops->unpreserve) + flb->ops->unpreserve(&args); + + private->outgoing.data = 0; + private->outgoing.obj = NULL; + } + } +} + +static int luo_flb_retrieve_one(struct liveupdate_flb *flb) +{ + struct luo_flb_private *private = luo_flb_get_private(flb); + struct luo_flb_header *fh = &luo_flb_global.incoming; + struct liveupdate_flb_op_args args = {0}; + bool found = false; + int err; + + guard(mutex)(&private->incoming.lock); + + if (private->incoming.finished) + return -ENODATA; + + if (private->incoming.retrieved) + return 0; + + if (!fh->active) + return -ENODATA; + + for (int i = 0; i < fh->header_ser->count; i++) { + if (!strcmp(fh->ser[i].name, flb->compatible)) { + private->incoming.data = fh->ser[i].data; + 
private->incoming.count = fh->ser[i].count; + found = true; + break; + } + } + + if (!found) + return -ENOENT; + + args.flb = flb; + args.data = private->incoming.data; + + err = flb->ops->retrieve(&args); + if (err) + return err; + + private->incoming.obj = args.obj; + private->incoming.retrieved = true; + + return 0; +} + +static void luo_flb_file_finish_one(struct liveupdate_flb *flb) +{ + struct luo_flb_private *private = luo_flb_get_private(flb); + u64 count; + + scoped_guard(mutex, &private->incoming.lock) + count = --private->incoming.count; + + if (!count) { + struct liveupdate_flb_op_args args = {0}; + + if (!private->incoming.retrieved) { + int err = luo_flb_retrieve_one(flb); + + if (WARN_ON(err)) + return; + } + + scoped_guard(mutex, &private->incoming.lock) { + args.flb = flb; + args.obj = private->incoming.obj; + flb->ops->finish(&args); + + private->incoming.data = 0; + private->incoming.obj = NULL; + private->incoming.finished = true; + } + } +} + +/** + * luo_flb_file_preserve - Notifies FLBs that a file is about to be preserved. + * @fh: The file handler for the preserved file. + * + * This function iterates through all FLBs associated with the given file + * handler. It increments the reference count for each FLB. If the count becomes + * 1, it triggers the FLB's .preserve() callback to save the global state. + * + * This operation is atomic. If any FLB's .preserve() op fails, it will roll + * back by calling .unpreserve() on any FLBs that were successfully preserved + * during this call. + * + * Context: Called from luo_preserve_file() + * Return: 0 on success, or a negative errno on failure. 
+ */ +int luo_flb_file_preserve(struct liveupdate_file_handler *fh) +{ + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); + struct luo_flb_link *iter; + int err = 0; + + list_for_each_entry(iter, flb_list, list) { + err = luo_flb_file_preserve_one(iter->flb); + if (err) + goto exit_err; + } + + return 0; + +exit_err: + list_for_each_entry_continue_reverse(iter, flb_list, list) + luo_flb_file_unpreserve_one(iter->flb); + + return err; +} + +/** + * luo_flb_file_unpreserve - Notifies FLBs that a dependent file was unpreserved. + * @fh: The file handler for the unpreserved file. + * + * This function iterates through all FLBs associated with the given file + * handler, in reverse order of registration. It decrements the reference count + * for each FLB. If the count becomes 0, it triggers the FLB's .unpreserve() + * callback to clean up the global state. + * + * Context: Called when a preserved file is being cleaned up before reboot + * (e.g., from luo_file_unpreserve_files()). + */ +void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh) +{ + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); + struct luo_flb_link *iter; + + list_for_each_entry_reverse(iter, flb_list, list) + luo_flb_file_unpreserve_one(iter->flb); +} + +/** + * luo_flb_file_finish - Notifies FLBs that a dependent file has been finished. + * @fh: The file handler for the finished file. + * + * This function iterates through all FLBs associated with the given file + * handler, in reverse order of registration. It decrements the incoming + * reference count for each FLB. If the count becomes 0, it triggers the FLB's + * .finish() callback for final cleanup in the new kernel. + * + * Context: Called from luo_file_finish() for each file being finished. 
+ */ +void luo_flb_file_finish(struct liveupdate_file_handler *fh) +{ + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); + struct luo_flb_link *iter; + + list_for_each_entry_reverse(iter, flb_list, list) + luo_flb_file_finish_one(iter->flb); +} + +/** + * liveupdate_register_flb - Associate an FLB with a file handler and register it globally. + * @fh: The file handler that will now depend on the FLB. + * @flb: The File-Lifecycle-Bound object to associate. + * + * Establishes a dependency, informing the LUO core that whenever a file of + * type @fh is preserved, the state of @flb must also be managed. + * + * On the first registration of a given @flb object, it is added to a global + * registry. This function checks for duplicate registrations, both for a + * specific handler and globally, and ensures the total number of unique + * FLBs does not exceed the system limit. + * + * Context: Typically called from a subsystem's module init function after + * both the handler and the FLB have been defined and initialized. + * Return: 0 on success. Returns a negative errno on failure: + * -EINVAL if arguments are NULL or not initialized. + * -ENOMEM on memory allocation failure. + * -EEXIST if this FLB is already registered with this handler. + * -ENOSPC if the maximum number of global FLBs has been reached. + * -EOPNOTSUPP if live update is disabled or not configured. 
+ */ +int liveupdate_register_flb(struct liveupdate_file_handler *fh, + struct liveupdate_flb *flb) +{ + struct luo_flb_private *private = luo_flb_get_private(flb); + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); + struct luo_flb_link *link __free(kfree) = NULL; + struct liveupdate_flb *gflb; + struct luo_flb_link *iter; + int err; + + if (!liveupdate_enabled()) + return -EOPNOTSUPP; + + if (WARN_ON(!flb->ops->preserve || !flb->ops->unpreserve || + !flb->ops->retrieve || !flb->ops->finish)) { + return -EINVAL; + } + + /* + * File handler must already be registered, as it initializes the + * flb_list + */ + if (WARN_ON(list_empty(&ACCESS_PRIVATE(fh, list)))) + return -EINVAL; + + link = kzalloc(sizeof(*link), GFP_KERNEL); + if (!link) + return -ENOMEM; + + /* + * Ensure the system is quiescent (no active sessions). + * This acts as a global lock for registration: no other thread can + * be in this section, and no sessions can be creating/using FDs. + */ + if (!luo_session_quiesce()) + return -EBUSY; + + /* Check that this FLB is not already linked to this file handler */ + err = -EEXIST; + list_for_each_entry(iter, flb_list, list) { + if (iter->flb == flb) + goto err_resume; + } + + /* + * If this FLB is not linked to global list it's the first time the FLB + * is registered + */ + if (!private->users) { + if (WARN_ON(!list_empty(&private->list))) { + err = -EINVAL; + goto err_resume; + } + + if (luo_flb_global.count == LUO_FLB_MAX) { + err = -ENOSPC; + goto err_resume; + } + + /* Check that compatible string is unique in global list */ + list_private_for_each_entry(gflb, &luo_flb_global.list, private.list) { + if (!strcmp(gflb->compatible, flb->compatible)) + goto err_resume; + } + + if (!try_module_get(flb->ops->owner)) { + err = -EAGAIN; + goto err_resume; + } + + list_add_tail(&private->list, &luo_flb_global.list); + luo_flb_global.count++; + } + + /* Finally, link the FLB to the file handler */ + private->users++; + link->flb = flb; + 
list_add_tail(&no_free_ptr(link)->list, flb_list); + luo_session_resume(); + + return 0; + +err_resume: + luo_session_resume(); + return err; +} + +/** + * liveupdate_unregister_flb - Remove an FLB dependency from a file handler. + * @fh: The file handler that is currently depending on the FLB. + * @flb: The File-Lifecycle-Bound object to remove. + * + * Removes the association between the specified file handler and the FLB + * previously established by liveupdate_register_flb(). + * + * This function manages the global lifecycle of the FLB. It decrements the + * FLB's usage count. If this was the last file handler referencing this FLB, + * the FLB is removed from the global registry and the reference to its + * owner module (acquired during registration) is released. + * + * Context: This function ensures the session is quiesced (no active FDs + * being created) during the update. It is typically called from a + * subsystem's module exit function. + * Return: 0 on success. + * -EOPNOTSUPP if live update is disabled. + * -EBUSY if the live update session is active and cannot be quiesced. + * -ENOENT if the FLB was not found in the file handler's list. + */ +int liveupdate_unregister_flb(struct liveupdate_file_handler *fh, + struct liveupdate_flb *flb) +{ + struct luo_flb_private *private = luo_flb_get_private(flb); + struct list_head *flb_list = &ACCESS_PRIVATE(fh, flb_list); + struct luo_flb_link *iter; + int err = -ENOENT; + + if (!liveupdate_enabled()) + return -EOPNOTSUPP; + + /* + * Ensure the system is quiescent (no active sessions). + * This acts as a global lock for unregistration. 
+ */ + if (!luo_session_quiesce()) + return -EBUSY; + + /* Find and remove the link from the file handler's list */ + list_for_each_entry(iter, flb_list, list) { + if (iter->flb == flb) { + list_del(&iter->list); + kfree(iter); + err = 0; + break; + } + } + + if (err) + goto err_resume; + + private->users--; + /* + * If this is the last file-handler with which we are registered, remove + * from the global list, and release module reference. + */ + if (!private->users) { + list_del_init(&private->list); + luo_flb_global.count--; + module_put(flb->ops->owner); + } + + luo_session_resume(); + + return 0; + +err_resume: + luo_session_resume(); + return err; +} + +/** + * liveupdate_flb_get_incoming - Retrieve the incoming FLB object. + * @flb: The FLB definition. + * @objp: Output parameter; will be populated with the live shared object. + * + * Returns a pointer to its shared live object for the incoming (post-reboot) + * path. + * + * If this is the first time the object is requested in the new kernel, this + * function will trigger the FLB's .retrieve() callback to reconstruct the + * object from its preserved state. Subsequent calls will return the same + * cached object. + * + * Return: 0 on success, or a negative errno on failure. -ENODATA means no + * incoming FLB data, -ENOENT means specific flb not found in the incoming + * data, and -EOPNOTSUPP when live update is disabled or not configured. + */ +int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp) +{ + struct luo_flb_private *private = luo_flb_get_private(flb); + + if (!liveupdate_enabled()) + return -EOPNOTSUPP; + + if (!private->incoming.obj) { + int err = luo_flb_retrieve_one(flb); + + if (err) + return err; + } + + guard(mutex)(&private->incoming.lock); + *objp = private->incoming.obj; + + return 0; +} + +/** + * liveupdate_flb_get_outgoing - Retrieve the outgoing FLB object. + * @flb: The FLB definition. + * @objp: Output parameter; will be populated with the live shared object.
+ * + * Returns a pointer to its shared live object for the outgoing (pre-reboot) + * path. + * + * This function assumes the object has already been created by the FLB's + * .preserve() callback, which is triggered when the first dependent file + * is preserved. + * + * Return: 0 on success, or a negative errno on failure. + */ +int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp) +{ + struct luo_flb_private *private = luo_flb_get_private(flb); + + if (!liveupdate_enabled()) + return -EOPNOTSUPP; + + guard(mutex)(&private->outgoing.lock); + *objp = private->outgoing.obj; + + return 0; +} + +int __init luo_flb_setup_outgoing(void *fdt_out) +{ + struct luo_flb_header_ser *header_ser; + u64 header_ser_pa; + int err; + + header_ser = kho_alloc_preserve(LUO_FLB_PGCNT << PAGE_SHIFT); + if (IS_ERR(header_ser)) + return PTR_ERR(header_ser); + + header_ser_pa = virt_to_phys(header_ser); + + err = fdt_begin_node(fdt_out, LUO_FDT_FLB_NODE_NAME); + err |= fdt_property_string(fdt_out, "compatible", + LUO_FDT_FLB_COMPATIBLE); + err |= fdt_property(fdt_out, LUO_FDT_FLB_HEADER, &header_ser_pa, + sizeof(header_ser_pa)); + err |= fdt_end_node(fdt_out); + + if (err) + goto err_unpreserve; + + header_ser->pgcnt = LUO_FLB_PGCNT; + luo_flb_global.outgoing.header_ser = header_ser; + luo_flb_global.outgoing.ser = (void *)(header_ser + 1); + luo_flb_global.outgoing.active = true; + + return 0; + +err_unpreserve: + kho_unpreserve_free(header_ser); + + return err; +} + +int __init luo_flb_setup_incoming(void *fdt_in) +{ + struct luo_flb_header_ser *header_ser; + int err, header_size, offset; + const void *ptr; + u64 header_ser_pa; + + offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_FLB_NODE_NAME); + if (offset < 0) { + pr_err("Unable to get FLB node [%s]\n", LUO_FDT_FLB_NODE_NAME); + + return -ENOENT; + } + + err = fdt_node_check_compatible(fdt_in, offset, + LUO_FDT_FLB_COMPATIBLE); + if (err) { + pr_err("FLB node is incompatible with '%s' [%d]\n", + 
LUO_FDT_FLB_COMPATIBLE, err); + + return -EINVAL; + } + + header_size = 0; + ptr = fdt_getprop(fdt_in, offset, LUO_FDT_FLB_HEADER, &header_size); + if (!ptr || header_size != sizeof(u64)) { + pr_err("Unable to get FLB header property '%s' [%d]\n", + LUO_FDT_FLB_HEADER, header_size); + + return -EINVAL; + } + + header_ser_pa = get_unaligned((u64 *)ptr); + header_ser = phys_to_virt(header_ser_pa); + + luo_flb_global.incoming.header_ser = header_ser; + luo_flb_global.incoming.ser = (void *)(header_ser + 1); + luo_flb_global.incoming.active = true; + + return 0; +} + +/** + * luo_flb_serialize - Serializes all active FLB objects for KHO. + * + * This function is called from the reboot path. It iterates through all + * registered File-Lifecycle-Bound (FLB) objects. For each FLB that has been + * preserved (i.e., its reference count is greater than zero), it writes its + * metadata into the memory region designated for Kexec Handover. + * + * The serialized data includes the FLB's compatibility string, its opaque + * data handle, and the final reference count. This allows the new kernel to + * find the appropriate handler and reconstruct the FLB's state. + * + * Context: Called from liveupdate_reboot() just before kho_finalize(). 
+ */ +void luo_flb_serialize(void) +{ + struct luo_flb_header *fh = &luo_flb_global.outgoing; + struct liveupdate_flb *gflb; + int i = 0; + + list_private_for_each_entry(gflb, &luo_flb_global.list, private.list) { + struct luo_flb_private *private = luo_flb_get_private(gflb); + + if (private->outgoing.count > 0) { + strscpy(fh->ser[i].name, gflb->compatible, + sizeof(fh->ser[i].name)); + fh->ser[i].data = private->outgoing.data; + fh->ser[i].count = private->outgoing.count; + i++; + } + } + + fh->header_ser->count = i; +} diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index 3f1e0c94637e..99db13d99530 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -100,4 +100,11 @@ int luo_file_deserialize(struct luo_file_set *file_set, void luo_file_set_init(struct luo_file_set *file_set); void luo_file_set_destroy(struct luo_file_set *file_set); +int luo_flb_file_preserve(struct liveupdate_file_handler *fh); +void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh); +void luo_flb_file_finish(struct liveupdate_file_handler *fh); +int __init luo_flb_setup_outgoing(void *fdt); +int __init luo_flb_setup_incoming(void *fdt); +void luo_flb_serialize(void); + #endif /* _LINUX_LUO_INTERNAL_H */ -- cgit v1.2.3 From f653ff7af96951faa69c68665d44bed80702544f Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Thu, 18 Dec 2025 10:57:52 -0500 Subject: tests/liveupdate: add in-kernel liveupdate test Introduce an in-kernel test module to validate the core logic of the Live Update Orchestrator's File-Lifecycle-Bound feature. This provides a low-level, controlled environment to test FLB registration and callback invocation without requiring userspace interaction or actual kexec reboots. The test is enabled by the CONFIG_LIVEUPDATE_TEST Kconfig option. 
Link: https://lkml.kernel.org/r/20251218155752.3045808-6-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Cc: Alexander Graf Cc: David Gow Cc: David Matlack Cc: David Rientjes Cc: Jonathan Corbet Cc: Kees Cook Cc: Mike Rapoport Cc: Petr Mladek Cc: Pratyush Yadav Cc: Samiullah Khawaja Cc: Tamir Duberstein Signed-off-by: Andrew Morton --- MAINTAINERS | 1 + include/linux/kho/abi/luo.h | 5 ++ kernel/liveupdate/luo_file.c | 8 +- kernel/liveupdate/luo_internal.h | 8 ++ lib/Kconfig.debug | 23 ++++++ lib/tests/Makefile | 1 + lib/tests/liveupdate.c | 158 +++++++++++++++++++++++++++++++++++++++ 7 files changed, 203 insertions(+), 1 deletion(-) create mode 100644 lib/tests/liveupdate.c (limited to 'include/linux') diff --git a/MAINTAINERS b/MAINTAINERS index 92b377cd131b..a2a4cfd19fad 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14652,6 +14652,7 @@ F: include/linux/liveupdate.h F: include/linux/liveupdate/ F: include/uapi/linux/liveupdate.h F: kernel/liveupdate/ +F: lib/tests/liveupdate.c F: mm/memfd_luo.c F: tools/testing/selftests/liveupdate/ diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index a44010aafb5e..46750a0ddf88 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -239,4 +239,9 @@ struct luo_flb_ser { u64 count; } __packed; +/* Kernel Live Update Test ABI */ +#ifdef CONFIG_LIVEUPDATE_TEST +#define LIVEUPDATE_TEST_FLB_COMPATIBLE(i) "liveupdate-test-flb-v" #i +#endif + #endif /* _LINUX_KHO_ABI_LUO_H */ diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index cade273c50c9..35d2a8b1a0df 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -864,6 +864,8 @@ int liveupdate_register_file_handler(struct liveupdate_file_handler *fh) list_add_tail(&ACCESS_PRIVATE(fh, list), &luo_file_handler_list); luo_session_resume(); + liveupdate_test_register(fh); + return 0; err_resume: @@ -895,8 +897,10 @@ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) 
if (!liveupdate_enabled()) return -EOPNOTSUPP; + liveupdate_test_unregister(fh); + if (!luo_session_quiesce()) - return -EBUSY; + goto err_register; if (!list_empty(&ACCESS_PRIVATE(fh, flb_list))) goto err_resume; @@ -909,5 +913,7 @@ int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) err_resume: luo_session_resume(); +err_register: + liveupdate_test_register(fh); return err; } diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index 99db13d99530..8083d8739b09 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -107,4 +107,12 @@ int __init luo_flb_setup_outgoing(void *fdt); int __init luo_flb_setup_incoming(void *fdt); void luo_flb_serialize(void); +#ifdef CONFIG_LIVEUPDATE_TEST +void liveupdate_test_register(struct liveupdate_file_handler *fh); +void liveupdate_test_unregister(struct liveupdate_file_handler *fh); +#else +static inline void liveupdate_test_register(struct liveupdate_file_handler *fh) { } +static inline void liveupdate_test_unregister(struct liveupdate_file_handler *fh) { } +#endif + #endif /* _LINUX_LUO_INTERNAL_H */ diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 234b73f9baf7..ef201f1cc498 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2825,6 +2825,29 @@ config LINEAR_RANGES_TEST If unsure, say N. +config LIVEUPDATE_TEST + bool "Live Update Kernel Test" + default n + depends on LIVEUPDATE + help + Enable a built-in kernel test module for the Live Update + Orchestrator. + + This module validates the File-Lifecycle-Bound subsystem by + registering a set of mock FLB objects with any real file handlers + that support live update (such as the memfd handler). + + When live update operations are performed, this test module will + output messages to the kernel log (dmesg), confirming that its + registration and various callback functions (preserve, retrieve, + finish, etc.) are being invoked correctly. 
+ + This is a debugging and regression testing tool for developers + working on the Live Update subsystem. It should not be enabled in + production kernels. + + If unsure, say N + config CMDLINE_KUNIT_TEST tristate "KUnit test for cmdline API" if !KUNIT_ALL_TESTS depends on KUNIT diff --git a/lib/tests/Makefile b/lib/tests/Makefile index f740b0a26750..436b7b7a65f0 100644 --- a/lib/tests/Makefile +++ b/lib/tests/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_LIST_PRIVATE_KUNIT_TEST) += list-private-test.o obj-$(CONFIG_KFIFO_KUNIT_TEST) += kfifo_kunit.o obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o +obj-$(CONFIG_LIVEUPDATE_TEST) += liveupdate.o CFLAGS_longest_symbol_kunit.o += $(call cc-disable-warning, missing-prototypes) obj-$(CONFIG_LONGEST_SYM_KUNIT_TEST) += longest_symbol_kunit.o diff --git a/lib/tests/liveupdate.c b/lib/tests/liveupdate.c new file mode 100644 index 000000000000..496d6ef91a30 --- /dev/null +++ b/lib/tests/liveupdate.c @@ -0,0 +1,158 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Copyright (c) 2025, Google LLC. 
+ * Pasha Tatashin + */ + +#define pr_fmt(fmt) KBUILD_MODNAME " test: " fmt + +#include +#include +#include +#include +#include +#include "../../kernel/liveupdate/luo_internal.h" + +static const struct liveupdate_flb_ops test_flb_ops; +#define DEFINE_TEST_FLB(i) { \ + .ops = &test_flb_ops, \ + .compatible = LIVEUPDATE_TEST_FLB_COMPATIBLE(i), \ +} + +/* Number of Test FLBs to register with every file handler */ +#define TEST_NFLBS 3 +static struct liveupdate_flb test_flbs[TEST_NFLBS] = { + DEFINE_TEST_FLB(0), + DEFINE_TEST_FLB(1), + DEFINE_TEST_FLB(2), +}; + +#define TEST_FLB_MAGIC_BASE 0xFEEDF00DCAFEBEE0ULL + +static int test_flb_preserve(struct liveupdate_flb_op_args *argp) +{ + ptrdiff_t index = argp->flb - test_flbs; + + pr_info("%s: preserve was triggered\n", argp->flb->compatible); + argp->data = TEST_FLB_MAGIC_BASE + index; + + return 0; +} + +static void test_flb_unpreserve(struct liveupdate_flb_op_args *argp) +{ + pr_info("%s: unpreserve was triggered\n", argp->flb->compatible); +} + +static int test_flb_retrieve(struct liveupdate_flb_op_args *argp) +{ + ptrdiff_t index = argp->flb - test_flbs; + u64 expected_data = TEST_FLB_MAGIC_BASE + index; + + if (argp->data == expected_data) { + pr_info("%s: found flb data from the previous boot\n", + argp->flb->compatible); + argp->obj = (void *)argp->data; + } else { + pr_err("%s: ERROR - incorrect data handle: %llx, expected %llx\n", + argp->flb->compatible, argp->data, expected_data); + return -EINVAL; + } + + return 0; +} + +static void test_flb_finish(struct liveupdate_flb_op_args *argp) +{ + ptrdiff_t index = argp->flb - test_flbs; + void *expected_obj = (void *)(TEST_FLB_MAGIC_BASE + index); + + if (argp->obj == expected_obj) { + pr_info("%s: finish was triggered\n", argp->flb->compatible); + } else { + pr_err("%s: ERROR - finish called with invalid object\n", + argp->flb->compatible); + } +} + +static const struct liveupdate_flb_ops test_flb_ops = { + .preserve = test_flb_preserve, + .unpreserve = 
test_flb_unpreserve, + .retrieve = test_flb_retrieve, + .finish = test_flb_finish, + .owner = THIS_MODULE, +}; + +static void liveupdate_test_init(void) +{ + static DEFINE_MUTEX(init_lock); + static bool initialized; + int i; + + guard(mutex)(&init_lock); + + if (initialized) + return; + + for (i = 0; i < TEST_NFLBS; i++) { + struct liveupdate_flb *flb = &test_flbs[i]; + void *obj; + int err; + + err = liveupdate_flb_get_incoming(flb, &obj); + if (err && err != -ENODATA && err != -ENOENT) { + pr_err("liveupdate_flb_get_incoming for %s failed: %pe\n", + flb->compatible, ERR_PTR(err)); + } + } + initialized = true; +} + +void liveupdate_test_register(struct liveupdate_file_handler *fh) +{ + int err, i; + + liveupdate_test_init(); + + for (i = 0; i < TEST_NFLBS; i++) { + struct liveupdate_flb *flb = &test_flbs[i]; + + err = liveupdate_register_flb(fh, flb); + if (err) { + pr_err("Failed to register %s %pe\n", + flb->compatible, ERR_PTR(err)); + } + } + + err = liveupdate_register_flb(fh, &test_flbs[0]); + if (!err || err != -EEXIST) { + pr_err("Failed: %s should be already registered, but got err: %pe\n", + test_flbs[0].compatible, ERR_PTR(err)); + } + + pr_info("Registered %d FLBs with file handler: [%s]\n", + TEST_NFLBS, fh->compatible); +} + +void liveupdate_test_unregister(struct liveupdate_file_handler *fh) +{ + int err, i; + + for (i = 0; i < TEST_NFLBS; i++) { + struct liveupdate_flb *flb = &test_flbs[i]; + + err = liveupdate_unregister_flb(fh, flb); + if (err) { + pr_err("Failed to unregister %s %pe\n", + flb->compatible, ERR_PTR(err)); + } + } + + pr_info("Unregistered %d FLBs from file handler: [%s]\n", + TEST_NFLBS, fh->compatible); +} + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Pasha Tatashin "); +MODULE_DESCRIPTION("In-kernel test for LUO mechanism"); -- cgit v1.2.3 From 9dc052234da736f7749f19ab6936342ec7dbe3ac Mon Sep 17 00:00:00 2001 From: Alan Maguire Date: Fri, 16 Jan 2026 09:17:30 +0000 Subject: kcsan, compiler_types: avoid duplicate type issues in BPF 
Type Format

Enabling KCSAN is causing a large number of duplicate types in BTF for
core kernel structs like task_struct [1]. This is due to the definition
in include/linux/compiler_types.h

`#ifdef __SANITIZE_THREAD__
...
`#define __data_racy volatile
..
`#else
...
`#define __data_racy
...
`#endif

Because some objects in the kernel are compiled without KCSAN flags
(KCSAN_SANITIZE) we sometimes get the empty __data_racy annotation for
objects; as a result we get multiple conflicting representations of the
associated structs in DWARF, and these lead to multiple instances of core
kernel types in BTF since they cannot be deduplicated due to the
additional modifier in some instances.

Moving the __data_racy definition under CONFIG_KCSAN avoids this problem,
since the volatile modifier will be present for both KCSAN and
KCSAN_SANITIZE objects in a CONFIG_KCSAN=y kernel.

Link: https://lkml.kernel.org/r/20260116091730.324322-1-alan.maguire@oracle.com
Fixes: 31f605a308e6 ("kcsan, compiler_types: Introduce __data_racy type qualifier")
Signed-off-by: Alan Maguire
Reported-by: Nilay Shroff
Tested-by: Nilay Shroff
Suggested-by: Marco Elver
Reviewed-by: Marco Elver
Acked-by: Yonghong Song
Cc: Alexei Starovoitov
Cc: Andrii Nakryiko
Cc: Bart van Assche
Cc: Daniel Borkman
Cc: Eduard Zingerman
Cc: Hao Luo
Cc: Heiko Carstens
Cc: "H. Peter Anvin"
Cc: Jason A. Donenfeld
Cc: Jiri Olsa
Cc: John Fastabend
Cc: Kees Cook
Cc: KP Singh
Cc: Martin KaFai Lau
Cc: Miguel Ojeda
Cc: Naman Jain
Cc: Nathan Chancellor
Cc: "Paul E . McKenney"
Cc: Peter Zijlstra
Cc: Stanislav Fomichev
Cc: Uros Bizjak
Cc:
Signed-off-by: Andrew Morton
---
 include/linux/compiler_types.h | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index d3318a3c2577..86111a189a87 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -303,6 +303,22 @@ struct ftrace_likely_data {
 # define __no_kasan_or_inline __always_inline
 #endif
 
+#ifdef CONFIG_KCSAN
+/*
+ * Type qualifier to mark variables where all data-racy accesses should be
+ * ignored by KCSAN. Note, the implementation simply marks these variables as
+ * volatile, since KCSAN will treat such accesses as "marked".
+ *
+ * Defined here because defining __data_racy as volatile for KCSAN objects only
+ * causes problems in BPF Type Format (BTF) generation since struct members
+ * of core kernel data structs will be volatile in some objects and not in
+ * others. Instead define it globally for KCSAN kernels.
+ */
+# define __data_racy volatile
+#else
+# define __data_racy
+#endif
+
 #ifdef __SANITIZE_THREAD__
 /*
  * Clang still emits instrumentation for __tsan_func_{entry,exit}() and builtin
@@ -314,16 +330,9 @@ struct ftrace_likely_data {
  * disable all instrumentation. See Kconfig.kcsan where this is mandatory.
  */
 # define __no_kcsan __no_sanitize_thread __disable_sanitizer_instrumentation
-/*
- * Type qualifier to mark variables where all data-racy accesses should be
- * ignored by KCSAN. Note, the implementation simply marks these variables as
- * volatile, since KCSAN will treat such accesses as "marked".
- */
-# define __data_racy volatile
 # define __no_sanitize_or_inline __no_kcsan notrace __maybe_unused
 #else
 # define __no_kcsan
-# define __data_racy
 #endif
 
 #ifdef __SANITIZE_MEMORY__
--
cgit v1.2.3
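
The type mismatch described in the commit message above can be reproduced in plain user-space C. The following is only a rough sketch, not kernel code: `RACY_SANITIZED`, `RACY_UNSANITIZED`, the two `struct task_*` types and the helper functions are hypothetical names standing in for the two pre-patch expansions of `__data_racy` and for a core kernel struct. It assumes a C11 compiler (for `_Generic`).

```c
/*
 * Illustrative user-space sketch (assumption: C11 compiler). All names here
 * are hypothetical; none of them exist in the kernel tree.
 */
#include <assert.h>

/* Expansion seen by an object built with the thread sanitizer (pre-patch). */
#define RACY_SANITIZED volatile
/* Expansion seen by an object built without the sanitizer (pre-patch). */
#define RACY_UNSANITIZED

/* The "same" struct as two differently compiled objects would describe it. */
struct task_sanitized   { RACY_SANITIZED int state; };
struct task_unsanitized { RACY_UNSANITIZED int state; };

/* Yields 1 if the member is declared volatile int, 0 if it is plain int. */
#define MEMBER_IS_VOLATILE(member) \
	_Generic(&(member), volatile int *: 1, int *: 0)

int sanitized_member_is_volatile(void)
{
	struct task_sanitized t = { 0 };

	return MEMBER_IS_VOLATILE(t.state);
}

int unsanitized_member_is_volatile(void)
{
	struct task_unsanitized t = { 0 };

	return MEMBER_IS_VOLATILE(t.state);
}

/* The two descriptions disagree on the member type, so they cannot be merged. */
void demo_checks(void)
{
	assert(sanitized_member_is_volatile() != unsanitized_member_is_volatile());
}
```

Calling `demo_checks()` from a `main()` should pass silently. The member types differ only by the `volatile` qualifier, which is the kind of difference the commit message says prevents deduplication. With the definition keyed on `CONFIG_KCSAN` instead, every object in a given kernel build would see the same expansion.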