kernel/linux.git/Documentation/features/vm/TLB, branch v6.18.21

riscv: Add support for BATCHED_UNMAP_TLB_FLUSH

2024-01-11T16:01:53+00:00

Allow to defer the flushing of the TLB when unmapping pages, which allows to reduce the numbers of IPI and the number of sfence.vma. The ubenchmarch used in commit 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration") that was multithreaded to force the usage of IPI shows good performance improvement on all platforms: * Unmatched: ~34% * TH1520 : ~78% * Qemu : ~81% In addition, perf on qemu reports an important decrease in time spent dealing with IPIs: Before: 68.17% main [kernel.kallsyms] [k] __sbi_rfence_v02_call After : 8.64% main [kernel.kallsyms] [k] __sbi_rfence_v02_call * Benchmark: int stick_this_thread_to_core(int core_id) { int num_cores = sysconf(_SC_NPROCESSORS_ONLN); if (core_id < 0 || core_id >= num_cores) return EINVAL; cpu_set_t cpuset; CPU_ZERO(&cpuset); CPU_SET(core_id, &cpuset); pthread_t current_thread = pthread_self(); return pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset); } static void *fn_thread (void *p_data) { int ret; pthread_t thread; stick_this_thread_to_core((int)p_data); while (1) { sleep(1); } return NULL; } int main() { volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); pthread_t threads[4]; int ret; for (int i = 0; i < 4; ++i) { ret = pthread_create(&threads[i], NULL, fn_thread, (void *)i); if (ret) { printf("%s", strerror (ret)); } } memset(p, 0x88, SIZE); for (int k = 0; k < 10000; k++) { /* swap in */ for (int i = 0; i < SIZE; i += 4096) { (void)p[i]; } /* swap out */ madvise(p, SIZE, MADV_PAGEOUT); } for (int i = 0; i < 4; i++) { pthread_cancel(threads[i]); } for (int i = 0; i < 4; i++) { pthread_join(threads[i], NULL); } return 0; } Signed-off-by: Alexandre Ghiti Reviewed-by: Jisheng Zhang Tested-by: Jisheng Zhang # Tested on TH1520 Tested-by: Nam Cao Link: https://lore.kernel.org/r/20240108193640.344929-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt

Documentation: Drop IA64 from feature descriptions

2023-09-11T08:13:18+00:00

Itanium (IA64) is going away, so drop it from the kernel feature documentation. Signed-off-by: Ard Biesheuvel

arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-08-18T17:12:37+00:00

On x86, batched and deferred tlb shootdown has lead to 90% performance increase on tlb shootdown. on arm64, HW can do tlb shootdown without software IPI. But sync tlbi is still quite expensive. Even running a simplest program which requires swapout can prove this is true, #include #include #include #include int main() { #define SIZE (1 * 1024 * 1024) volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); memset(p, 0x88, SIZE); for (int k = 0; k < 10000; k++) { /* swap in */ for (int i = 0; i < SIZE; i += 4096) { (void)p[i]; } /* swap out */ madvise(p, SIZE, MADV_PAGEOUT); } } Perf result on snapdragon 888 with 8 cores by using zRAM as the swap block device. ~ # perf record taskset -c 4 ./a.out [ perf record: Woken up 10 times to write data ] [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] ~ # perf report # To display the perf.data header info, please use --header/--header-only options. # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 60K of event 'cycles' # Event count (approx.): 35706225414 # # Overhead Command Shared Object Symbol # ........ ....... ................. ...... # 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock 3.49% a.out [kernel.kallsyms] [k] memset64 1.63% a.out [kernel.kallsyms] [k] clear_page 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930 1.23% a.out [kernel.kallsyms] [k] xas_load 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock ptep_clear_flush() takes 5.36% CPU in the micro-benchmark swapping in/out a page mapped by only one process. If the page is mapped by multiple processes, typically, like more than 100 on a phone, the overhead would be much higher as we have to run tlb flush 100 times for one single page. Plus, tlb flush overhead will increase with the number of CPU cores due to the bad scalability of tlb shootdown in HW, so those ARM64 servers should expect much higher overhead. Further perf annonate shows 95% cpu time of ptep_clear_flush is actually used by the final dsb() to wait for the completion of tlb flush. This provides us a very good chance to leverage the existing batched tlb in kernel. The minimum modification is that we only send async tlbi in the first stage and we send dsb while we have to sync in the second stage. With the above simplest micro benchmark, collapsed time to finish the program decreases around 5%. Typical collapsed time w/o patch: ~ # time taskset -c 4 ./a.out 0.21user 14.34system 0:14.69elapsed w/ patch: ~ # time taskset -c 4 ./a.out 0.22user 13.45system 0:13.80elapsed Also tested with benchmark in the commit on Kunpeng920 arm64 server and observed an improvement around 12.5% with command `time ./swap_bench`. w/o w/ real 0m13.460s 0m11.771s user 0m0.248s 0m0.279s sys 0m12.039s 0m11.458s Originally it's noticed a 16.99% overhead of ptep_clear_flush() which has been eliminated by this patch: [root@localhost yang]# perf record -- ./swap_bench && perf report [...] 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush It is tested on 4,8,128 CPU platforms and shows to be beneficial on large systems but may not have improvement on small systems like on a 4 CPU platform. Also this patch improve the performance of page migration. Using pmbench and tries to migrate the pages of pmbench between node 0 and node 1 for 100 times for 1G memory, this patch decrease the time used around 20% (prev 18.338318910 sec after 13.981866350 sec) and saved the time used by ptep_clear_flush(). Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com Tested-by: Yicong Yang Tested-by: Xin Hao Tested-by: Punit Agrawal Signed-off-by: Barry Song Signed-off-by: Yicong Yang Reviewed-by: Kefeng Wang Reviewed-by: Xin Hao Reviewed-by: Anshuman Khandual Reviewed-by: Catalin Marinas Cc: Anshuman Khandual Cc: Jonathan Corbet Cc: Nadav Amit Cc: Mel Gorman Cc: Anshuman Khandual Cc: Arnd Bergmann Cc: Barry Song Cc: Darren Hart Cc: Jonathan Cameron Cc: lipeifeng Cc: Mark Rutland Cc: Peter Zijlstra Cc: Ryan Roberts Cc: Steven Miao Cc: Will Deacon Cc: Zeng Tao Signed-off-by: Andrew Morton

Documentation/features: Use loongarch instead of loong

2022-12-05T09:50:12+00:00

The official arch name is LoongArch [1], we should use small letter loongarch instead of loong in Documentation/features, just use the features-refresh.sh to refresh all the related files. [1] https://www.kernel.org/doc/html/latest/loongarch/index.html Fixes: 5860800e8696 ("Documentation/features: Update the arch support status files") Signed-off-by: Tiezhu Yang Link: https://lore.kernel.org/r/1670156327-9631-3-git-send-email-yangtiezhu@loongson.cn Signed-off-by: Jonathan Corbet

Documentation/features: Update the arch support status files

2022-06-09T15:35:57+00:00

The arch support status files don't match reality as of v5.19-rc1, use the features-refresh.sh to refresh all the arch-support.txt files in place. The main effect is to add entries for the new loong architecture. Signed-off-by: Zheng Zengkai Link: https://lore.kernel.org/r/20220609025656.143460-1-zhengzengkai@huawei.com Signed-off-by: Jonathan Corbet

Merge branch 'remove-h8300' of git://git.infradead.org/users/hch/misc into asm-generic

2022-04-04T12:42:49+00:00

* 'remove-h8300' of git://git.infradead.org/users/hch/misc: remove the h8300 architecture This is clearly the least actively maintained architecture we have at the moment, and probably the least useful. It is now the only one that does not support MMUs at all, and most of the boards only support 4MB of RAM, out of which the defconfig kernel needs more than half just for .text/.data. Guenter Roeck did the original patch to remove the architecture in 2013 after it had already been obsolete for a while, and Yoshinori Sato brought it back in a much more modern form in 2015. Looking at the git history since the reinstantiation, it's clear that almost all commits in the tree are build fixes or cross-architecture cleanups: $ git log --no-merges --format=%an v4.5.. arch/h8300/ | sort | uniq -c | sort -rn | head -n 12 25 Masahiro Yamada 18 Christoph Hellwig 14 Mike Rapoport 9 Arnd Bergmann 8 Mark Rutland 7 Peter Zijlstra 6 Kees Cook 6 Ingo Molnar 6 Al Viro 5 Randy Dunlap 4 Yury Norov Signed-off-by: Arnd Bergmann

nds32: Remove the architecture

2022-03-07T12:54:59+00:00

The nds32 architecture, also known as AndeStar V3, is a custom 32-bit RISC target designed by Andes Technologies. Support was added to the kernel in 2016 as the replacement RISC-V based V5 processors were already announced, and maintained by (current or former) Andes employees. As explained by Alan Kao, new customers are now all using RISC-V, and all known nds32 users are already on longterm stable kernels provided by Andes, with no development work going into mainline support any more. While the port is still in a reasonably good shape, it only gets worse over time without active maintainers, so it seems best to remove it before it becomes unusable. As always, if it turns out that there are mainline users after all, and they volunteer to maintain the port in the future, the removal can be reverted. Link: https://lore.kernel.org/linux-mm/YhdWNLUhk+x9RAzU@yamatobi.andestech.com/ Link: https://lore.kernel.org/lkml/20220302065213.82702-1-alankao@andestech.com/ Link: https://www.andestech.com/en/products-solutions/andestar-architecture/ Signed-off-by: Alan Kao [arnd: rewrite changelog to provide more background] Signed-off-by: Arnd Bergmann

remove the h8300 architecture

2022-02-23T07:52:50+00:00

Signed-off-by: Christoph Hellwig

Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64

2021-03-15T19:17:40+00:00

BATCHED_UNMAP_TLB_FLUSH is used on x86 to do batched tlb shootdown by sending one IPI to TLB flush all entries after unmapping pages rather than sending an IPI to flush each individual entry. On arm64, tlb shootdown is done by hardware. Flush instructions are innershareable. The local flushes are limited to the boot (1 per CPU) and when a task is getting a new ASID. So marking this feature as "TODO" is not proper. ".." isn't good as well. So this patch adds a "N/A" for this kind of features which are not needed on some architectures. Signed-off-by: Barry Song Acked-by: Will Deacon Cc: Mel Gorman Cc: Andy Lutomirski Cc: Catalin Marinas Cc: Will Deacon Link: https://lore.kernel.org/r/20210223003230.11976-1-song.bao.hua@hisilicon.com Signed-off-by: Jonathan Corbet

Documentation: features: remove c6x references

2021-02-25T18:25:57+00:00

The references to arch/c6x are obsolete now that the architecture is gone. Remove them. Signed-off-by: Arnd Bergmann Link: https://lore.kernel.org/r/20210225142841.3385428-1-arnd@kernel.org Signed-off-by: Jonathan Corbet