x86/mm: Enable broadcast TLB invalidation for multi-threaded processes - kernel/linux.git

diff options

author	Rik van Riel <riel@surriel.com>	2025-02-26 06:00:45 +0300
committer	Ingo Molnar <mingo@kernel.org>	2025-03-19 13:12:29 +0300
commit	4afeb0ed1753ebcad93ee3b45427ce85e9c8ec40 (patch)
tree	1d68278b950a636707c4bf0ca449f76795a37d9c /tools/perf/scripts/python/export-to-sqlite.py
parent	c9826613a9cbbf889b1fec6b7b943fe88ec2c212 (diff)
download	linux-4afeb0ed1753ebcad93ee3b45427ce85e9c8ec40.tar.xz

x86/mm: Enable broadcast TLB invalidation for multi-threaded processes

There is not enough room in the 12-bit ASID address space to hand out broadcast ASIDs to every process. Only hand out broadcast ASIDs to processes when they are observed to be simultaneously running on 4 or more CPUs. This also allows single threaded process to continue using the cheaper, local TLB invalidation instructions like INVLPGB. Due to the structure of flush_tlb_mm_range(), the INVLPGB flushing is done in a generically named broadcast_tlb_flush() function which can later also be used for Intel RAR. Combined with the removal of unnecessary lru_add_drain calls() (see https://lore.kernel.org/r/20241219153253.3da9e8aa@fangorn) this results in a nice performance boost for the will-it-scale tlb_flush2_threads test on an AMD Milan system with 36 cores: - vanilla kernel: 527k loops/second - lru_add_drain removal: 731k loops/second - only INVLPGB: 527k loops/second - lru_add_drain + INVLPGB: 1157k loops/second Profiling with only the INVLPGB changes showed while TLB invalidation went down from 40% of the total CPU time to only around 4% of CPU time, the contention simply moved to the LRU lock. Fixing both at the same time about doubles the number of iterations per second from this case. Comparing will-it-scale tlb_flush2_threads with several different numbers of threads on a 72 CPU AMD Milan shows similar results. The number represents the total number of loops per second across all the threads: threads tip INVLPGB 1 315k 304k 2 423k 424k 4 644k 1032k 8 652k 1267k 16 737k 1368k 32 759k 1199k 64 636k 1094k 72 609k 993k 1 and 2 thread performance is similar with and without INVLPGB, because INVLPGB is only used on processes using 4 or more CPUs simultaneously. The number is the median across 5 runs. Some numbers closer to real world performance can be found at Phoronix, thanks to Michael: https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits [ bp: - Massage - :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi - :%s/\<clear_asid_transition\>/mm_clear_asid_transition/cgi - Fold in a 0day bot fix: https://lore.kernel.org/oe-kbuild-all/202503040000.GtiWUsBm-lkp@intel.com ] Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nadav Amit <nadav.amit@gmail.com> Link: https://lore.kernel.org/r/20250226030129.530345-11-riel@surriel.com

Diffstat (limited to 'tools/perf/scripts/python/export-to-sqlite.py')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: