linux

mirror of synced 2025-03-06 20:59:54 +01:00

Linux kernel source tree


Mathieu Desnoyers 7e019dcc47 sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads	2024-10-14 12:52:40 +02:00

Commit `223baf9d17` ("sched: Fix performance regression introduced by
mm_cid") introduced a per-mm/cpu current concurrency id (mm_cid), which
keeps a reference to the concurrency id allocated for each CPU. This
reference expires shortly after a 100ms delay.

These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.

However, intermittent workloads behaving in bursts spaced by more than
100ms on each CPU exhibit bad cache locality and degraded performance
compared to purely per-cpu data indexing, because concurrency IDs are
allocated over various CPUs and cores, therefore losing cache locality
of the associated data.

Introduce the following changes to improve per-mm-cid cache locality:

- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
  track of which mm_cid value was last used, and use it as a hint to
  attempt re-allocating the same concurrency ID the next time this
  mm/cpu needs to allocate a concurrency ID,

- Add a per-mm CPUs allowed mask, which keeps track of the union of
  CPUs allowed for all threads belonging to this mm. This cpumask is
  only set during the lifetime of the mm, never cleared, so it
  represents the union of all the CPUs allowed since the beginning of
  the mm lifetime (note that the mm_cpumask() is really arch-specific
  and tailored to the TLB flush needs, and is thus _not_ a viable
  approach for this),

- Add a per-mm nr_cpus_allowed to keep track of the weight of the
  per-mm CPUs allowed mask (for fast access),

- Add a per-mm max_nr_cid to keep track of the highest number of
  concurrency IDs allocated for the mm. This is used for expanding the
  concurrency ID allocation within the upper bound defined by:

      min(mm->nr_cpus_allowed, mm->mm_users)

  When the next unused CID value reaches this threshold, stop trying
  to expand the cid allocation and use the first available cid value
  instead.

  Spreading allocation to use all the cid values within the range

      [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]

  improves cache locality while preserving mm_cid compactness within
  the expected user limits,

- In __mm_cid_try_get, only return cid values within the range
  [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
  prevents allocating cids above the number of allowed cpus in rare
  scenarios where cid allocation races with a concurrent remote-clear
  of the per-mm/cpu cid. This improvement is made possible by the
  addition of the per-mm CPUs allowed mask,

- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
  t->nr_cpus_allowed. This criterion was really meant to compare the
  number of mm->mm_users to the number of CPUs allowed for the entire
  mm. Therefore, the prior comparison worked fine when all threads
  shared the same CPUs allowed mask, but not so much in scenarios where
  those threads have different masks (e.g. each thread pinned to a
  single CPU). This improvement is made possible by the addition of
  the per-mm CPUs allowed mask.

* Benchmarks

Each thread increments 16kB worth of 8-bit integers in bursts, with
a configurable delay between each thread's execution. Threads run
one after the other (no threads run concurrently). The order of
thread execution in the sequence is random. The thread execution
sequence begins again after all threads have executed. The 16kB areas
are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
(not cache-local), or cache-local mm_cid. Each thread is pinned to its
own core.

Testing configurations:

    8-core/1-L3:        Use 8 cores within a single L3
    24-core/24-L3:      Use 24 cores, 1 core per L3
    192-core/24-L3:     Use 192 cores (all cores in the system)
    384-thread/24-L3:   Use 384 HW threads (all HW threads in the system)

Intermittent workload delays between threads: 200ms, 10ms.

Hardware:

    CPU(s):                   384
    On-line CPU(s) list:      0-383
    Vendor ID:                AuthenticAMD
    Model name:               AMD EPYC 9654 96-Core Processor
    Thread(s) per core:       2
    Core(s) per socket:       96
    Socket(s):                2
    Caches (sum of all):
      L1d:                    6 MiB (192 instances)
      L1i:                    6 MiB (192 instances)
      L2:                     192 MiB (192 instances)
      L3:                     768 MiB (24 instances)

Each result is an average of 5 test runs. The cache-local speedup is
calculated as: (mm_cid) / (cache-local mm_cid).

Intermittent workload delay: 200ms

                     per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
                        (ns)       (ns)                  (ns)
    8-core/1-L3         1374      19289                  1336                  14.4x
    24-core/24-L3       2423      26721                  1594                  16.7x
    192-core/24-L3      2291      15826                  2153                   7.3x
    384-thread/24-L3    1874      13234                  1907                   6.9x

Intermittent workload delay: 10ms

                     per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
                        (ns)       (ns)                  (ns)
    8-core/1-L3          662        756                   686                   1.1x
    24-core/24-L3       1378       3648                  1035                   3.5x
    192-core/24-L3      1439      10833                  1482                   7.3x
    384-thread/24-L3    1503      10570                  1556                   6.8x

[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency
  IDs" patch series with a simpler and more general approach. ]

[ This patch applies on top of v6.12-rc1. ]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/
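
The allocation order described in the commit message above (reuse the recent cid, else expand up to min(nr_cpus_allowed, mm_users), else take the first available cid) can be illustrated with a small user-space sketch. This is not the kernel's __mm_cid_try_get: it is single-threaded, uses a plain 64-bit bitmap instead of a cpumask and atomics, and every name in it (struct mm_sketch, mm_sketch_init, try_claim, cid_get, cid_put, SKETCH_MAX_CPUS, CID_UNSET) is hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_CPUS 64   /* assumption: at most 64 cpus, one bit each */
#define CID_UNSET (-1)

/* Hypothetical, simplified stand-in for the per-mm state named in the commit. */
struct mm_sketch {
	int nr_cpus_allowed;             /* weight of the per-mm CPUs allowed mask */
	int mm_users;                    /* number of users (threads) of the mm */
	int max_nr_cid;                  /* highest number of cids handed out so far */
	uint64_t cid_in_use;             /* bit n set: cid n is currently allocated */
	int recent_cid[SKETCH_MAX_CPUS]; /* per-cpu hint: cid last used on that cpu */
};

static void mm_sketch_init(struct mm_sketch *mm, int nr_cpus_allowed, int mm_users)
{
	assert(nr_cpus_allowed > 0 && nr_cpus_allowed <= SKETCH_MAX_CPUS);
	mm->nr_cpus_allowed = nr_cpus_allowed;
	mm->mm_users = mm_users;
	mm->max_nr_cid = 0;
	mm->cid_in_use = 0;
	for (int cpu = 0; cpu < SKETCH_MAX_CPUS; cpu++)
		mm->recent_cid[cpu] = CID_UNSET;
}

/* Mark cid as allocated if it is free; return 1 on success, 0 otherwise. */
static int try_claim(struct mm_sketch *mm, int cid)
{
	uint64_t bit = UINT64_C(1) << cid;

	if (mm->cid_in_use & bit)
		return 0;
	mm->cid_in_use |= bit;
	return 1;
}

/* Allocate a cid for a thread starting to run on the given cpu. */
static int cid_get(struct mm_sketch *mm, int cpu)
{
	int limit = mm->nr_cpus_allowed < mm->mm_users ?
		    mm->nr_cpus_allowed : mm->mm_users;
	int cid = mm->recent_cid[cpu];

	/* 1. Prefer the cid this cpu used last time (cache-locality hint). */
	if (cid != CID_UNSET && cid < mm->nr_cpus_allowed && try_claim(mm, cid))
		goto out;
	/* 2. Expand the allocation while below min(nr_cpus_allowed, mm_users). */
	while (mm->max_nr_cid < limit) {
		cid = mm->max_nr_cid++;
		if (try_claim(mm, cid))
			goto out;
	}
	/* 3. Otherwise fall back to the first available cid. */
	for (cid = 0; cid < mm->nr_cpus_allowed; cid++)
		if (try_claim(mm, cid))
			goto out;
	return CID_UNSET;
out:
	mm->recent_cid[cpu] = cid;
	return cid;
}

/* Release a cid when the thread stops running. */
static void cid_put(struct mm_sketch *mm, int cid)
{
	if (cid >= 0)
		mm->cid_in_use &= ~(UINT64_C(1) << cid);
}
```

With four allowed CPUs and four users, calling cid_get() and then cid_put() on CPUs 0 to 3 in turn hands out cids 0 to 3 even though no two threads ever run concurrently, and a later burst on CPU 1 gets cid 1 back. A pure first-available policy would hand cid 0 to every CPU in that scenario, which is the loss of locality the commit describes for bursts spaced more than 100ms apart.
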
arch	ARM64:	2024-10-06 10:53:28 -07:00
block	block-6.12-20241004	2024-10-04 10:43:44 -07:00
certs	sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3	2024-09-20 19:52:48 +03:00
crypto	move asm/unaligned.h to linux/unaligned.h	2024-10-02 17:23:23 -04:00
Documentation	platform-drivers-x86 for v6.12-2	2024-10-06 11:11:01 -07:00
drivers	platform-drivers-x86 for v6.12-2	2024-10-06 11:11:01 -07:00
fs	sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads	2024-10-14 12:52:40 +02:00
include	sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads	2024-10-14 12:52:40 +02:00
init	cfi: encode cfi normalized integers + kasan/gcov bug in Kconfig	2024-09-26 21:27:27 +02:00
io_uring	io_uring/net: harden multishot termination case for recv	2024-09-30 08:26:59 -06:00
ipc	struct fd layout change (and conversion to accessor helpers)	2024-09-23 09:35:36 -07:00
kernel	sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads	2024-10-14 12:52:40 +02:00
lib	slab fixes for 6.12-rc1	2024-10-04 12:05:39 -07:00
LICENSES	LICENSES: add 0BSD license text	2024-09-01 20:43:24 -07:00
mm	mm, slab: suppress warnings in test_leak_destroy kunit test	2024-10-02 16:28:46 +02:00
net	Including fixes from ieee802154, bluetooth and netfilter.	2024-10-03 09:44:00 -07:00
rust	rust: kunit: use C-string literals to clean warning	2024-10-01 23:46:42 +02:00
samples	[tree-wide] finally take no_llseek out	2024-09-27 08:18:43 -07:00
scripts	kbuild: deb-pkg: Remove blank first line from maint scripts	2024-10-07 02:36:38 +09:00
security	hardening fixes for v6.12-rc2	2024-10-05 10:19:14 -07:00
sound	sound fixes for 6.12-rc2	2024-10-04 11:29:46 -07:00
tools	ARM64:	2024-10-06 10:53:28 -07:00
usr	initramfs: shorten cmd_initfs in usr/Makefile	2024-07-16 01:07:52 +09:00
virt	sched/fair: Fix external p->on_rq users	2024-10-14 09:14:35 +02:00
.clang-format	clang-format: Update with v6.11-rc1's `for_each` macro list	2024-08-02 13:20:31 +02:00
.cocciconfig	scripts: add Linux .cocciconfig for coccinelle	2016-07-22 12:13:39 +02:00
.editorconfig	.editorconfig: remove trim_trailing_whitespace option	2024-06-13 16:47:52 +02:00
.get_maintainer.ignore	Add Jeff Kirsher to .get_maintainer.ignore	2024-03-08 11:36:54 +00:00
.gitattributes	.gitattributes: set diff driver for Rust source code files	2023-05-31 17:48:25 +02:00
.gitignore	Kbuild updates for v6.12	2024-09-24 13:02:06 -07:00
.mailmap	Summary	2024-09-24 11:08:40 -07:00
.rustfmt.toml	rust: add `.rustfmt.toml`	2022-09-28 09:02:20 +02:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	MAINTAINERS: Mark powerpc spufs as orphaned	2024-08-19 21:27:56 +10:00
Kbuild	Kbuild updates for v6.1	2022-10-10 12:00:45 -07:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	platform-drivers-x86 for v6.12-2	2024-10-06 11:11:01 -07:00
Makefile	Linux 6.12-rc2	2024-10-06 15:32:27 -07:00
README	README: Fix spelling	2024-03-18 03:36:32 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.