summaryrefslogtreecommitdiff
path: root/Documentation/scheduler
diff options
context:
space:
mode:
authorHuang Ying <ying.huang@intel.com>2022-02-10 08:25:14 +0300
committerPeter Zijlstra <peterz@infradead.org>2022-02-12 01:30:08 +0300
commit3624ba7b5e2acc02b01301ea5fd3534971eb9896 (patch)
tree1582433b0680361d724cfc81ba795fe2da31fd84 /Documentation/scheduler
parente496132ebedd870b67f1f6d2428f9bb9d7ae27fd (diff)
downloadlinux-3624ba7b5e2acc02b01301ea5fd3534971eb9896.tar.xz
sched/numa-balancing: Move some document to make it consistent with the code
After commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs"), some NUMA balancing sysctls enclosed with SCHED_DEBUG has been moved to debugfs. This patch move the document for these sysctls from Documentation/admin-guide/sysctl/kernel.rst to Documentation/scheduler/sched-debug.rst to make the document consistent with the code. Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lkml.kernel.org/r/20220210052514.3038279-1-ying.huang@intel.com
Diffstat (limited to 'Documentation/scheduler')
-rw-r--r--Documentation/scheduler/index.rst1
-rw-r--r--Documentation/scheduler/sched-debug.rst54
2 files changed, 55 insertions, 0 deletions
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 88900aabdbf7..30cca8a37b3b 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -17,6 +17,7 @@ Linux Scheduler
sched-nice-design
sched-rt-group
sched-stats
+ sched-debug
text_files
diff --git a/Documentation/scheduler/sched-debug.rst b/Documentation/scheduler/sched-debug.rst
new file mode 100644
index 000000000000..4d3d24f2a439
--- /dev/null
+++ b/Documentation/scheduler/sched-debug.rst
@@ -0,0 +1,54 @@
+=================
+Scheduler debugfs
+=================
+
+Booting a kernel with CONFIG_SCHED_DEBUG=y will give access to
+scheduler specific debug files under /sys/kernel/debug/sched. Some of
+those files are described below.
+
+numa_balancing
+==============
+
+`numa_balancing` directory is used to hold files to control NUMA
+balancing feature. If the system overhead from the feature is too
+high then the rate the kernel samples for NUMA hinting faults may be
+controlled by the `scan_period_min_ms, scan_delay_ms,
+scan_period_max_ms, scan_size_mb` files.
+
+
+scan_period_min_ms, scan_delay_ms, scan_period_max_ms, scan_size_mb
+-------------------------------------------------------------------
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These files control the thresholds for scan delays and
+the number of pages scanned.
+
+``scan_period_min_ms`` is the minimum time in milliseconds to scan a
+tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
+
+``scan_delay_ms`` is the starting "scan delay" used for a task when it
+initially forks.
+
+``scan_period_max_ms`` is the maximum time in milliseconds to scan a
+tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
+
+``scan_size_mb`` is how many megabytes worth of pages are scanned for
+a given scan.