diff options
Diffstat (limited to 'Documentation/admin-guide/sysctl')
-rw-r--r-- | Documentation/admin-guide/sysctl/fs.rst | 37 | ||||
-rw-r--r-- | Documentation/admin-guide/sysctl/kernel.rst | 87 | ||||
-rw-r--r-- | Documentation/admin-guide/sysctl/vm.rst | 55 |
3 files changed, 141 insertions, 38 deletions
diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst index 30c61474dec5..6c54718c9d04 100644 --- a/Documentation/admin-guide/sysctl/fs.rst +++ b/Documentation/admin-guide/sysctl/fs.rst @@ -41,7 +41,7 @@ pre-allocation or re-sizing of any kernel data structures. dentry-negative ---------------------------- -Policy for negative dentries. Set to 1 to to always delete the dentry when a +Policy for negative dentries. Set to 1 to always delete the dentry when a file is removed, and 0 to disable it. By default, this behavior is disabled. dentry-state @@ -337,3 +337,38 @@ Each "watch" costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes on a 64-bit one. The current default value for ``max_user_watches`` is 4% of the available low memory, divided by the "watch" cost in bytes. + +5. /proc/sys/fs/fuse - Configuration options for FUSE filesystems +===================================================================== + +This directory contains the following configuration options for FUSE +filesystems: + +``/proc/sys/fs/fuse/max_pages_limit`` is a read/write file for +setting/getting the maximum number of pages that can be used for servicing +requests in FUSE. + +``/proc/sys/fs/fuse/default_request_timeout`` is a read/write file for +setting/getting the default timeout (in seconds) for a fuse server to +reply to a kernel-issued request in the event where the server did not +specify a timeout at mount. If the server set a timeout, +then default_request_timeout will be ignored. The default +"default_request_timeout" is set to 0. 0 indicates no default timeout. +The maximum value that can be set is 65535. + +``/proc/sys/fs/fuse/max_request_timeout`` is a read/write file for +setting/getting the maximum timeout (in seconds) for a fuse server to +reply to a kernel-issued request. A value greater than 0 automatically opts +the server into a timeout that will be set to at most "max_request_timeout", +even if the server did not specify a timeout and default_request_timeout is +set to 0. If max_request_timeout is greater than 0 and the server set a timeout +greater than max_request_timeout or default_request_timeout is set to a value +greater than max_request_timeout, the system will use max_request_timeout as the +timeout. 0 indicates no max request timeout. The maximum value that can be set +is 65535. + +For timeouts, if the server does not respond to the request by the time +the set timeout elapses, then the connection to the fuse server will be aborted. +Please note that the timeouts are not 100% precise (eg you may set 60 seconds but +the timeout may kick in after 70 seconds). The upper margin of error for the +timeout is roughly FUSE_TIMEOUT_TIMER_FREQ seconds. diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index f8bc1630eba0..8b49eab937d0 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -177,6 +177,7 @@ core_pattern %E executable path %c maximum size of core file by resource limit RLIMIT_CORE %C CPU the task ran on + %F pidfd number %<OTHER> both are dropped ======== ========================================== @@ -212,6 +213,17 @@ pid>/``). This value defaults to 0. +core_sort_vma +============= + +The default coredump writes VMAs in address order. By setting +``core_sort_vma`` to 1, VMAs will be written from smallest size +to largest size. This is known to break at least elfutils, but +can be handy when dealing with very large (and truncated) +coredumps where the more useful debugging details are included +in the smaller VMAs. + + core_uses_pid ============= @@ -401,6 +413,15 @@ The upper bound on the number of tasks that are checked. This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. +hung_task_detect_count +====================== + +Indicates the total number of tasks that have been detected as hung since +the system boot. + +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. + + hung_task_timeout_secs ====================== @@ -869,7 +890,7 @@ bit 1 print system memory info bit 2 print timer info bit 3 print locks info if ``CONFIG_LOCKDEP`` is on bit 4 print ftrace buffer -bit 5 print all printk messages in buffer +bit 5 replay all messages on consoles at the end of panic bit 6 print all CPUs backtrace (if available in the arch) bit 7 print only tasks in uninterruptible (blocked) state ===== ============================================ @@ -879,6 +900,24 @@ So for example to print tasks and memory info on panic, user can:: echo 3 > /proc/sys/kernel/panic_print +panic_sys_info +============== + +A comma separated list of extra information to be dumped on panic, +for example, "tasks,mem,timers,...". It is a human readable alternative +to 'panic_print'. Possible values are: + +============= =================================================== +tasks print all tasks info +mem print system memory info +timer print timers info +lock print locks info if CONFIG_LOCKDEP is on +ftrace print ftrace buffer +all_bt print all CPUs backtrace (if available in the arch) +blocked_tasks print only tasks in uninterruptible (blocked) state +============= =================================================== + + panic_on_rcu_stall ================== @@ -994,30 +1033,26 @@ perf_user_access (arm64 and riscv only) Controls user space access for reading perf event counters. -arm64 -===== +* for arm64 + The default value is 0 (access disabled). -The default value is 0 (access disabled). + When set to 1, user space can read performance monitor counter registers + directly. -When set to 1, user space can read performance monitor counter registers -directly. + See Documentation/arch/arm64/perf.rst for more information. -See Documentation/arch/arm64/perf.rst for more information. +* for riscv + When set to 0, user space access is disabled. -riscv -===== + The default value is 1, user space can read performance monitor counter + registers through perf, any direct access without perf intervention will trigger + an illegal instruction. -When set to 0, user space access is disabled. + When set to 2, which enables legacy mode (user space has direct access to cycle + and insret CSRs only). Note that this legacy value is deprecated and will be + removed once all user space applications are fixed. -The default value is 1, user space can read performance monitor counter -registers through perf, any direct access without perf intervention will trigger -an illegal instruction. - -When set to 2, which enables legacy mode (user space has direct access to cycle -and insret CSRs only). Note that this legacy value is deprecated and will be -removed once all user space applications are fixed. - -Note that the time CSR is always directly accessible to all modes. + Note that the time CSR is always directly accessible to all modes. pid_max ======= @@ -1090,7 +1125,8 @@ printk_ratelimit_burst While long term we enforce one message per `printk_ratelimit`_ seconds, we do allow a burst of messages to pass through. ``printk_ratelimit_burst`` specifies the number of messages we can -send before ratelimiting kicks in. +send before ratelimiting kicks in. After `printk_ratelimit`_ seconds +have elapsed, another burst of messages may be sent. The default value is 10 messages. @@ -1445,7 +1481,7 @@ stack_erasing ============= This parameter can be used to control kernel stack erasing at the end -of syscalls for kernels built with ``CONFIG_GCC_PLUGIN_STACKLEAK``. +of syscalls for kernels built with ``CONFIG_KSTACK_ERASE``. That erasing reduces the information which kernel stack leak bugs can reveal and blocks some uninitialized stack variable attacks. @@ -1453,7 +1489,7 @@ The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary. = ==================================================================== -0 Kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. +0 Kernel stack erasing is disabled, KSTACK_ERASE_METRICS are not updated. 1 Kernel stack erasing is enabled (default), it is performed before returning to the userspace at the end of syscalls. = ==================================================================== @@ -1535,6 +1571,13 @@ constant ``FUTEX_TID_MASK`` (0x3fffffff). If a value outside of this range is written to ``threads-max`` an ``EINVAL`` error occurs. +timer_migration +=============== + +When set to a non-zero value, attempt to migrate timers away from idle cpus to +allow them to remain in low power states longer. + +Default is set (1). traceoff_on_warning =================== diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index f48eaa98d22d..4d71211fdad8 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm: - compact_memory - compaction_proactiveness - compact_unevictable_allowed +- defrag_mode - dirty_background_bytes - dirty_background_ratio - dirty_bytes @@ -74,6 +75,7 @@ Currently, these files are in /proc/sys/vm: - unprivileged_userfaultfd - user_reserve_kbytes - vfs_cache_pressure +- vfs_cache_pressure_denom - watermark_boost_factor - watermark_scale_factor - zone_reclaim_mode @@ -130,6 +132,12 @@ to latency spikes in unsuspecting applications. The kernel employs various heuristics to avoid wasting CPU cycles if it detects that proactive compaction is not being effective. +Setting the value above 80 will, in addition to lowering the acceptable level +of fragmentation, make the compaction code more sensitive to increases in +fragmentation, i.e. compaction will trigger more often, but reduce +fragmentation by a smaller amount. +This makes the fragmentation level more stable over time. + Be careful when setting it to extreme values like 100, as that may cause excessive background compaction activity. @@ -145,6 +153,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due to compaction, which would block the task from becoming active until the fault is resolved. +defrag_mode +=========== + +When set to 1, the page allocator tries harder to avoid fragmentation +and maintain the ability to produce huge pages / higher-order pages. + +It is recommended to enable this right after boot, as fragmentation, +once it occurred, can be long-lasting or even permanent. dirty_background_bytes ====================== @@ -449,8 +465,8 @@ The minimum value is 1 (1/1 -> 100%). The value less than 1 completely disables protection of the pages. -max_map_count: -============== +max_map_count +============= This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling @@ -479,8 +495,8 @@ memory allocations. The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT. -memory_failure_early_kill: -========================== +memory_failure_early_kill +========================= Control how to kill processes when uncorrected memory error (typically a 2bit error in a memory module) is detected in the background by hardware @@ -1008,19 +1024,28 @@ vfs_cache_pressure This percentage value controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects. -At the default value of vfs_cache_pressure=100 the kernel will attempt to -reclaim dentries and inodes at a "fair" rate with respect to pagecache and -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will -never reclaim dentries and inodes due to memory pressure and this can easily -lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 -causes the kernel to prefer to reclaim dentries and inodes. +At the default value of vfs_cache_pressure=vfs_cache_pressure_denom the kernel +will attempt to reclaim dentries and inodes at a "fair" rate with respect to +pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the +kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, +the kernel will never reclaim dentries and inodes due to memory pressure and +this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure +beyond vfs_cache_pressure_denom causes the kernel to prefer to reclaim dentries +and inodes. + +Increasing vfs_cache_pressure significantly beyond vfs_cache_pressure_denom may +have negative performance impact. Reclaim code needs to take various locks to +find freeable directory and inode objects. When vfs_cache_pressure equals +(10 * vfs_cache_pressure_denom), it will look for ten times more freeable +objects than there are. -Increasing vfs_cache_pressure significantly beyond 100 may have negative -performance impact. Reclaim code needs to take various locks to find freeable -directory and inode objects. With vfs_cache_pressure=1000, it will look for -ten times more freeable objects than there are. +Note: This setting should always be used together with vfs_cache_pressure_denom. + +vfs_cache_pressure_denom +======================== +Defaults to 100 (minimum allowed value). Requires corresponding +vfs_cache_pressure setting to take effect. watermark_boost_factor ====================== |