author Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 2019-04-22 22:48:00 +0300
committer Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 2019-07-15 17:03:01 +0300
commit 570432470275c3da15b85362bc1461945b9c1919
tree aa20d1689748f3c044b260d52ade1b801c8a5cc2 /Documentation/sysctl
parent ec4b78a0e7dd4751423089b7cfd32168f9052377
docs: admin-guide: move sysctl directory to it
The stuff under sysctl describes /sys interface from userspace point of view. So, add it to the admin-guide and remove the :orphan: from its index file.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Diffstat (limited to 'Documentation/sysctl')
-rw-r--r-- Documentation/sysctl/abi.rst 67
-rw-r--r-- Documentation/sysctl/fs.rst 384
-rw-r--r-- Documentation/sysctl/index.rst 100
-rw-r--r-- Documentation/sysctl/kernel.rst 1177
-rw-r--r-- Documentation/sysctl/net.rst 461
-rw-r--r-- Documentation/sysctl/sunrpc.rst 25
-rw-r--r-- Documentation/sysctl/user.rst 78
-rw-r--r-- Documentation/sysctl/vm.rst 964
8 files changed, 0 insertions, 3256 deletions
diff --git a/Documentation/sysctl/abi.rst b/Documentation/sysctl/abi.rst
deleted file mode 100644
index 599bcde7f0b7..000000000000
--- a/Documentation/sysctl/abi.rst
+++ /dev/null
@@ -1,67 +0,0 @@
-================================
-Documentation for /proc/sys/abi/
-================================
-
-kernel version 2.6.0.test2
-
-Copyright (c) 2003, Fabian Frederick <ffrederick@users.sourceforge.net>
-
-For general info: index.rst.
-
-------------------------------------------------------------------------------
-
-This directory relates to binary emulation, also known as personality types
-or ABI. When a process is executed, it is linked to an exec_domain whose
-personality is defined using values available from /proc/sys/abi.
-You can find further details about abi in include/linux/personality.h.
-
-Here are the files featured in the 2.6 kernel:
-
-- defhandler_coff
-- defhandler_elf
-- defhandler_lcall7
-- defhandler_libcso
-- fake_utsname
-- trace
-
-defhandler_coff
----------------
-
-defined value:
- PER_SCOSVR3::
-
- 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE
-
-defhandler_elf
---------------
-
-defined value:
- PER_LINUX::
-
- 0
-
-defhandler_lcall7
------------------
-
-defined value:
- PER_SVR4::
-
- 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
-
-defhandler_libcso
------------------
-
-defined value:
- PER_SVR4::
-
- 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
-
-fake_utsname
-------------
-
-Unused
-
-trace
------
-
-Unused
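
The "defined value" entries in the abi.rst text above are bitwise ORs of a base personality number and flag bits. As a rough illustration only, the Python sketch below recomputes them. The numeric flag constants and the 0xff base mask do not appear in the removed file; they are assumed here to match include/uapi/linux/personality.h, so check them against your own headers. The helper name describe() is hypothetical.

```python
# Flag bits: assumed to match include/uapi/linux/personality.h.
MMAP_PAGE_ZERO = 0x0100000
SHORT_INODE = 0x1000000
WHOLE_SECONDS = 0x2000000
STICKY_TIMEOUTS = 0x4000000

# Base personality numbers occupy the low byte (assumed PER_MASK = 0xff).
PER_MASK = 0x00FF

# Values as listed in the removed abi.rst.
PER_LINUX = 0
PER_SVR4 = 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO
PER_SCOSVR3 = 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE

FLAGS = {
    "MMAP_PAGE_ZERO": MMAP_PAGE_ZERO,
    "SHORT_INODE": SHORT_INODE,
    "WHOLE_SECONDS": WHOLE_SECONDS,
    "STICKY_TIMEOUTS": STICKY_TIMEOUTS,
}


def describe(value):
    """Split a personality value into (base number, list of flag names)."""
    names = [name for name, bit in FLAGS.items() if value & bit]
    return value & PER_MASK, names


if __name__ == "__main__":
    for label, value in (("PER_LINUX", PER_LINUX),
                         ("PER_SVR4", PER_SVR4),
                         ("PER_SCOSVR3", PER_SCOSVR3)):
        print(label, hex(value), describe(value))
```

Under these assumptions, defhandler_lcall7 and defhandler_libcso share the same PER_SVR4 value, while defhandler_elf is simply 0.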
diff --git a/Documentation/sysctl/fs.rst b/Documentation/sysctl/fs.rst
deleted file mode 100644
index 2a45119e3331..000000000000
--- a/Documentation/sysctl/fs.rst
+++ /dev/null
@@ -1,384 +0,0 @@
-===============================
-Documentation for /proc/sys/fs/
-===============================
-
-kernel version 2.2.10
-
-Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
-
-Copyright (c) 2009, Shen Feng <shen@cn.fujitsu.com>
-
-For general info and legal blurb, please look in index.rst.
-
-------------------------------------------------------------------------------
-
-This file contains documentation for the sysctl files in
-/proc/sys/fs/ and is valid for Linux kernel version 2.2.
-
-The files in this directory can be used to tune and monitor
-miscellaneous and general things in the operation of the Linux
-kernel. Since some of the files _can_ be used to screw up your
-system, it is advisable to read both documentation and source
-before actually making adjustments.
-
-1. /proc/sys/fs
-===============
-
-Currently, these files are in /proc/sys/fs:
-
-- aio-max-nr
-- aio-nr
-- dentry-state
-- dquot-max
-- dquot-nr
-- file-max
-- file-nr
-- inode-max
-- inode-nr
-- inode-state
-- nr_open
-- overflowuid
-- overflowgid
-- pipe-user-pages-hard
-- pipe-user-pages-soft
-- protected_fifos
-- protected_hardlinks
-- protected_regular
-- protected_symlinks
-- suid_dumpable
-- super-max
-- super-nr
-
-
-aio-nr & aio-max-nr
--------------------
-
-aio-nr is the running total of the number of events specified on the
-io_setup system call for all currently active aio contexts. If aio-nr
-reaches aio-max-nr then io_setup will fail with EAGAIN. Note that
-raising aio-max-nr does not result in the pre-allocation or re-sizing
-of any kernel data structures.
-
-
-dentry-state
-------------
-
-From linux/include/linux/dcache.h::
-
-    struct dentry_stat_t dentry_stat {
-        int nr_dentry;
-        int nr_unused;
-        int age_limit;         /* age in seconds */
-        int want_pages;        /* pages requested by system */
-        int nr_negative;       /* # of unused negative dentries */
-        int dummy;             /* Reserved for future use */
-    };
-
-Dentries are dynamically allocated and deallocated.
-
-nr_dentry shows the total number of dentries allocated (active
-+ unused). nr_unused shows the number of dentries that are not
-actively used, but are saved in the LRU list for future reuse.
-
-Age_limit is the age in seconds after which dcache entries
-can be reclaimed when memory is short. Want_pages is
-nonzero when shrink_dcache_pages() has been called and the
-dcache isn't pruned yet.
-
-nr_negative shows the number of unused dentries that are also
-negative dentries which do not map to any files. Instead,
-they help speed up rejection of non-existing files provided
-by the users.
-
-
-dquot-max & dquot-nr
---------------------
-
-The file dquot-max shows the maximum number of cached disk
-quota entries.
-
-The file dquot-nr shows the number of allocated disk quota
-entries and the number of free disk quota entries.
-
-If the number of free cached disk quotas is very low and
-you have some awesome number of simultaneous system users,
-you might want to raise the limit.
-
-
-file-max & file-nr
-------------------
-
-The value in file-max denotes the maximum number of file
-handles that the Linux kernel will allocate. When you get lots
-of error messages about running out of file handles, you might
-want to increase this limit.
-
-Historically, the kernel was able to allocate file handles
-dynamically, but not to free them again. The three values in
-file-nr denote the number of allocated file handles, the number
-of allocated but unused file handles, and the maximum number of
-file handles. Linux 2.6 always reports 0 as the number of free
-file handles -- this is not an error, it just means that the
-number of allocated file handles exactly matches the number of
-used file handles.
-
-Attempts to allocate more file descriptors than file-max are
-reported with printk; look for "VFS: file-max limit <number>
-reached".
-
-
-nr_open
--------
-
-This denotes the maximum number of file handles a process can
-allocate. The default value is 1024*1024 (1048576), which should be
-enough for most machines. The actual limit depends on the RLIMIT_NOFILE
-resource limit.
-
-
-inode-max, inode-nr & inode-state
----------------------------------
-
-As with file handles, the kernel allocates the inode structures
-dynamically, but can't free them yet.
-
-The value in inode-max denotes the maximum number of inode
-handlers. This value should be 3-4 times larger than the value
-in file-max, since stdin, stdout and network sockets also
-need an inode struct to handle them. When you regularly run
-out of inodes, you need to increase this value.
-
-The file inode-nr contains the first two items from
-inode-state, so we'll skip to that file...
-
-Inode-state contains three actual numbers and four dummies.
-The actual numbers are, in order of appearance, nr_inodes,
-nr_free_inodes and preshrink.
-
-Nr_inodes stands for the number of inodes the system has
-allocated; this can be slightly more than inode-max because
-Linux allocates them one pageful at a time.
-
-Nr_free_inodes represents the number of free inodes (?) and
-preshrink is nonzero when nr_inodes > inode-max and the
-system needs to prune the inode list instead of allocating
-more.
-
-
-overflowgid & overflowuid
--------------------------
-
-Some filesystems only support 16-bit UIDs and GIDs, although in Linux
-UIDs and GIDs are 32 bits. When one of these filesystems is mounted
-with writes enabled, any UID or GID that would exceed 65535 is translated
-to a fixed value before being written to disk.
-
-These sysctls allow you to change the value of the fixed UID and GID.
-The default is 65534.
-
-
-pipe-user-pages-hard
---------------------
-
-Maximum total number of pages a non-privileged user may allocate for pipes.
-Once this limit is reached, no new pipes may be allocated until usage goes
-below the limit again. When set to 0, no limit is applied, which is the default
-setting.
-
-
-pipe-user-pages-soft
---------------------
-
-Maximum total number of pages a non-privileged user may allocate for pipes
-before the pipe size gets limited to a single page. Once this limit is reached,
-new pipes will be limited to a single page in size for this user in order to
-limit total memory usage, and trying to increase them using fcntl() will be
-denied until usage goes below the limit again. The default value allows
-allocating up to 1024 pipes at their default size. When set to 0, no limit is
-applied.
-
-
-protected_fifos
----------------
-
-The intent of this protection is to avoid unintentional writes to
-an attacker-controlled FIFO, where a program expected to create a regular
-file.
-
-When set to "0", writing to FIFOs is unrestricted.
-
-When set to "1" don't allow O_CREAT open on FIFOs that we don't own
-in world writable sticky directories, unless they are owned by the
-owner of the directory.
-
-When set to "2" it also applies to group writable sticky directories.
-
-This protection is based on the restrictions in Openwall.
-
-
-protected_hardlinks
---------------------
-
-A long-standing class of security issues is the hardlink-based
-time-of-check-time-of-use race, most commonly seen in world-writable
-directories like /tmp. The common method of exploitation of this flaw
-is to cross privilege boundaries when following a given hardlink (i.e. a
-root process follows a hardlink created by another user). Additionally,
-on systems without separated partitions, this stops unauthorized users
-from "pinning" vulnerable setuid/setgid files against being upgraded by
-the administrator, or linking to special files.
-
-When set to "0", hardlink creation behavior is unrestricted.
-
-When set to "1" hardlinks cannot be created by users if they do not
-already own the source file, or do not have read/write access to it.
-
-This protection is based on the restrictions in Openwall and grsecurity.
-
-
-protected_regular
------------------
-
-This protection is similar to protected_fifos, but it
-avoids writes to an attacker-controlled regular file, where a program
-expected to create one.
-
-When set to "0", writing to regular files is unrestricted.
-
-When set to "1" don't allow O_CREAT open on regular files that we
-don't own in world writable sticky directories, unless they are
-owned by the owner of the directory.
-
-When set to "2" it also applies to group writable sticky directories.
-
-
-protected_symlinks
-------------------
-
-A long-standing class of security issues is the symlink-based
-time-of-check-time-of-use race, most commonly seen in world-writable
-directories like /tmp. The common method of exploitation of this flaw
-is to cross privilege boundaries when following a given symlink (i.e. a
-root process follows a symlink belonging to another user). For a likely
-incomplete list of hundreds of examples across the years, please see:
-http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
-
-When set to "0", symlink following behavior is unrestricted.
-
-When set to "1" symlinks are permitted to be followed only when outside
-a sticky world-writable directory, or when the uid of the symlink and
-follower match, or when the directory owner matches the symlink's owner.
-
-This protection is based on the restrictions in Openwall and grsecurity.
-
-
-suid_dumpable:
---------------
-
-This value can be used to query and set the core dump mode for setuid
-or otherwise protected/tainted binaries. The modes are
-
-= ========== ===============================================================
-0 (default)  traditional behaviour. Any process which has changed
-             privilege levels or is execute only will not be dumped.
-1 (debug)    all processes dump core when possible. The core dump is
-             owned by the current user and no security is applied. This is
-             intended for system debugging situations only.
-             Ptrace is unchecked.
-             This is insecure as it allows regular users to examine the
-             memory contents of privileged processes.
-2 (suidsafe) any binary which normally would not be dumped is dumped
-             anyway, but only if the "core_pattern" kernel sysctl is set to
-             either a pipe handler or a fully qualified path. (For more
-             details on this limitation, see CVE-2006-2451.) This mode is
-             appropriate when administrators are attempting to debug
-             problems in a normal environment, and either have a core dump
-             pipe handler that knows to treat privileged core dumps with
-             care, or specific directory defined for catching core dumps.
-             If a core dump happens without a pipe handler or fully
-             qualified path, a message will be emitted to syslog warning
-             about the lack of a correct setting.
-= ========== ===============================================================
-
-
-super-max & super-nr
---------------------
-
-These numbers control the maximum number of superblocks, and
-thus the maximum number of mounted filesystems the kernel
-can have. You only need to increase super-max if you need to
-mount more filesystems than the current value in super-max
-allows you to.
-
-
-aio-nr & aio-max-nr
--------------------
-
-aio-nr shows the current system-wide number of asynchronous io
-requests. aio-max-nr allows you to change the maximum value
-aio-nr can grow to.
-
-
-mount-max
----------
-
-This denotes the maximum number of mounts that may exist
-in a mount namespace.
-
-
-
-2. /proc/sys/fs/binfmt_misc
-===========================
-
-Documentation for the files in /proc/sys/fs/binfmt_misc is
-in Documentation/admin-guide/binfmt-misc.rst.
-
-
-3. /proc/sys/fs/mqueue - POSIX message queues filesystem
-========================================================
-
-
-The "mqueue" filesystem provides the necessary kernel features to enable the
-creation of a user space library that implements the POSIX message queues
-API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System
-Interfaces specification.)
-
-The "mqueue" filesystem contains values for determining/setting the amount of
-resources used by the file system.
-
-/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the
-maximum number of message queues allowed on the system.
-
-/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the
-maximum number of messages in a queue. In fact, it is the limiting value
-for another (user) limit which is set in the mq_open invocation. This attribute
-of a queue must be less than or equal to msg_max.
-
-/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the
-maximum message size value (it is every message queue's attribute set during
-its creation).
-
-/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the
-default number of messages in a queue if the attr parameter of mq_open(2) is
-NULL. If it exceeds msg_max, the default value is initialized to msg_max.
-
-/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting
-the default message size if the attr parameter of mq_open(2) is NULL. If it
-exceeds msgsize_max, the default value is initialized to msgsize_max.
-
-4. /proc/sys/fs/epoll - Configuration options for the epoll interface
-=====================================================================
-
-This directory contains configuration options for the epoll(7) interface.
-
-max_user_watches
-----------------
-
-Every epoll file descriptor can store a number of files to be monitored
-for event readiness. Each one of these monitored files constitutes a "watch".
-This configuration option sets the maximum number of "watches" that are
-allowed for each user.
-Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes
-on a 64bit one.
-The current default value for max_user_watches is 1/32 of the available
-low memory, divided by the "watch" cost in bytes.
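
The max_user_watches default described at the end of the fs.rst text above is simple arithmetic: 1/32 of low memory, divided by the per-watch cost. The Python sketch below works through it with the approximate costs quoted in the text (roughly 90 bytes on 32-bit, roughly 160 bytes on 64-bit). The function name and the 4 GiB figure are hypothetical, and a real kernel derives the cost from its own structure sizes, so treat the result as an estimate only.

```python
def estimate_max_user_watches(low_memory_bytes, watch_cost_bytes):
    """Estimate the default: 1/32 of low memory divided by the watch cost."""
    budget = low_memory_bytes // 32      # memory set aside for watches
    return budget // watch_cost_bytes    # whole watches that fit in it


if __name__ == "__main__":
    # Hypothetical machine with 4 GiB of low memory on a 64-bit kernel.
    print(estimate_max_user_watches(4 * 2**30, 160))
```

On a running system the value actually in effect can be read from /proc/sys/fs/epoll/max_user_watches.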
diff --git a/Documentation/sysctl/index.rst b/Documentation/sysctl/index.rst
deleted file mode 100644
index efbcde8c1c9c..000000000000
--- a/Documentation/sysctl/index.rst
+++ /dev/null
@@ -1,100 +0,0 @@
-:orphan:
-
-===========================
-Documentation for /proc/sys
-===========================
-
-Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
-
-------------------------------------------------------------------------------
-
-'Why', I hear you ask, 'would anyone even _want_ documentation
-for them sysctl files? If anybody really needs it, it's all in
-the source...'
-
-Well, this documentation is written because some people either
-don't know they need to tweak something, or because they don't
-have the time or knowledge to read the source code.
-
-Furthermore, the programmers who built sysctl have built it to
-be actually used, not just for the fun of programming it :-)
-
-------------------------------------------------------------------------------
-
-Legal blurb:
-
-As usual, there are two main things to consider:
-
-1. you get what you pay for
-2. it's free
-
-The consequences are that I won't guarantee the correctness of
-this document, and if you come to me complaining about how you
-screwed up your system because of wrong documentation, I won't
-feel sorry for you. I might even laugh at you...
-
-But of course, if you _do_ manage to screw up your system using
-only the sysctl options used in this file, I'd like to hear of
-it. Not only to have a great laugh, but also to make sure that
-you're the last RTFMing person to screw up.
-
-In short, e-mail your suggestions, corrections and / or horror
-stories to: <riel@nl.linux.org>
-
-Rik van Riel.
-
---------------------------------------------------------------
-
-Introduction
-============
-
-Sysctl is a means of configuring certain aspects of the kernel
-at run-time, and the /proc/sys/ directory is there so that you
-don't even need special tools to do it!
-In fact, there are only four things needed to use these config
-facilities:
-
-- a running Linux system
-- root access
-- common sense (this is especially hard to come by these days)
-- knowledge of what all those values mean
-
-As a quick 'ls /proc/sys' will show, the directory consists of
-several (arch-dependent?) subdirs. Each subdir is mainly about
-one part of the kernel, so you can do configuration on a piece
-by piece basis, or just some 'thematic frobbing'.
-
-This documentation is about:
-
-=============== ===============================================================
-abi/            execution domains & personalities
-debug/          <empty>
-dev/            device specific information (eg dev/cdrom/info)
-fs/             specific filesystems
-                filehandle, inode, dentry and quota tuning
-                binfmt_misc <Documentation/admin-guide/binfmt-misc.rst>
-kernel/         global kernel info / tuning
-                miscellaneous stuff
-net/            networking stuff, for documentation look in:
-                <Documentation/networking/>
-proc/           <empty>
-sunrpc/         SUN Remote Procedure Call (NFS)
-vm/             memory management tuning
-                buffer and cache management
-user/           Per user per user namespace limits
-=============== ===============================================================
-
-These are the subdirs I have on my system. There might be more
-or other subdirs in another setup. If you see another dir, I'd
-really like to hear about it :-)
-
-.. toctree::
- :maxdepth: 1
-
- abi
- fs
- kernel
- net
- sunrpc
- user
- vm
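
The introduction in the index.rst text above notes that no special tools are needed because every setting is a file under /proc/sys/. A minimal Python sketch of that idea follows. The dotted-name convention (kernel.hostname for /proc/sys/kernel/hostname) is borrowed from the sysctl(8) tool and is not stated in the removed file; the function names are hypothetical, and names that themselves contain dots (some network interface entries) would need extra handling.

```python
PROC_SYS = "/proc/sys"


def sysctl_path(name, root=PROC_SYS):
    """Map a dotted name such as 'kernel.hostname' to its file path."""
    return "/".join([root.rstrip("/")] + name.split("."))


def read_sysctl(name, root=PROC_SYS):
    """Return the current value of a setting as a stripped string."""
    with open(sysctl_path(name, root)) as handle:
        return handle.read().strip()


def write_sysctl(name, value, root=PROC_SYS):
    """Write a new value; on a real system this normally needs root access."""
    with open(sysctl_path(name, root), "w") as handle:
        handle.write(f"{value}\n")


if __name__ == "__main__":
    print(sysctl_path("fs.file-max"))
```

The root parameter exists only so the helpers can be pointed at a scratch directory for experiments instead of the live /proc/sys tree.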
diff --git a/Documentation/sysctl/kernel.rst b/Documentation/sysctl/kernel.rst
deleted file mode 100644
index a0c1d4ce403a..000000000000
--- a/Documentation/sysctl/kernel.rst
+++ /dev/null
@@ -1,1177 +0,0 @@
-===================================
-Documentation for /proc/sys/kernel/
-===================================
-
-kernel version 2.2.10
-
-Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
-
-Copyright (c) 2009, Shen Feng <shen@cn.fujitsu.com>
-
-For general info and legal blurb, please look in index.rst.
-
-------------------------------------------------------------------------------
-
-This file contains documentation for the sysctl files in
-/proc/sys/kernel/ and is valid for Linux kernel version 2.2.
-
-The files in this directory can be used to tune and monitor
-miscellaneous and general things in the operation of the Linux
-kernel. Since some of the files _can_ be used to screw up your
-system, it is advisable to read both documentation and source
-before actually making adjustments.
-
-Currently, these files might (depending on your configuration)
-show up in /proc/sys/kernel:
-
-- acct
-- acpi_video_flags
-- auto_msgmni
-- bootloader_type [ X86 only ]
-- bootloader_version [ X86 only ]
-- cap_last_cap
-- core_pattern
-- core_pipe_limit
-- core_uses_pid
-- ctrl-alt-del
-- dmesg_restrict
-- domainname
-- hostname
-- hotplug
-- hardlockup_all_cpu_backtrace
-- hardlockup_panic
-- hung_task_panic
-- hung_task_check_count
-- hung_task_timeout_secs
-- hung_task_check_interval_secs
-- hung_task_warnings
-- hyperv_record_panic_msg
-- kexec_load_disabled
-- kptr_restrict
-- l2cr [ PPC only ]
-- modprobe ==> Documentation/debugging-modules.txt
-- modules_disabled
-- msg_next_id [ sysv ipc ]
-- msgmax
-- msgmnb
-- msgmni
-- nmi_watchdog
-- osrelease
-- ostype
-- overflowgid
-- overflowuid
-- panic
-- panic_on_oops
-- panic_on_stackoverflow
-- panic_on_unrecovered_nmi
-- panic_on_warn
-- panic_print
-- panic_on_rcu_stall
-- perf_cpu_time_max_percent
-- perf_event_paranoid
-- perf_event_max_stack
-- perf_event_mlock_kb
-- perf_event_max_contexts_per_stack
-- pid_max
-- powersave-nap [ PPC only ]
-- printk
-- printk_delay
-- printk_ratelimit
-- printk_ratelimit_burst
-- pty ==> Documentation/filesystems/devpts.txt
-- randomize_va_space
-- real-root-dev ==> Documentation/admin-guide/initrd.rst
-- reboot-cmd [ SPARC only ]
-- rtsig-max
-- rtsig-nr
-- sched_energy_aware
-- seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst
-- sem
-- sem_next_id [ sysv ipc ]
-- sg-big-buff [ generic SCSI device (sg) ]
-- shm_next_id [ sysv ipc ]
-- shm_rmid_forced
-- shmall
-- shmmax [ sysv ipc ]
-- shmmni
-- softlockup_all_cpu_backtrace
-- soft_watchdog
-- stack_erasing
-- stop-a [ SPARC only ]
-- sysrq ==> Documentation/admin-guide/sysrq.rst
-- sysctl_writes_strict
-- tainted ==> Documentation/admin-guide/tainted-kernels.rst
-- threads-max
-- unknown_nmi_panic
-- watchdog
-- watchdog_thresh
-- version
-
-
-acct:
-=====
-
-highwater lowwater frequency
-
-If BSD-style process accounting is enabled, these values control
-its behaviour. If free space on the filesystem where the log lives
-goes below <lowwater>%, accounting suspends. If free space gets
-above <highwater>%, accounting resumes. <Frequency> determines
-how often we check the amount of free space (value is in
-seconds). Default:
-4 2 30
-That is, suspend accounting if <= 2% is left free; resume it
-if we get >= 4%; consider information about the amount of free space
-valid for 30 seconds.
-
-
-acpi_video_flags:
-=================
-
-flags
-
-See Doc*/kernel/power/video.txt, it allows mode of video boot to be
-set during run time.
-
-
-auto_msgmni:
-============
-
-This variable has no effect and may be removed in future kernel
-releases. Reading it always returns 0.
-Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni
-upon memory add/remove or upon ipc namespace creation/removal.
-Echoing "1" into this file enabled msgmni automatic recomputing.
-Echoing "0" turned it off. auto_msgmni default value was 1.
-
-
-bootloader_type:
-================
-
-x86 bootloader identification
-
-This gives the bootloader type number as indicated by the bootloader,
-shifted left by 4, and OR'd with the low four bits of the bootloader
-version. The reason for this encoding is that this used to match the
-type_of_loader field in the kernel header; the encoding is kept for
-backwards compatibility. That is, if the full bootloader type number
-is 0x15 and the full version number is 0x234, this file will contain
-the value 340 = 0x154.
-
-See the type_of_loader and ext_loader_type fields in
-Documentation/x86/boot.rst for additional information.
-
-
-bootloader_version:
-===================
-
-x86 bootloader version
-
-The complete bootloader version number. In the example above, this
-file will contain the value 564 = 0x234.
-
-See the type_of_loader and ext_loader_ver fields in
-Documentation/x86/boot.rst for additional information.
-
-
-cap_last_cap:
-=============
-
-Highest valid capability of the running kernel. Exports
-CAP_LAST_CAP from the kernel.
-
-
-core_pattern:
-=============
-
-core_pattern is used to specify a core dumpfile pattern name.
-
-* max length 127 characters; default value is "core"
-* core_pattern is used as a pattern template for the output filename;
- certain string patterns (beginning with '%') are substituted with
- their actual values.
-* backward compatibility with core_uses_pid:
-
- If core_pattern does not include "%p" (default does not)
- and core_uses_pid is set, then .PID will be appended to
- the filename.
-
-* corename format specifiers::
-
-        %<NUL>    '%' is dropped
-        %%        output one '%'
-        %p        pid
-        %P        global pid (init PID namespace)
-        %i        tid
-        %I        global tid (init PID namespace)
-        %u        uid (in initial user namespace)
-        %g        gid (in initial user namespace)
-        %d        dump mode, matches PR_SET_DUMPABLE and
-                  /proc/sys/fs/suid_dumpable
-        %s        signal number
-        %t        UNIX time of dump
-        %h        hostname
-        %e        executable filename (may be shortened)
-        %E        executable path
-        %<OTHER>  both are dropped
-
-* If the first character of the pattern is a '|', the kernel will treat
- the rest of the pattern as a command to run. The core dump will be
- written to the standard input of that program instead of to a file.
-
-
-core_pipe_limit:
-================
-
-This sysctl is only applicable when core_pattern is configured to pipe
-core files to a user space helper (when the first character of
-core_pattern is a '|', see above). When collecting cores via a pipe
-to an application, it is occasionally useful for the collecting
-application to gather data about the crashing process from its
-/proc/pid directory. In order to do this safely, the kernel must wait
-for the collecting process to exit, so as not to remove the crashing
-process's proc files prematurely. This in turn creates the
-possibility that a misbehaving userspace collecting process can block
-the reaping of a crashed process simply by never exiting. This sysctl
-defends against that. It defines how many concurrent crashing
-processes may be piped to user space applications in parallel. If
-this value is exceeded, then those crashing processes above that value
-are noted via the kernel log and their cores are skipped. 0 is a
-special value, indicating that unlimited processes may be captured in
-parallel, but that no waiting will take place (i.e. the collecting
-process is not guaranteed access to /proc/<crashing pid>/). This
-value defaults to 0.
-
-
-core_uses_pid:
-==============
-
-The default coredump filename is "core". By setting
-core_uses_pid to 1, the coredump filename becomes core.PID.
-If core_pattern does not include "%p" (default does not)
-and core_uses_pid is set, then .PID will be appended to
-the filename.
-
-
-ctrl-alt-del:
-=============
-
-When the value in this file is 0, ctrl-alt-del is trapped and
-sent to the init(1) program to handle a graceful restart.
-When, however, the value is > 0, Linux's reaction to a Vulcan
-Nerve Pinch (tm) will be an immediate reboot, without even
-syncing its dirty buffers.
-
-Note:
- when a program (like dosemu) has the keyboard in 'raw'
- mode, the ctrl-alt-del is intercepted by the program before it
- ever reaches the kernel tty layer, and it's up to the program
- to decide what to do with it.
-
-
-dmesg_restrict:
-===============
-
-This toggle indicates whether unprivileged users are prevented
-from using dmesg(8) to view messages from the kernel's log buffer.
-When dmesg_restrict is set to (0) there are no restrictions. When
-dmesg_restrict is set to (1), users must have CAP_SYSLOG to use
-dmesg(8).
-
-The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the
-default value of dmesg_restrict.
-
-
-domainname & hostname:
-======================
-
-These files can be used to set the NIS/YP domainname and the
-hostname of your box in exactly the same way as the commands
-domainname and hostname, i.e.::
-
- # echo "darkstar" > /proc/sys/kernel/hostname
- # echo "mydomain" > /proc/sys/kernel/domainname
-
-has the same effect as::
-
- # hostname "darkstar"
- # domainname "mydomain"
-
-Note, however, that the classic darkstar.frop.org has the
-hostname "darkstar" and DNS (Internet Domain Name Server)
-domainname "frop.org", not to be confused with the NIS (Network
-Information Service) or YP (Yellow Pages) domainname. These two
-domain names are in general different. For a detailed discussion
-see the hostname(1) man page.
-
-
-hardlockup_all_cpu_backtrace:
-=============================
-
-This value controls the hard lockup detector behavior when a hard
-lockup condition is detected as to whether or not to gather further
-debug information. If enabled, arch-specific all-CPU stack dumping
-will be initiated.
-
-0: do nothing. This is the default behavior.
-
-1: on detection capture more debug information.
-
-
-hardlockup_panic:
-=================
-
-This parameter can be used to control whether the kernel panics
-when a hard lockup is detected.
-
- 0 - don't panic on hard lockup
- 1 - panic on hard lockup
-
-See Documentation/lockup-watchdogs.txt for more information. This can
-also be set using the nmi_watchdog kernel parameter.
-
-
-hotplug:
-========
-
-Path for the hotplug policy agent.
-Default value is "/sbin/hotplug".
-
-
-hung_task_panic:
-================
-
-Controls the kernel's behavior when a hung task is detected.
-This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
-
-0: continue operation. This is the default behavior.
-
-1: panic immediately.
-
-
-hung_task_check_count:
-======================
-
-The upper bound on the number of tasks that are checked.
-This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
-
-
-hung_task_timeout_secs:
-=======================
-
-When a task in D state has not been scheduled
-for longer than this value, report a warning.
-This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
-
-0: means infinite timeout - no checking done.
-
-Possible values to set are in range {0..LONG_MAX/HZ}.
-
-
-hung_task_check_interval_secs:
-==============================
-
-Hung task check interval. If hung task checking is enabled
-(see hung_task_timeout_secs), the check is done every
-hung_task_check_interval_secs seconds.
-This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
-
-0 (default): means use hung_task_timeout_secs as checking interval.
-Possible values to set are in range {0..LONG_MAX/HZ}.
-
-
-hung_task_warnings:
-===================
-
-The maximum number of warnings to report. During a check interval
-if a hung task is detected, this value is decreased by 1.
-When this value reaches 0, no more warnings will be reported.
-This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
-
--1: report an infinite number of warnings.
-
-
-hyperv_record_panic_msg:
-========================
-
-Controls whether the panic kmsg data should be reported to Hyper-V.
-
-0: do not report panic kmsg data.
-
-1: report the panic kmsg data. This is the default behavior.
-
-
-kexec_load_disabled:
-====================
-
-A toggle indicating if the kexec_load syscall has been disabled. This
-value defaults to 0 (false: kexec_load enabled), but can be set to 1
-(true: kexec_load disabled). Once true, kexec can no longer be used, and
-the toggle cannot be set back to false. This allows a kexec image to be
-loaded before disabling the syscall, allowing a system to set up (and
-later use) an image without it being altered. Generally used together
-with the "modules_disabled" sysctl.
-
-
-kptr_restrict:
-==============
-
-This toggle indicates whether restrictions are placed on
-exposing kernel addresses via /proc and other interfaces.
-
-When kptr_restrict is set to 0 (the default) the address is hashed before
-printing. (This is the equivalent to %p.)
-
-When kptr_restrict is set to (1), kernel pointers printed using the %pK
-format specifier will be replaced with 0's unless the user has CAP_SYSLOG
-and effective user and group ids are equal to the real ids. This is
-because %pK checks are done at read() time rather than open() time, so
-if permissions are elevated between the open() and the read() (e.g. via
-a setuid binary) then %pK will not leak kernel pointers to unprivileged
-users. Note, this is a temporary solution only. The correct long-term
-solution is to do the permission checks at open() time. Consider removing
-world read permissions from files that use %pK, and using dmesg_restrict
-to protect against uses of %pK in dmesg(8) if leaking kernel pointer
-values to unprivileged users is a concern.
-
-When kptr_restrict is set to (2), kernel pointers printed using
-%pK will be replaced with 0's regardless of privileges.
-
-
-l2cr: (PPC only)
-================
-
-This flag controls the L2 cache of G3 processor boards. If
-0, the cache is disabled. Enabled if nonzero.
-
-
-modules_disabled:
-=================
-
-A toggle value indicating if modules are allowed to be loaded
-in an otherwise modular kernel. This toggle defaults to off
-(0), but can be set true (1). Once true, modules can be
-neither loaded nor unloaded, and the toggle cannot be set back
-to false. Generally used with the "kexec_load_disabled" toggle.
-
-
-msg_next_id, sem_next_id, and shm_next_id:
-==========================================
-
-These three toggles allow specifying the desired id for the next allocated
-IPC object: message, semaphore or shared memory respectively.
-
-By default they are equal to -1, which means generic allocation logic.
-Possible values to set are in range {0..INT_MAX}.
-
-Notes:
-  1) The kernel does not guarantee that the new object will have the desired
-     id, so it is up to userspace to handle an object with the "wrong" id.
-  2) A toggle with a non-default value will be set back to -1 by the kernel
-     after a successful IPC object allocation. If an IPC object allocation
-     syscall fails, it is undefined whether the value remains unmodified or
-     is reset to -1.
-
-
-nmi_watchdog:
-=============
-
-This parameter can be used to control the NMI watchdog
-(i.e. the hard lockup detector) on x86 systems.
-
-0 - disable the hard lockup detector
-
-1 - enable the hard lockup detector
-
-The hard lockup detector monitors each CPU for its ability to respond to
-timer interrupts. The mechanism utilizes CPU performance counter registers
-that are programmed to generate Non-Maskable Interrupts (NMIs) periodically
-while a CPU is busy. Hence, the alternative name 'NMI watchdog'.
-
-The NMI watchdog is disabled by default if the kernel is running as a guest
-in a KVM virtual machine. This default can be overridden by adding::
-
- nmi_watchdog=1
-
-to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst).
-
-
-numa_balancing:
-===============
-
-Enables/disables automatic page fault based NUMA memory
-balancing. Memory is moved automatically to nodes
-that access it often.
-
-Enables/disables automatic NUMA memory balancing. On NUMA machines, there
-is a performance penalty if remote memory is accessed by a CPU. When this
-feature is enabled the kernel samples what task thread is accessing memory
-by periodically unmapping pages and later trapping a page fault. At the
-time of the page fault, it is determined if the data being accessed should
-be migrated to a local memory node.
-
-The unmapping of pages and trapping faults incur additional overhead that
-ideally is offset by improved memory locality but there is no universal
-guarantee. If the target workload is already bound to NUMA nodes then this
-feature should be disabled. Otherwise, if the system overhead from the
-feature is too high then the rate the kernel samples for NUMA hinting
-faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls.
-
-numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
-===============================================================================================================================
-
-
-Automatic NUMA balancing scans a task's address space and unmaps pages to
-detect if pages are properly placed or if the data should be migrated to a
-memory node local to where the task is running. Every "scan delay" the task
-scans the next "scan size" number of pages in its address space. When the
-end of the address space is reached the scanner restarts from the beginning.
-
-In combination, the "scan delay" and "scan size" determine the scan rate.
-When "scan delay" decreases, the scan rate increases. The scan delay and
-hence the scan rate of every task is adaptive and depends on historical
-behaviour. If pages are properly placed then the scan delay increases,
-otherwise the scan delay decreases. The "scan size" is not adaptive but
-the higher the "scan size", the higher the scan rate.
-
-Higher scan rates incur higher system overhead as page faults must be
-trapped and potentially data must be migrated. However, the higher the scan
-rate, the more quickly a task's memory is migrated to a local node if the
-workload pattern changes and minimises performance impact due to remote
-memory accesses. These sysctls control the thresholds for scan delays and
-the number of pages scanned.
-
-numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
-scan a task's virtual memory. It effectively controls the maximum scanning
-rate for each task.
-
-numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
-when it initially forks.
-
-numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
-scan a task's virtual memory. It effectively controls the minimum scanning
-rate for each task.
-
-numa_balancing_scan_size_mb is how many megabytes worth of pages are
-scanned for a given scan.
-
-
-osrelease, ostype & version:
-============================
-
-::
-
- # cat osrelease
- 2.1.88
- # cat ostype
- Linux
- # cat version
- #5 Wed Feb 25 21:49:24 MET 1998
-
-The files osrelease and ostype should be clear enough. Version
-needs a little more clarification however. The '#5' means that
-this is the fifth kernel built from this source base and the
-date behind it indicates the time the kernel was built.
-The only way to tune these values is to rebuild the kernel :-)
-
-
-overflowgid & overflowuid:
-==========================
-
-If your architecture did not always support 32-bit UIDs (i.e. arm,
-i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to
-applications that use the old 16-bit UID/GID system calls, if the
-actual UID or GID would exceed 65535.
-
-These sysctls allow you to change the value of the fixed UID and GID.
-The default is 65534.
-
-
-panic:
-======
-
-The value in this file represents the number of seconds the kernel
-waits before rebooting on a panic. When you use the software watchdog,
-the recommended setting is 60.
-
-
-panic_on_io_nmi:
-================
-
-Controls the kernel's behavior when a CPU receives an NMI caused by
-an IO error.
-
-0: try to continue operation (default)
-
-1: panic immediately. The IO error triggered an NMI. This indicates a
- serious system condition which could result in IO data corruption.
- Rather than continuing, panicking might be a better choice. Some
- servers issue this sort of NMI when the dump button is pushed,
- and you can use this option to take a crash dump.
-
-
-panic_on_oops:
-==============
-
-Controls the kernel's behaviour when an oops or BUG is encountered.
-
-0: try to continue operation
-
-1: panic immediately. If the `panic` sysctl is also non-zero then the
- machine will be rebooted.
-
-
-panic_on_stackoverflow:
-=======================
-
-Controls the kernel's behavior when it detects an overflow of a
-kernel, IRQ or exception stack (user stacks are not covered).
-This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled.
-
-0: try to continue operation.
-
-1: panic immediately.
-
-
-panic_on_unrecovered_nmi:
-=========================
-
-By default, Linux continues operation after an NMI caused by a memory
-error or by an unknown source. For many environments such as scientific
-computing it is preferable to take the box out of service and deal with
-the error than to let an uncorrected parity/ECC error propagate.
-
-A small number of systems do generate NMIs for bizarre random reasons
-such as power management, so the default is off. This sysctl works like
-the other panic controls in this directory.
-
-
-panic_on_warn:
-==============
-
-Calls panic() in the WARN() path when set to 1. This is useful to avoid
-a kernel rebuild when attempting to kdump at the location of a WARN().
-
-0: only WARN(), default behaviour.
-
-1: call panic() after printing out WARN() location.
-
-
-panic_print:
-============
-
-Bitmask for printing system info when a panic happens. The user can choose
-a combination of the following bits:
-
-===== ========================================
-bit 0 print all tasks info
-bit 1 print system memory info
-bit 2 print timer info
-bit 3 print locks info if CONFIG_LOCKDEP is on
-bit 4 print ftrace buffer
-===== ========================================
-
-So for example to print tasks and memory info on panic, user can::
-
- echo 3 > /proc/sys/kernel/panic_print
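-
-The bitmask arithmetic can be sketched in a few lines of Python. This is an
-illustrative helper only, not a kernel or distribution tool; the names
-PANIC_PRINT_BITS and panic_print_mask are made up for this example, and the
-bit positions are taken from the table above.
-

```python
# Bit positions as listed in the panic_print table (hypothetical helper names).
PANIC_PRINT_BITS = {
    "tasks": 0,    # print all tasks info
    "memory": 1,   # print system memory info
    "timers": 2,   # print timer info
    "locks": 3,    # print locks info (only if CONFIG_LOCKDEP is on)
    "ftrace": 4,   # print ftrace buffer
}


def panic_print_mask(*names):
    """Return the integer that selects the named kinds of panic output."""
    mask = 0
    for name in names:
        mask |= 1 << PANIC_PRINT_BITS[name]
    return mask


print(panic_print_mask("tasks", "memory"))  # 3
```

-
-The printed value is the number that would be written to the file, as in
-the echo example above.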
-
-
-panic_on_rcu_stall:
-===================
-
-When set to 1, calls panic() after RCU stall detection messages. This
-is useful to define the root cause of RCU stalls using a vmcore.
-
-0: do not panic() when RCU stall takes place, default behavior.
-
-1: panic() after printing RCU stall messages.
-
-
-perf_cpu_time_max_percent:
-==========================
-
-Hints to the kernel how much CPU time it should be allowed to
-use to handle perf sampling events. If the perf subsystem
-is informed that its samples are exceeding this limit, it
-will drop its sampling frequency to attempt to reduce its CPU
-usage.
-
-Some perf sampling happens in NMIs. If these samples
-unexpectedly take too long to execute, the NMIs can become
-stacked up next to each other so much that nothing else is
-allowed to execute.
-
-0:
- disable the mechanism. Do not monitor or correct perf's
- sampling rate no matter how much CPU time it takes.
-
-1-100:
- attempt to throttle perf's sample rate to this
- percentage of CPU. Note: the kernel calculates an
- "expected" length of each sample event. 100 here means
- 100% of that expected length. Even if this is set to
- 100, you may still see sample throttling if this
- length is exceeded. Set to 0 if you truly do not care
- how much CPU is consumed.
-
-
-perf_event_paranoid:
-====================
-
-Controls use of the performance events system by unprivileged
-users (without CAP_SYS_ADMIN). The default value is 2.
-
-=== ==================================================================
- -1 Allow use of (almost) all events by all users
-
-    Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
-
->=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
-
-    Disallow raw tracepoint access by users without CAP_SYS_ADMIN
-
->=1 Disallow CPU event access by users without CAP_SYS_ADMIN
-
->=2 Disallow kernel profiling by users without CAP_SYS_ADMIN
-=== ==================================================================
-
-
-perf_event_max_stack:
-=====================
-
-Controls maximum number of stack frames to copy for (attr.sample_type &
-PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using
-'perf record -g' or 'perf trace --call-graph fp'.
-
-This can only be done when no events are in use that have callchains
-enabled, otherwise writing to this file will return -EBUSY.
-
-The default value is 127.
-
-
-perf_event_mlock_kb:
-====================
-
-Control size of per-cpu ring buffer not counted against mlock limit.
-
-The default value is 512 + 1 page
-
-
-perf_event_max_contexts_per_stack:
-==================================
-
-Controls maximum number of stack frame context entries for
-(attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for
-instance, when using 'perf record -g' or 'perf trace --call-graph fp'.
-
-This can only be done when no events are in use that have callchains
-enabled, otherwise writing to this file will return -EBUSY.
-
-The default value is 8.
-
-
-pid_max:
-========
-
-PID allocation wrap value. When the kernel's next PID value
-reaches this value, it wraps back to a minimum PID value.
-PIDs of value pid_max or larger are not allocated.
-
-
-ns_last_pid:
-============
-
-The last pid allocated in the current pid namespace (the one in which the
-task using this sysctl lives). When selecting a pid for the next task on
-fork, the kernel tries to allocate a number starting from this one.
-
-
-powersave-nap: (PPC only)
-=========================
-
-If set, Linux-PPC will use the 'nap' mode of powersaving,
-otherwise the 'doze' mode will be used.
-
-==============================================================
-
-printk:
-=======
-
-The four values in printk denote: console_loglevel,
-default_message_loglevel, minimum_console_loglevel and
-default_console_loglevel respectively.
-
-These values influence printk() behavior when printing or
-logging error messages. See 'man 2 syslog' for more info on
-the different loglevels.
-
-- console_loglevel:
- messages with a higher priority than
- this will be printed to the console
-- default_message_loglevel:
- messages without an explicit priority
- will be printed with this priority
-- minimum_console_loglevel:
- minimum (highest) value to which
- console_loglevel can be set
-- default_console_loglevel:
- default value for console_loglevel
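-
-As a rough illustration of how the four fields relate, the following Python
-sketch parses the contents of the file and checks whether a message of a
-given level would reach the console. The sample string "4 4 1 7" and the
-names PrintkLevels, parse_printk and reaches_console are assumptions made
-for this example, not part of any kernel interface.
-

```python
from typing import NamedTuple


class PrintkLevels(NamedTuple):
    console_loglevel: int
    default_message_loglevel: int
    minimum_console_loglevel: int
    default_console_loglevel: int


def parse_printk(text):
    """Parse the four whitespace-separated integers read from the printk file."""
    fields = [int(field) for field in text.split()]
    if len(fields) != 4:
        raise ValueError("expected four loglevel fields")
    return PrintkLevels(*fields)


def reaches_console(levels, message_level):
    """Lower numbers mean higher priority; only messages with a higher
    priority than console_loglevel are printed to the console."""
    return message_level < levels.console_loglevel


levels = parse_printk("4 4 1 7")   # assumed sample contents
print(reaches_console(levels, 3))  # True
print(reaches_console(levels, 4))  # False
```
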
-
-
-printk_delay:
-=============
-
-Delay each printk message by printk_delay milliseconds.
-
-Values from 0 to 10000 are allowed.
-
-
-printk_ratelimit:
-=================
-
-Some warning messages are rate limited. printk_ratelimit specifies
-the minimum length of time between these messages (in seconds). By
-default we allow one every 5 seconds.
-
-A value of 0 will disable rate limiting.
-
-
-printk_ratelimit_burst:
-=======================
-
-While long term we enforce one message per printk_ratelimit
-seconds, we do allow a burst of messages to pass through.
-printk_ratelimit_burst specifies the number of messages we can
-send before ratelimiting kicks in.
-
-
-printk_devkmsg:
-===============
-
-Control the logging to /dev/kmsg from userspace:
-
-ratelimit:
- default, ratelimited
-
-on: unlimited logging to /dev/kmsg from userspace
-
-off: logging to /dev/kmsg disabled
-
-The kernel command line parameter printk.devkmsg= overrides this and is
-a one-time setting until next reboot: once set, it cannot be changed by
-this sysctl interface anymore.
-
-
-randomize_va_space:
-===================
-
-This option can be used to select the type of process address
-space randomization that is used in the system, for architectures
-that support this feature.
-
-== ===========================================================================
-0  Turn the process address space randomization off. This is the
-   default for architectures that do not support this feature anyways,
-   and kernels that are booted with the "norandmaps" parameter.
-
-1  Make the addresses of mmap base, stack and VDSO page randomized.
-   This, among other things, implies that shared libraries will be
-   loaded to random addresses. Also for PIE-linked binaries, the
-   location of code start is randomized. This is the default if the
-   CONFIG_COMPAT_BRK option is enabled.
-
-2  Additionally enable heap randomization. This is the default if
-   CONFIG_COMPAT_BRK is disabled.
-
-   There are a few legacy applications out there (such as some ancient
-   versions of libc.so.5 from 1996) that assume that brk area starts
-   just after the end of the code+bss. These applications break when
-   start of the brk area is randomized. There are however no known
-   non-legacy applications that would be broken this way, so for most
-   systems it is safe to choose full randomization.
-
-   Systems with ancient and/or broken binaries should be configured
-   with CONFIG_COMPAT_BRK enabled, which excludes the heap from process
-   address space randomization.
-== ===========================================================================
-
-
-reboot-cmd: (Sparc only)
-========================
-
-??? This seems to be a way to give an argument to the Sparc
-ROM/Flash boot loader. Maybe to tell it what to do after
-rebooting. ???
-
-
-rtsig-max & rtsig-nr:
-=====================
-
-The file rtsig-max can be used to tune the maximum number
-of POSIX realtime (queued) signals that can be outstanding
-in the system.
-
-rtsig-nr shows the number of RT signals currently queued.
-
-
-sched_energy_aware:
-===================
-
-Enables/disables Energy Aware Scheduling (EAS). EAS starts
-automatically on platforms where it can run (that is,
-platforms with asymmetric CPU topologies and having an Energy
-Model available). If your platform happens to meet the
-requirements for EAS but you do not want to use it, change
-this value to 0.
-
-
-sched_schedstats:
-=================
-
-Enables/disables scheduler statistics. Enabling this feature
-incurs a small amount of overhead in the scheduler but is
-useful for debugging and performance tuning.
-
-
-sg-big-buff:
-============
-
-This file shows the size of the generic SCSI (sg) buffer.
-You can't tune it just yet, but you could change it on
-You can't tune it just yet, but you could change it at
-compile time by editing include/scsi/sg.h and changing
-the value of SG_BIG_BUFF.
-
-There shouldn't be any reason to change this value. If
-you can come up with one, you probably know what you
-are doing anyway :)
-
-
-shmall:
-=======
-
-This parameter sets the total amount of shared memory pages that
-can be used system wide. Hence, SHMALL should always be at least
-ceil(shmmax/PAGE_SIZE).
-
-If you are not sure what the default PAGE_SIZE is on your Linux
-system, you can run the following command::
-
- # getconf PAGE_SIZE
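-
-The lower bound ceil(shmmax/PAGE_SIZE) can be worked out as follows. This
-is a minimal sketch; the function name min_shmall_pages is hypothetical,
-and the 1 GiB shmmax and 4096-byte page size are assumed example values
-(use the getconf output for the real page size).
-

```python
def min_shmall_pages(shmmax_bytes, page_size):
    """Smallest shmall (in pages) that can hold one segment of shmmax bytes,
    i.e. ceil(shmmax / PAGE_SIZE) computed with integer arithmetic."""
    return -(-shmmax_bytes // page_size)


# Assumed example values: shmmax of 1 GiB and a 4096-byte page.
print(min_shmall_pages(1 << 30, 4096))  # 262144
```
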
-
-
-shmmax:
-=======
-
-This value can be used to query and set the run time limit
-on the maximum shared memory segment size that can be created.
-Shared memory segments up to 1Gb are now supported in the
-kernel. This value defaults to SHMMAX.
-
-
-shm_rmid_forced:
-================
-
-Linux lets you set resource limits, including how much memory one
-process can consume, via setrlimit(2). Unfortunately, shared memory
-segments are allowed to exist without association with any process, and
-thus might not be counted against any resource limits. If enabled,
-shared memory segments are automatically destroyed when their attach
-count becomes zero after a detach or a process termination. It will
-also destroy segments that were created, but never attached to, on exit
-from the process. The only use left for IPC_RMID is to immediately
-destroy an unattached segment. Of course, this breaks the way things are
-defined, so some applications might stop working. Note that this
-feature will do you no good unless you also configure your resource
-limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't
-need this.
-
-Note that if you change this from 0 to 1, already created segments
-without users and with a dead originative process will be destroyed.
-
-
-sysctl_writes_strict:
-=====================
-
-Control how file position affects the behavior of updating sysctl values
-via the /proc/sys interface:
-
- == ======================================================================
- -1 Legacy per-write sysctl value handling, with no printk warnings.
-    Each write syscall must fully contain the sysctl value to be
-    written, and multiple writes on the same sysctl file descriptor
-    will rewrite the sysctl value, regardless of file position.
-  0 Same behavior as above, but warn about processes that perform writes
-    to a sysctl file descriptor when the file position is not 0.
-  1 (default) Respect file position when writing sysctl strings. Multiple
-    writes will append to the sysctl value buffer. Anything past the max
-    length of the sysctl value buffer will be ignored. Writes to numeric
-    sysctl entries must always be at file position 0 and the value must
-    be fully contained in the buffer sent in the write syscall.
- == ======================================================================
-
-
-softlockup_all_cpu_backtrace:
-=============================
-
-This value controls the soft lockup detector thread's behavior
-when a soft lockup condition is detected as to whether or not
-to gather further debug information. If enabled, each cpu will
-be issued an NMI and instructed to capture a stack trace.
-
-This feature is only applicable for architectures which support
-NMI.
-
-0: do nothing. This is the default behavior.
-
-1: on detection capture more debug information.
-
-
-soft_watchdog:
-==============
-
-This parameter can be used to control the soft lockup detector.
-
- 0 - disable the soft lockup detector
-
- 1 - enable the soft lockup detector
-
-The soft lockup detector monitors CPUs for threads that are hogging the CPUs
-without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads
-from running. The mechanism depends on the CPU's ability to respond to timer
-interrupts which are needed for the 'watchdog/N' threads to be woken up by
-the watchdog timer function, otherwise the NMI watchdog - if enabled - can
-detect a hard lockup condition.
-
-
-stack_erasing:
-==============
-
-This parameter can be used to control kernel stack erasing at the end
-of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK.
-
-That erasing reduces the information which kernel stack leak bugs
-can reveal and blocks some uninitialized stack variable attacks.
-The tradeoff is the performance impact: on a single CPU system kernel
-compilation sees a 1% slowdown, other systems and workloads may vary.
-
- 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated.
-
- 1: kernel stack erasing is enabled (default), it is performed before
- returning to the userspace at the end of syscalls.
-
-
-tainted
-=======
-
-Non-zero if the kernel has been tainted. Numeric values, which can be
-ORed together. The letters are seen in "Tainted" line of Oops reports.
-
-====== ===== ==============================================================
-     1 `(P)` proprietary module was loaded
-     2 `(F)` module was force loaded
-     4 `(S)` SMP kernel oops on an officially SMP incapable processor
-     8 `(R)` module was force unloaded
-    16 `(M)` processor reported a Machine Check Exception (MCE)
-    32 `(B)` bad page referenced or some unexpected page flags
-    64 `(U)` taint requested by userspace application
-   128 `(D)` kernel died recently, i.e. there was an OOPS or BUG
-   256 `(A)` an ACPI table was overridden by user
-   512 `(W)` kernel issued warning
-  1024 `(C)` staging driver was loaded
-  2048 `(I)` workaround for bug in platform firmware applied
-  4096 `(O)` externally-built ("out-of-tree") module was loaded
-  8192 `(E)` unsigned module was loaded
- 16384 `(L)` soft lockup occurred
- 32768 `(K)` kernel has been live patched
- 65536 `(X)` Auxiliary taint, defined for and used by distros
-131072 `(T)` The kernel was built with the struct randomization plugin
-====== ===== ==============================================================
-
-See Documentation/admin-guide/tainted-kernels.rst for more information.
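-
-Because the values are ORed together, a reading such as 4097 has to be
-split back into its individual bits. The following Python sketch does that
-using the letters from the table above; TAINT_FLAGS and decode_taint are
-names invented for this example.
-

```python
# Letters in bit order (bit 0 = value 1 ... bit 17 = value 131072),
# copied from the table above. The names here are hypothetical.
TAINT_FLAGS = "PFSRMBUDAWCIOELKXT"


def decode_taint(value):
    """Return the list of taint letters whose bits are set in value."""
    return [
        TAINT_FLAGS[bit]
        for bit in range(len(TAINT_FLAGS))
        if value & (1 << bit)
    ]


print(decode_taint(4097))  # ['P', 'O']
```
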
-
-
-threads-max:
-============
-
-This value controls the maximum number of threads that can be created
-using fork().
-
-During initialization the kernel sets this value such that even if the
-maximum number of threads is created, the thread structures occupy only
-a part (1/8th) of the available RAM pages.
-
-The minimum value that can be written to threads-max is 20.
-
-The maximum value that can be written to threads-max is given by the
-constant FUTEX_TID_MASK (0x3fffffff).
-
-If a value outside of this range is written to threads-max an error
-EINVAL occurs.
-
-The value written is checked against the available RAM pages. If the
-thread structures would occupy too much (more than 1/8th) of the
-available RAM pages threads-max is reduced accordingly.
-
-
-unknown_nmi_panic:
-==================
-
-The value in this file affects how NMIs are handled. When the
-value is non-zero, an unknown NMI is trapped and a panic occurs. At
-that time, kernel debugging information is displayed on the console.
-
-For example, the NMI switch found on most IA32 servers raises an unknown
-NMI. If a system hangs, try pressing the NMI switch.
-
-
-watchdog:
-=========
-
-This parameter can be used to disable or enable the soft lockup detector
-_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time.
-
- 0 - disable both lockup detectors
-
- 1 - enable both lockup detectors
-
-The soft lockup detector and the NMI watchdog can also be disabled or
-enabled individually, using the soft_watchdog and nmi_watchdog parameters.
-If the watchdog parameter is read, for example by executing::
-
- cat /proc/sys/kernel/watchdog
-
-the output of this command (0 or 1) shows the logical OR of soft_watchdog
-and nmi_watchdog.
-
-
-watchdog_cpumask:
-=================
-
-This value can be used to control on which cpus the watchdog may run.
-The default cpumask is all possible cores, but if NO_HZ_FULL is
-enabled in the kernel config, and cores are specified with the
-nohz_full= boot argument, those cores are excluded by default.
-Offline cores can be included in this mask, and if the core is later
-brought online, the watchdog will be started based on the mask value.
-
-Typically this value would only be touched in the nohz_full case
-to re-enable cores that by default were not running the watchdog,
-if a kernel lockup was suspected on those cores.
-
-The argument value is the standard cpulist format for cpumasks,
-so for example to enable the watchdog on cores 0, 2, 3, and 4 you
-might say::
-
- echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask
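-
-To make the cpulist notation concrete, the sketch below expands a string
-such as "0,2-4" into individual core numbers. It is a simplified
-illustration that handles only single numbers and first-last ranges; the
-function name parse_cpulist is hypothetical and this is not the kernel's
-own parser.
-

```python
def parse_cpulist(text):
    """Expand a comma-separated list of CPU numbers and ranges
    (e.g. "0,2-4") into a sorted list of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty string or a stray comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


print(parse_cpulist("0,2-4"))  # [0, 2, 3, 4]
```
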
-
-
-watchdog_thresh:
-================
-
-This value can be used to control the frequency of hrtimer and NMI
-events and the soft and hard lockup thresholds. The default threshold
-is 10 seconds.
-
-The softlockup threshold is (2 * watchdog_thresh). Setting this
-tunable to zero will disable lockup detection altogether.
diff --git a/Documentation/sysctl/net.rst b/Documentation/sysctl/net.rst
deleted file mode 100644
index a7d44e71019d..000000000000
--- a/Documentation/sysctl/net.rst
+++ /dev/null
@@ -1,461 +0,0 @@
-================================
-Documentation for /proc/sys/net/
-================================
-
-Copyright
-
-Copyright (c) 1999
-
- - Terrehon Bowden <terrehon@pacbell.net>
- - Bodo Bauer <bb@ricochet.net>
-
-Copyright (c) 2000
-
- - Jorge Nerin <comandante@zaralinux.com>
-
-Copyright (c) 2009
-
- - Shen Feng <shen@cn.fujitsu.com>
-
-For general info and legal blurb, please look in index.rst.
-
-------------------------------------------------------------------------------
-
-This file contains the documentation for the sysctl files in
-/proc/sys/net
-
-The interface to the networking parts of the kernel is located in
-/proc/sys/net. The following table shows all possible subdirectories. You may
-see only some of them, depending on your kernel's configuration.
-
-
-Table : Subdirectories in /proc/sys/net
-
- ========= =================== = ========== ==================
- Directory Content               Directory  Content
- ========= =================== = ========== ==================
- core      General parameter     appletalk  Appletalk protocol
- unix      Unix domain sockets   netrom     NET/ROM
- 802       E802 protocol         ax25       AX25
- ethernet  Ethernet protocol     rose       X.25 PLP layer
- ipv4      IP version 4          x25        X.25 protocol
- ipx       IPX                   token-ring IBM token ring
- bridge    Bridging              decnet     DEC net
- ipv6      IP version 6          tipc       TIPC
- ========= =================== = ========== ==================
-
-1. /proc/sys/net/core - Network core options
-============================================
-
-bpf_jit_enable
---------------
-
-This enables the BPF Just in Time (JIT) compiler. BPF is a flexible
-and efficient infrastructure for executing bytecode at various
-hook points. It is used in a number of Linux kernel subsystems such
-as networking (e.g. XDP, tc), tracing (e.g. kprobes, uprobes, tracepoints)
-and security (e.g. seccomp). LLVM has a BPF back end that can compile
-restricted C into a sequence of BPF instructions. After program load
-through bpf(2) and passing a verifier in the kernel, a JIT will then
-translate these BPF proglets into native CPU instructions. There are
-two flavors of JITs, the newer eBPF JIT currently supported on:
-
- - x86_64
- - x86_32
- - arm64
- - arm32
- - ppc64
- - sparc64
- - mips64
- - s390x
- - riscv
-
-And the older cBPF JIT supported on the following archs:
-
- - mips
- - ppc
- - sparc
-
-eBPF JITs are a superset of cBPF JITs, meaning the kernel will
-migrate cBPF instructions into eBPF instructions and then JIT
-compile them transparently. Older cBPF JITs can only translate
-tcpdump filters, seccomp rules, etc, but not the eBPF
-programs loaded through bpf(2) mentioned above.
-
-Values:
-
- - 0 - disable the JIT (default value)
- - 1 - enable the JIT
- - 2 - enable the JIT and ask the compiler to emit traces on kernel log.
-
-bpf_jit_harden
---------------
-
-This enables hardening for the BPF JIT compiler. Supported are eBPF
-JIT backends. Enabling hardening trades off performance, but can
-mitigate JIT spraying.
-
-Values:
-
- - 0 - disable JIT hardening (default value)
- - 1 - enable JIT hardening for unprivileged users only
- - 2 - enable JIT hardening for all users
-
-bpf_jit_kallsyms
-----------------
-
-When the BPF JIT compiler is enabled, the addresses of compiled images are
-unknown to the kernel, meaning they neither show up in traces nor
-in /proc/kallsyms. This enables export of these addresses, which can
-be used for debugging/tracing. If bpf_jit_harden is enabled, this
-feature is disabled.
-
-Values :
-
- - 0 - disable JIT kallsyms export (default value)
- - 1 - enable JIT kallsyms export for privileged users only
-
-bpf_jit_limit
--------------
-
-This enforces a global limit for memory allocations to the BPF JIT
-compiler in order to reject unprivileged JIT requests once it has
-been surpassed. bpf_jit_limit contains the value of the global limit
-in bytes.
-
-dev_weight
-----------
-
-The maximum number of packets that the kernel can handle on a NAPI interrupt.
-It is a per-CPU variable. For drivers that support LRO or GRO_HW, a hardware
-aggregated packet is counted as one packet in this context.
-
-Default: 64
-
-dev_weight_rx_bias
-------------------
-
-RPS (e.g. RFS, aRFS) processing is competing with the registered NAPI poll function
-of the driver for the per softirq cycle netdev_budget. This parameter influences
-the proportion of the configured netdev_budget that is spent on RPS based packet
-processing during RX softirq cycles. It is further meant for making current
-dev_weight adaptable for asymmetric CPU needs on RX/TX side of the network stack.
-(see dev_weight_tx_bias) It is effective on a per CPU basis. Determination is based
-on dev_weight and is calculated multiplicatively (dev_weight * dev_weight_rx_bias).
-
-Default: 1
-
-dev_weight_tx_bias
-------------------
-
-Scales the maximum number of packets that can be processed during a TX softirq cycle.
-Effective on a per CPU basis. Allows scaling of current dev_weight for asymmetric
-net stack processing needs. Be careful to avoid making TX softirq processing a CPU hog.
-
-Calculation is based on dev_weight (dev_weight * dev_weight_tx_bias).
-
-Default: 1
-
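The multiplicative scaling described for dev_weight_rx_bias and dev_weight_tx_bias can be sketched as follows. This is an illustration only, assuming the three values have already been read from /proc/sys/net/core/; the function name `effective_weights` is hypothetical and not part of any kernel or library API.

```python
def effective_weights(dev_weight, rx_bias, tx_bias):
    """Return the (RX, TX) per-CPU packet quotas implied by the three sysctls."""
    for name, value in (("dev_weight", dev_weight),
                        ("dev_weight_rx_bias", rx_bias),
                        ("dev_weight_tx_bias", tx_bias)):
        if value < 0:
            raise ValueError(f"{name} must not be negative, got {value}")
    # Both quotas are plain products of dev_weight and the respective bias.
    return dev_weight * rx_bias, dev_weight * tx_bias


# With the documented defaults (64, 1, 1) both quotas equal dev_weight.
print(effective_weights(64, 1, 1))
```

Doubling only the RX bias would double the RX quota while leaving the TX quota untouched, which is the asymmetric tuning the two knobs are meant to allow.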
-default_qdisc
--------------
-
-The default queuing discipline to use for network devices. This allows
-overriding the default of pfifo_fast with an alternative. Because the default
-queuing discipline is created without additional parameters, it is best suited
-to queuing disciplines that work well without configuration, such as stochastic
-fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Don't use
-queuing disciplines like Hierarchical Token Bucket or Deficit Round Robin
-which require setting up classes and bandwidths. Note that physical multiqueue
-interfaces still use mq as root qdisc, which in turn uses this default for its
-leaves. Virtual devices (e.g. lo or veth) ignore this setting and instead
-default to noqueue.
-
-Default: pfifo_fast
-
-busy_read
----------
-
-Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL)
-Approximate time in us to busy loop waiting for packets on the device queue.
-This sets the default value of the SO_BUSY_POLL socket option.
-Can be set or overridden per socket by setting socket option SO_BUSY_POLL,
-which is the preferred method of enabling. If you need to enable the feature
-globally via sysctl, a value of 50 is recommended.
-
-Will increase power usage.
-
-Default: 0 (off)
-
-busy_poll
----------
-
-Low latency busy poll timeout for poll and select. (needs CONFIG_NET_RX_BUSY_POLL)
-Approximate time in us to busy loop waiting for events.
-The recommended value depends on the number of sockets you poll on:
-50 for several sockets, 100 for several hundred.
-For more than that you probably want to use epoll.
-Note that only sockets with SO_BUSY_POLL set will be busy polled,
-so you want to either selectively set SO_BUSY_POLL on those sockets or set
-the net.core.busy_read sysctl globally.
-
-Will increase power usage.
-
-Default: 0 (off)
-
-rmem_default
-------------
-
-The default setting of the socket receive buffer in bytes.
-
-rmem_max
---------
-
-The maximum receive socket buffer size in bytes.
-
-tstamp_allow_data
------------------
-Allow processes to receive tx timestamps looped together with the original
-packet contents. If disabled, transmit timestamp requests from unprivileged
-processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set.
-
-Default: 1 (on)
-
-
-wmem_default
-------------
-
-The default setting (in bytes) of the socket send buffer.
-
-wmem_max
---------
-
-The maximum send socket buffer size in bytes.
-
-message_burst and message_cost
-------------------------------
-
-These parameters are used to limit the warning messages written to the kernel
-log from the networking code. They enforce a rate limit to make a
-denial-of-service attack impossible. A higher message_cost factor results in
-fewer messages being written. message_burst controls when messages will
-be dropped. The default settings limit warning messages to one every five
-seconds.
-
-warnings
---------
-
-This sysctl is now unused.
-
-This was used to control console messages from the networking stack that
-occur because of problems on the network like duplicate address or bad
-checksums.
-
-These messages are now emitted at KERN_DEBUG and can generally be enabled
-and controlled by the dynamic_debug facility.
-
-netdev_budget
--------------
-
-Maximum number of packets taken from all interfaces in one polling cycle (NAPI
-poll). In one polling cycle interfaces which are registered to polling are
-probed in a round-robin manner. Also, a polling cycle may not exceed
-netdev_budget_usecs microseconds, even if netdev_budget has not been
-exhausted.
-
-netdev_budget_usecs
----------------------
-
-Maximum number of microseconds in one NAPI polling cycle. Polling
-will exit when either netdev_budget_usecs have elapsed during the
-poll cycle or the number of packets processed reaches netdev_budget.
-
-netdev_max_backlog
-------------------
-
-Maximum number of packets queued on the INPUT side when the interface
-receives packets faster than the kernel can process them.
-
-netdev_rss_key
---------------
-
-RSS (Receive Side Scaling) enabled drivers use a 40-byte host key that is
-randomly generated.
-Some user space programs might need to read its content even if drivers do not
-provide ethtool -x support yet.
-
-::
-
- myhost:~# cat /proc/sys/net/core/netdev_rss_key
- 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total)
-
-The file contains NUL bytes if no driver has ever called netdev_rss_key_fill().
-
-Note:
- /proc/sys/net/core/netdev_rss_key contains 52 bytes of key,
- but most drivers only use 40 bytes of it.
-
-::
-
- myhost:~# ethtool -x eth0
- RX flow hash indirection table for eth0 with 8 RX ring(s):
- 0: 0 1 2 3 4 5 6 7
- RSS hash key:
- 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89
-
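The relationship between the 52-byte sysctl value and the 40 bytes most drivers use can be sketched in Python. This is a hedged illustration: `parse_rss_key` and `driver_key` are hypothetical helper names, and the sample string is only the first eight bytes of the example output above, not a full key.

```python
def parse_rss_key(text):
    """Convert the colon-separated hex string printed by the sysctl into bytes."""
    return bytes(int(part, 16) for part in text.strip().split(":"))


def driver_key(key, used=40):
    """Return the leading part of the key that a driver would consume.

    40 is the size most drivers use according to the text above; a driver
    may use a different length, so treat the default as an assumption.
    """
    return key[:used]


# First eight bytes of the example key shown above (the rest is elided there).
sample = "84:50:f4:00:a8:15:d1:a7"
key = parse_rss_key(sample)
print(len(key), key.hex())
```

On a live system the input would come from reading /proc/sys/net/core/netdev_rss_key, and an all-zero result would indicate that no driver has filled the key yet.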
-netdev_tstamp_prequeue
-----------------------
-
-If set to 0, RX packet timestamps can be sampled after RPS processing, when
-the target CPU processes packets. This might delay timestamps somewhat, but
-allows the load to be distributed across several CPUs.
-
-If set to 1 (default), timestamps are sampled as soon as possible, before
-queueing.
-
-optmem_max
-----------
-
-Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence
-of struct cmsghdr structures with appended data.
-
-fb_tunnels_only_for_init_net
-----------------------------
-
-Controls whether fallback tunnels (like tunl0, gre0, gretap0, erspan0,
-sit0, ip6tnl0, ip6gre0) are automatically created when a new
-network namespace is created, if the corresponding tunnel is present
-in the initial network namespace.
-If set to 1, these devices are not automatically created, and
-user space is responsible for creating them if needed.
-
-Default: 0 (for compatibility reasons)
-
-devconf_inherit_init_net
-------------------------
-
-Controls if a new network namespace should inherit all current
-settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. By
-default, we keep the current behavior: for IPv4 we inherit all current
-settings from init_net and for IPv6 we reset all settings to default.
-
-If set to 1, both IPv4 and IPv6 settings are forced to inherit from
-current ones in init_net. If set to 2, both IPv4 and IPv6 settings are
-forced to reset to their default values.
-
-Default: 0 (for compatibility reasons)
-
-2. /proc/sys/net/unix - Parameters for Unix domain sockets
-----------------------------------------------------------
-
-There is only one file in this directory.
-unix_dgram_qlen limits the maximum number of datagrams queued in a Unix domain
-socket's buffer. It will not take effect unless the PF_UNIX flag is specified.
-
-
-3. /proc/sys/net/ipv4 - IPv4 settings
--------------------------------------
-Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for
-descriptions of these entries.
-
-
-4. Appletalk
-------------
-
-The /proc/sys/net/appletalk directory holds the Appletalk configuration data
-when Appletalk is loaded. The configurable parameters are:
-
-aarp-expiry-time
-----------------
-
-The amount of time we keep an ARP entry before expiring it. Used to age out
-old hosts.
-
-aarp-resolve-time
------------------
-
-The amount of time we will spend trying to resolve an Appletalk address.
-
-aarp-retransmit-limit
----------------------
-
-The number of times we will retransmit a query before giving up.
-
-aarp-tick-time
---------------
-
-Controls the rate at which expirations are checked.
-
-The directory /proc/net/appletalk holds the list of active Appletalk sockets
-on a machine.
-
-The fields indicate the DDP type, the local address (in network:node format),
-the remote address, the size of the transmit pending queue, the size of the
-received queue (bytes waiting for applications to read), the state and the uid
-owning the socket.
-
-/proc/net/atalk_iface lists all the interfaces configured for Appletalk. It
-shows the name of the interface, its Appletalk address, the network range on
-that address (or network number for phase 1 networks), and the status of the
-interface.
-
-/proc/net/atalk_route lists each known network route. It lists the target
-(network) that the route leads to, the router (may be directly connected), the
-route flags, and the device the route is using.
-
-
-5. IPX
-------
-
-The IPX protocol has no tunable values in /proc/sys/net.
-
-The IPX protocol does, however, provide /proc/net/ipx. This lists each IPX
-socket giving the local and remote addresses in Novell format (that is
-network:node:port). In accordance with the strange Novell tradition,
-everything but the port is in hex. Not_Connected is displayed for sockets that
-are not tied to a specific remote address. The Tx and Rx queue sizes indicate
-the number of bytes pending for transmission and reception. The state
-indicates the state the socket is in and the uid is the owning uid of the
-socket.
-
-The /proc/net/ipx_interface file lists all IPX interfaces. For each interface
-it gives the network number, the node number, and indicates if the network is
-the primary network. It also indicates which device it is bound to (or
-Internal for internal networks) and the Frame Type if appropriate. Linux
-supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for
-IPX.
-
-The /proc/net/ipx_route table holds a list of IPX routes. For each route it
-gives the destination network, the router node (or Directly) and the network
-address of the router (or Connected) for internal networks.
-
-6. TIPC
--------
-
-tipc_rmem
----------
-
-The TIPC protocol now has a tunable for the receive memory, similar to the
-tcp_rmem - i.e. a vector of 3 INTEGERs: (min, default, max)
-
-::
-
- # cat /proc/sys/net/tipc/tipc_rmem
- 4252725 34021800 68043600
- #
-
-The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values
-are scaled (shifted) versions of that same value. Note that the min value
-is not at this point in time used in any meaningful way, but the triplet is
-preserved in order to be consistent with things like tcp_rmem.
-
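The "scaled (shifted)" relationship between the three tipc_rmem values can be checked with a short Python sketch. The shift amounts used here (4 for min, 1 for default) are an assumption inferred from the sample output above, not taken from the TIPC source, and `scaled_triplet` is a hypothetical name.

```python
def scaled_triplet(max_bytes, min_shift=4, default_shift=1):
    """Derive (min, default, max) from max by right-shifting.

    The default shift amounts are inferred from the sample output in this
    document and may not match every kernel version.
    """
    return max_bytes >> min_shift, max_bytes >> default_shift, max_bytes


# Reproduces the sample line "4252725 34021800 68043600".
print(scaled_triplet(68043600))
```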
-named_timeout
--------------
-
-TIPC name table updates are distributed asynchronously in a cluster, without
-any form of transaction handling. This means that different race scenarios are
-possible. One such is that a name withdrawal sent out by one node and received
-by another node may arrive after a second, overlapping name publication already
-has been accepted from a third node, although the conflicting updates
-originally may have been issued in the correct sequential order.
-If named_timeout is nonzero, failed topology updates will be placed on a defer
-queue until another event arrives that clears the error, or until the timeout
-expires. Value is in milliseconds.
diff --git a/Documentation/sysctl/sunrpc.rst b/Documentation/sysctl/sunrpc.rst
deleted file mode 100644
index 09780a682afd..000000000000
--- a/Documentation/sysctl/sunrpc.rst
+++ /dev/null
@@ -1,25 +0,0 @@
-===================================
-Documentation for /proc/sys/sunrpc/
-===================================
-
-kernel version 2.2.10
-
-Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
-
-For general info and legal blurb, please look in index.rst.
-
-------------------------------------------------------------------------------
-
-This file contains the documentation for the sysctl files in
-/proc/sys/sunrpc and is valid for Linux kernel version 2.2.
-
-The files in this directory can be used to (re)set the debug
-flags of the SUN Remote Procedure Call (RPC) subsystem in
-the Linux kernel. This stuff is used for NFS, KNFSD and
-maybe a few other things as well.
-
-The files in there are used to control the debugging flags:
-rpc_debug, nfs_debug, nfsd_debug and nlm_debug.
-
-These flags are for kernel hackers only. You should read the
-source code in net/sunrpc/ for more information.
diff --git a/Documentation/sysctl/user.rst b/Documentation/sysctl/user.rst
deleted file mode 100644
index 650eaa03f15e..000000000000
--- a/Documentation/sysctl/user.rst
+++ /dev/null
@@ -1,78 +0,0 @@
-=================================
-Documentation for /proc/sys/user/
-=================================
-
-kernel version 4.9.0
-
-Copyright (c) 2016 Eric Biederman <ebiederm@xmission.com>
-
-------------------------------------------------------------------------------
-
-This file contains the documentation for the sysctl files in
-/proc/sys/user.
-
-The files in this directory can be used to override the default
-limits on the number of namespaces and other objects that have
-per-user, per-user-namespace limits.
-
-The primary purpose of these limits is to stop programs that
-malfunction and attempt to create a ridiculous number of objects,
-before the malfunction becomes a system wide problem. It is the
-intention that the defaults of these limits are set high enough that
-no program in normal operation should run into these limits.
-
-The creation of per-user, per-user-namespace objects is charged to
-the user in the user namespace who created the object and
-verified to be below the per-user limit in that user namespace.
-
-The creation of objects is also charged to all of the users
-who created the user namespaces in which the creation of the object
-happens (user namespaces can be nested) and verified to be below the
-per-user limits in the user namespaces of those users.
-
-This recursive counting of created objects ensures that creating a
-user namespace does not allow a user to escape their current limits.
-
-Currently, these files are in /proc/sys/user:
-
-max_cgroup_namespaces
-=====================
-
- The maximum number of cgroup namespaces that any user in the current
- user namespace may create.
-
-max_ipc_namespaces
-==================
-
- The maximum number of ipc namespaces that any user in the current
- user namespace may create.
-
-max_mnt_namespaces
-==================
-
- The maximum number of mount namespaces that any user in the current
- user namespace may create.
-
-max_net_namespaces
-==================
-
- The maximum number of network namespaces that any user in the
- current user namespace may create.
-
-max_pid_namespaces
-==================
-
- The maximum number of pid namespaces that any user in the current
- user namespace may create.
-
-max_user_namespaces
-===================
-
- The maximum number of user namespaces that any user in the current
- user namespace may create.
-
-max_uts_namespaces
-==================
-
- The maximum number of uts namespaces that any user in the current
- user namespace may create.
diff --git a/Documentation/sysctl/vm.rst b/Documentation/sysctl/vm.rst
deleted file mode 100644
index 5aceb5cd5ce7..000000000000
--- a/Documentation/sysctl/vm.rst
+++ /dev/null
@@ -1,964 +0,0 @@
-===============================
-Documentation for /proc/sys/vm/
-===============================
-
-kernel version 2.6.29
-
-Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
-
-Copyright (c) 2008 Peter W. Morreale <pmorreale@novell.com>
-
-For general info and legal blurb, please look in index.rst.
-
-------------------------------------------------------------------------------
-
-This file contains the documentation for the sysctl files in
-/proc/sys/vm and is valid for Linux kernel version 2.6.29.
-
-The files in this directory can be used to tune the operation
-of the virtual memory (VM) subsystem of the Linux kernel and
-the writeout of dirty data to disk.
-
-Default values and initialization routines for most of these
-files can be found in mm/swap.c.
-
-Currently, these files are in /proc/sys/vm:
-
-- admin_reserve_kbytes
-- block_dump
-- compact_memory
-- compact_unevictable_allowed
-- dirty_background_bytes
-- dirty_background_ratio
-- dirty_bytes
-- dirty_expire_centisecs
-- dirty_ratio
-- dirtytime_expire_seconds
-- dirty_writeback_centisecs
-- drop_caches
-- extfrag_threshold
-- hugetlb_shm_group
-- laptop_mode
-- legacy_va_layout
-- lowmem_reserve_ratio
-- max_map_count
-- memory_failure_early_kill
-- memory_failure_recovery
-- min_free_kbytes
-- min_slab_ratio
-- min_unmapped_ratio
-- mmap_min_addr
-- mmap_rnd_bits
-- mmap_rnd_compat_bits
-- nr_hugepages
-- nr_hugepages_mempolicy
-- nr_overcommit_hugepages
-- nr_trim_pages (only if CONFIG_MMU=n)
-- numa_zonelist_order
-- oom_dump_tasks
-- oom_kill_allocating_task
-- overcommit_kbytes
-- overcommit_memory
-- overcommit_ratio
-- page-cluster
-- panic_on_oom
-- percpu_pagelist_fraction
-- stat_interval
-- stat_refresh
-- numa_stat
-- swappiness
-- unprivileged_userfaultfd
-- user_reserve_kbytes
-- vfs_cache_pressure
-- watermark_boost_factor
-- watermark_scale_factor
-- zone_reclaim_mode
-
-
-admin_reserve_kbytes
-====================
-
-The amount of free memory in the system that should be reserved for users
-with the capability CAP_SYS_ADMIN.
-
-admin_reserve_kbytes defaults to min(3% of free pages, 8MB).
-
-That should provide enough for the admin to log in and kill a process,
-if necessary, under the default overcommit 'guess' mode.
-
-Systems running under overcommit 'never' should increase this to account
-for the full Virtual Memory Size of programs used to recover. Otherwise,
-root may not be able to log in to recover the system.
-
-How do you calculate a minimum useful reserve?
-
-sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
-
-For overcommit 'guess', we can sum resident set sizes (RSS).
-On x86_64 this is about 8MB.
-
-For overcommit 'never', we can take the max of their virtual sizes (VSZ)
-and add the sum of their RSS.
-On x86_64 this is about 128MB.
-
-Changing this takes effect whenever an application requests memory.
-
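The default and the two ways of estimating a minimum useful reserve can be sketched in Python. This is an illustration of the formulas as written above, not the kernel's exact arithmetic: the integer rounding is an assumption, the function names are hypothetical, and the per-program sizes in the example are made-up numbers.

```python
def default_admin_reserve_kbytes(free_kbytes):
    """min(3% of free memory, 8MB), expressed in kilobytes."""
    return min(free_kbytes * 3 // 100, 8 * 1024)


def reserve_for_guess(rss_kbytes):
    """Overcommit 'guess': sum the resident set sizes of the recovery tools."""
    return sum(rss_kbytes)


def reserve_for_never(vsz_kbytes, rss_kbytes):
    """Overcommit 'never': largest virtual size plus the sum of the RSS values."""
    return max(vsz_kbytes) + sum(rss_kbytes)


# Hypothetical sizes (in KB) for sshd, bash and top; measure your own with ps.
rss = [3000, 2500, 2000]
vsz = [70000, 20000, 15000]
print(default_admin_reserve_kbytes(16 * 1024 * 1024))
print(reserve_for_guess(rss), reserve_for_never(vsz, rss))
```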
-
-block_dump
-==========
-
-block_dump enables block I/O debugging when set to a nonzero value. More
-information on block I/O debugging is in Documentation/laptops/laptop-mode.rst.
-
-
-compact_memory
-==============
-
-Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
-all zones are compacted such that free memory is available in contiguous
-blocks where possible. This can be important for example in the allocation of
-huge pages although processes will also directly compact memory as required.
-
-
-compact_unevictable_allowed
-===========================
-
-Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
-allowed to examine the unevictable lru (mlocked pages) for pages to compact.
-This should be used on systems where stalls for minor page faults are an
-acceptable trade for large contiguous free memory. Set to 0 to prevent
-compaction from moving pages that are unevictable. Default value is 1.
-
-
-dirty_background_bytes
-======================
-
-Contains the amount of dirty memory at which the background kernel
-flusher threads will start writeback.
-
-Note:
- dirty_background_bytes is the counterpart of dirty_background_ratio. Only
- one of them may be specified at a time. When one sysctl is written it is
- immediately taken into account to evaluate the dirty memory limits and the
- other appears as 0 when read.
-
-
-dirty_background_ratio
-======================
-
-Contains, as a percentage of total available memory that contains free pages
-and reclaimable pages, the number of pages at which the background kernel
-flusher threads will start writing out dirty data.
-
-The total available memory is not equal to total system memory.
-
-
-dirty_bytes
-===========
-
-Contains the amount of dirty memory at which a process generating disk writes
-will itself start writeback.
-
-Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
-specified at a time. When one sysctl is written it is immediately taken into
-account to evaluate the dirty memory limits and the other appears as 0 when
-read.
-
-Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
-value lower than this limit will be ignored and the old configuration will be
-retained.
-
-
-dirty_expire_centisecs
-======================
-
-This tunable is used to define when dirty data is old enough to be eligible
-for writeout by the kernel flusher threads. It is expressed in hundredths
-of a second. Data which has been dirty in-memory for longer than this
-interval will be written out next time a flusher thread wakes up.
-
-
-dirty_ratio
-===========
-
-Contains, as a percentage of total available memory that contains free pages
-and reclaimable pages, the number of pages at which a process which is
-generating disk writes will itself start writing out dirty data.
-
-The total available memory is not equal to total system memory.
-
-
-dirtytime_expire_seconds
-========================
-
-When a lazytime inode is constantly having its pages dirtied, the inode with
-an updated timestamp will never get a chance to be written out. And, if the
-only thing that has happened on the file system is a dirtytime inode caused
-by an atime update, a worker will be scheduled to make sure that inode
-eventually gets pushed out to disk. This tunable is used to define when a dirty
-inode is old enough to be eligible for writeback by the kernel flusher threads.
-It is also used as the interval at which the dirtytime_writeback thread wakes up.
-
-
-dirty_writeback_centisecs
-=========================
-
-The kernel flusher threads will periodically wake up and write `old` data
-out to disk. This tunable expresses the interval between those wakeups, in
-hundredths of a second.
-
-Setting this to zero disables periodic writeback altogether.
-
-
-drop_caches
-===========
-
-Writing to this will cause the kernel to drop clean caches, as well as
-reclaimable slab objects like dentries and inodes. Once dropped, their
-memory becomes free.
-
-To free pagecache::
-
- echo 1 > /proc/sys/vm/drop_caches
-
-To free reclaimable slab objects (includes dentries and inodes)::
-
- echo 2 > /proc/sys/vm/drop_caches
-
-To free slab objects and pagecache::
-
- echo 3 > /proc/sys/vm/drop_caches
-
-This is a non-destructive operation and will not free any dirty objects.
-To increase the number of objects freed by this operation, the user may run
-`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the
-number of dirty objects on the system and create more candidates to be
-dropped.
-
-This file is not a means to control the growth of the various kernel caches
-(inodes, dentries, pagecache, etc.). These objects are automatically
-reclaimed by the kernel when memory is needed elsewhere on the system.
-
-Use of this file can cause performance problems. Since it discards cached
-objects, it may cost a significant amount of I/O and CPU to recreate the
-dropped objects, especially if they were under heavy use. Because of this,
-use outside of a testing or debugging environment is not recommended.
-
-You may see informational messages in your kernel log when this file is
-used::
-
- cat (1234): drop_caches: 3
-
-These are informational only. They do not mean that anything is wrong
-with your system. To disable them, echo 4 (bit 2) into drop_caches.
-
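The values written to drop_caches behave as a bit mask, which the following Python sketch decodes. The constant and function names are hypothetical labels chosen for this illustration; only the numeric values 1, 2 and 4 come from the text above.

```python
DROP_PAGECACHE = 1  # bit 0: free pagecache
DROP_SLAB = 2       # bit 1: free reclaimable slab objects (dentries, inodes)
DROP_QUIET = 4      # bit 2: disable the informational kernel log message


def describe_drop_caches(value):
    """List what a given value written to drop_caches asks the kernel to do."""
    actions = []
    if value & DROP_PAGECACHE:
        actions.append("free pagecache")
    if value & DROP_SLAB:
        actions.append("free reclaimable slab objects")
    if value & DROP_QUIET:
        actions.append("disable log message")
    return actions


for value in (1, 2, 3, 4):
    print(value, describe_drop_caches(value))
```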
-
-extfrag_threshold
-=================
-
-This parameter affects whether the kernel will compact memory or direct
-reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
-debugfs shows what the fragmentation index for each order is in each zone in
-the system. Values tending towards 0 imply allocations would fail due to lack
-of memory, values towards 1000 imply failures are due to fragmentation and -1
-implies that the allocation will succeed as long as watermarks are met.
-
-The kernel will not compact memory in a zone if the
-fragmentation index is <= extfrag_threshold. The default value is 500.
-
-
-highmem_is_dirtyable
-====================
-
-Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems).
-
-This parameter controls whether the high memory is considered for dirty
-writers throttling. This is not the case by default, which means that
-only the amount of memory directly visible/usable by the kernel can
-be dirtied. As a result, on systems with a large amount of memory and
-lowmem basically depleted, writers might be throttled too early and
-streaming writes can get very slow.
-
-Changing the value to non-zero would allow more memory to be dirtied
-and thus allow writers to write more data which can be flushed to the
-storage more effectively. Note this also comes with a risk of premature
-OOM killer invocation because some writers (e.g. direct block device writes)
-can only use the low memory and they can fill it up with dirty data without
-any throttling.
-
-
-hugetlb_shm_group
-=================
-
-hugetlb_shm_group contains the group id that is allowed to create SysV
-shared memory segments using hugetlb pages.
-
-
-laptop_mode
-===========
-
-laptop_mode is a knob that controls "laptop mode". All the things that are
-controlled by this knob are discussed in Documentation/laptops/laptop-mode.rst.
-
-
-legacy_va_layout
-================
-
-If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
-will use the legacy (2.4) layout for all processes.
-
-
-lowmem_reserve_ratio
-====================
-
-For some specialised workloads on highmem machines it is dangerous for
-the kernel to allow process memory to be allocated from the "lowmem"
-zone. This is because that memory could then be pinned via the mlock()
-system call, or by unavailability of swapspace.
-
-And on large highmem machines this lack of reclaimable lowmem memory
-can be fatal.
-
-So the Linux page allocator has a mechanism which prevents allocations
-which *could* use highmem from using too much lowmem. This means that
-a certain amount of lowmem is defended from the possibility of being
-captured into pinned user memory.
-
-(The same argument applies to the old 16 megabyte ISA DMA region. This
-mechanism will also defend that region from allocations which could use
-highmem or lowmem).
-
-The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is
-in defending these lower zones.
-
-If you have a machine which uses highmem or ISA DMA and your
-applications are using mlock(), or if you are running with no swap then
-you probably should change the lowmem_reserve_ratio setting.
-
-The lowmem_reserve_ratio is an array. You can see them by reading this file::
-
- % cat /proc/sys/vm/lowmem_reserve_ratio
- 256 256 32
-
-However, these values are not used directly. The kernel calculates the number
-of protection pages for each zone from them. These are shown as an array of
-protection pages in /proc/zoneinfo, as in the following example from an x86-64
-box. Each zone has an array of protection pages like this::
-
- Node 0, zone DMA
- pages free 1355
- min 3
- low 3
- high 4
- :
- :
- numa_other 0
- protection: (0, 2004, 2004, 2004)
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- pagesets
- cpu: 0 pcp: 0
- :
-
-These protections are added to the score used to judge whether this zone
-should be used for page allocation or should be reclaimed.
-
-In this example, if normal pages (index=2) are requested from this DMA zone and
-watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
-not be used because pages_free(1355) is smaller than watermark + protection[2]
-(4 + 2004 = 2008). If this protection value is 0, this zone would be used for
-a normal page request. If the request is for the DMA zone (index=0),
-protection[0] (=0) is used.
-
-zone[i]'s protection[j] is calculated by the following expression::
-
- (i < j):
- zone[i]->protection[j]
- = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
- / lowmem_reserve_ratio[i];
- (i = j):
- (should not be protected) = 0;
- (i > j):
- (not necessary, but looks 0)
-
-The default values of lowmem_reserve_ratio[i] are
-
- === ====================================
- 256 (if zone[i] means DMA or DMA32 zone)
- 32 (others)
- === ====================================
-
-As the expression above shows, these values are reciprocals of the ratio.
-256 means 1/256. The number of protection pages becomes about "0.39%" of total
-managed pages of higher zones on the node.
-
-If you would like to protect more pages, smaller values are effective.
-The minimum value is 1 (1/1 -> 100%). A value less than 1 completely
-disables protection of the pages.
-
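The protection expression and the zone check from the worked example can be sketched in Python. This is a simplified model under stated assumptions: the function names are hypothetical, the zone sizes in the usage example are made-up numbers, integer division stands in for the kernel's arithmetic, and the comparison follows the wording of the example above (a zone is skipped when free pages are smaller than watermark plus protection).

```python
def protection_table(managed_pages, ratios):
    """Build protection[i][j] for the zones of one node, lowest zone first.

    managed_pages[k] is the number of managed pages in zone k and ratios[i]
    is lowmem_reserve_ratio[i]. Entries with i >= j stay 0.
    """
    n = len(managed_pages)
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if ratios[i] < 1:
                continue  # a ratio below 1 disables protection for zone i
            higher = sum(managed_pages[i + 1:j + 1])  # zones i+1 .. j
            table[i][j] = higher // ratios[i]
    return table


def zone_usable(free_pages, watermark, protection):
    """Apply the check from the example: free pages must cover both terms."""
    return free_pages >= watermark + protection


# Hypothetical node with three zones (DMA, DMA32, Normal), sizes in pages.
print(protection_table([4000, 512000, 512000], [256, 256, 32]))
# The DMA zone from the example above: 1355 free, watermark 4, protection 2004.
print(zone_usable(1355, 4, 2004))
```

With a ratio of 256 the first row works out to 1/256 (about 0.39%) of the pages in the higher zones, matching the percentage quoted above.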
-
-max_map_count
-=============
-
-This file contains the maximum number of memory map areas a process
-may have. Memory map areas are used as a side-effect of calling
-malloc, directly by mmap, mprotect, and madvise, and also when loading
-shared libraries.
-
-While most applications need less than a thousand maps, certain
-programs, particularly malloc debuggers, may consume lots of them,
-e.g., up to one or two maps per allocation.
-
-The default value is 65536.
-
-
-memory_failure_early_kill
-=========================
-
-Control how to kill processes when an uncorrected memory error (typically
-a 2-bit error in a memory module) that cannot be handled by the kernel
-is detected in the background by hardware. In some cases (like the page
-still having a valid copy on disk) the kernel will handle the failure
-transparently without affecting any applications. But if there is
-no other up-to-date copy of the data, it will kill processes to prevent
-any data corruption from propagating.
-
-1: Kill all processes that have the corrupted and not reloadable page mapped
-as soon as the corruption is detected. Note this is not supported
-for a few types of pages, like kernel internally allocated data or
-the swap cache, but works for the majority of user pages.
-
-0: Only unmap the corrupted page from all processes and only kill a process
-that tries to access it.
-
-The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
-handle this if they want to.
-
-This is only active on architectures/platforms with advanced machine
-check handling and depends on the hardware capabilities.
-
-Applications can override this setting individually with the PR_MCE_KILL prctl.
-
-
-memory_failure_recovery
-=======================
-
-Enable memory failure recovery (when supported by the platform).
-
-1: Attempt recovery.
-
-0: Always panic on a memory failure.
-
-
-min_free_kbytes
-===============
-
-This is used to force the Linux VM to keep a minimum number
-of kilobytes free. The VM uses this number to compute a
-watermark[WMARK_MIN] value for each lowmem zone in the system.
-Each lowmem zone gets a number of reserved free pages based
-proportionally on its size.
-
-Some minimal amount of memory is needed to satisfy PF_MEMALLOC
-allocations; if you set this to lower than 1024KB, your system will
-become subtly broken, and prone to deadlock under high loads.
-
-Setting this too high will OOM your machine instantly.
-
-
-min_slab_ratio
-==============
-
-This is available only on NUMA kernels.
-
-A percentage of the total pages in each zone. On Zone reclaim
-(fallback from the local zone occurs) slabs will be reclaimed if more
-than this percentage of pages in a zone are reclaimable slab pages.
-This ensures that the slab growth stays under control even in NUMA
-systems that rarely perform global reclaim.
-
-The default is 5 percent.
-
-Note that slab reclaim is triggered in a per zone / node fashion.
-The process of reclaiming slab memory is currently not node specific
-and may not be fast.
-
-
-min_unmapped_ratio
-==================
-
-This is available only on NUMA kernels.
-
-This is a percentage of the total pages in each zone. Zone reclaim will
-only occur if more than this percentage of pages are in a state that
-zone_reclaim_mode allows to be reclaimed.
-
-If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
-against all file-backed unmapped pages including swapcache pages and tmpfs
-files. Otherwise, only unmapped pages backed by normal files but not tmpfs
-files and similar are considered.
-
-The default is 1 percent.
-
-
-mmap_min_addr
-=============
-
-This file indicates the amount of address space which a user process will
-be restricted from mmapping. Since kernel null dereference bugs could
-accidentally operate based on the information in the first couple of pages
-of memory, userspace processes should not be allowed to write to them. By
-default this value is set to 0 and no protections will be enforced by the
-security module. Setting this value to something like 64k will allow the
-vast majority of applications to work correctly and provide defense in depth
-against future potential kernel bugs.
-
-
-mmap_rnd_bits
-=============
-
-This value can be used to select the number of bits to use to
-determine the random offset to the base address of vma regions
-resulting from mmap allocations on architectures which support
-tuning address space randomization. This value will be bounded
-by the architecture's minimum and maximum supported values.
-
-This value can be changed after boot using the
-/proc/sys/vm/mmap_rnd_bits tunable.
-
-
-mmap_rnd_compat_bits
-====================
-
-This value can be used to select the number of bits to use to
-determine the random offset to the base address of vma regions
-resulting from mmap allocations for applications run in
-compatibility mode on architectures which support tuning address
-space randomization. This value will be bounded by the
-architecture's minimum and maximum supported values.
-
-This value can be changed after boot using the
-/proc/sys/vm/mmap_rnd_compat_bits tunable.
-
-
-nr_hugepages
-============
-
-Change the minimum size of the hugepage pool.
-
-See Documentation/admin-guide/mm/hugetlbpage.rst
-
-
-nr_hugepages_mempolicy
-======================
-
-Change the size of the hugepage pool at run-time on a specific
-set of NUMA nodes.
-
-See Documentation/admin-guide/mm/hugetlbpage.rst
-
-
-nr_overcommit_hugepages
-=======================
-
-Change the maximum size of the hugepage pool. The maximum is
-nr_hugepages + nr_overcommit_hugepages.
-
-See Documentation/admin-guide/mm/hugetlbpage.rst
-
-
-nr_trim_pages
-=============
-
-This is available only on NOMMU kernels.
-
-This value adjusts the excess page trimming behaviour of power-of-2 aligned
-NOMMU mmap allocations.
-
-A value of 0 disables trimming of allocations entirely, while a value of 1
-trims excess pages aggressively. Any value >= 1 acts as the watermark where
-trimming of allocations is initiated.
-
-The default value is 1.
-
-See Documentation/nommu-mmap.txt for more information.
-
-
-numa_zonelist_order
-===================
-
-This sysctl is only for NUMA and it is deprecated. Anything but
-Node order will fail!
-
-'where the memory is allocated from' is controlled by zonelists.
-
-(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simplicity;
-you may be able to read ZONE_DMA as ZONE_DMA32.)
-
-In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows::
-
- ZONE_NORMAL -> ZONE_DMA
-
-This means that a memory allocation request for GFP_KERNEL will
-get memory from ZONE_DMA only when ZONE_NORMAL is not available.
-
-In the NUMA case, there are two possible types of order.
-Assume a 2-node NUMA system; below is the zonelist for Node(0)'s GFP_KERNEL::
-
- (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
- (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA
-
-Type (A) offers the best locality for processes on Node(0), but ZONE_DMA
-will be used before ZONE_NORMAL is exhausted. This increases the possibility
-of out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.
-
-Type (B) cannot offer the best locality but is more robust against OOM of
-the DMA zone.
-
-Type (A) is called "Node" order. Type (B) is called "Zone" order.
-
-"Node order" orders the zonelists by node, then by zone within each node.
-Specify "[Nn]ode" for node order.
-
-"Zone Order" orders the zonelists by zone type, then by node within each
-zone. Specify "[Zz]one" for zone order.
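
The two orderings can be illustrated with a small Python sketch. This is not kernel code: the function name ``build_zonelist``, the dictionary layout and the rule that remote nodes follow in ascending node id are all assumptions made for illustration only.

```python
def build_zonelist(local_node, zones_by_node, zone_types, order):
    """Return (node, zone) pairs in fallback order for an allocation on local_node.

    zones_by_node maps a node id to the set of zones that node has.
    zone_types lists zone names from most preferred to least preferred.
    """
    # Local node first; remaining nodes in ascending id (illustrative assumption).
    nodes = [local_node] + sorted(n for n in zones_by_node if n != local_node)

    if order == "node":
        # Node order: by node, then by zone within each node.
        return [(n, z) for n in nodes for z in zone_types if z in zones_by_node[n]]
    if order == "zone":
        # Zone order: by zone type, then by node within each zone.
        return [(n, z) for z in zone_types for n in nodes if z in zones_by_node[n]]
    raise ValueError("order must be 'node' or 'zone'")


# The 2-node layout from the text: only Node 0 has a DMA zone.
layout = {0: {"ZONE_NORMAL", "ZONE_DMA"}, 1: {"ZONE_NORMAL"}}
preference = ["ZONE_NORMAL", "ZONE_DMA"]

for order in ("node", "zone"):
    print(order, build_zonelist(0, layout, preference, order))
```

With this layout, "node" reproduces sequence (A) and "zone" reproduces sequence (B).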
-
-Specify "[Dd]efault" to request automatic configuration.
-
-On 32-bit, the Normal zone needs to be preserved for allocations accessible
-by the kernel, so "zone" order will be selected.
-
-On 64-bit, devices that require DMA32/DMA are relatively rare, so "node"
-order will be selected.
-
-Default order is recommended unless this is causing problems for your
-system/application.
-
-
-oom_dump_tasks
-==============
-
-Enables a system-wide task dump (excluding kernel threads) to be produced
-when the kernel performs an OOM-killing and includes such information as
-pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
-score, and name. This is helpful to determine why the OOM killer was
-invoked, to identify the rogue task that caused it, and to determine why
-the OOM killer chose the task it did to kill.
-
-If this is set to zero, this information is suppressed. On very
-large systems with thousands of tasks it may not be feasible to dump
-the memory state information for each one. Such systems should not
-be forced to incur a performance penalty in OOM conditions when the
-information may not be desired.
-
-If this is set to non-zero, this information is shown whenever the
-OOM killer actually kills a memory-hogging task.
-
-The default value is 1 (enabled).
-
-
-oom_kill_allocating_task
-========================
-
-This enables or disables killing the OOM-triggering task in
-out-of-memory situations.
-
-If this is set to zero, the OOM killer will scan through the entire
-tasklist and select a task based on heuristics to kill. This normally
-selects a rogue memory-hogging task that frees up a large amount of
-memory when killed.
-
-If this is set to non-zero, the OOM killer simply kills the task that
-triggered the out-of-memory condition. This avoids the expensive
-tasklist scan.
-
-If panic_on_oom is selected, it takes precedence over whatever value
-is used in oom_kill_allocating_task.
-
-The default value is 0.
-
-
-overcommit_kbytes
-=================
-
-When overcommit_memory is set to 2, the committed address space is not
-permitted to exceed swap plus this amount of physical RAM. See below.
-
-Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
-of them may be specified at a time. Setting one disables the other (which
-then appears as 0 when read).
-
-
-overcommit_memory
-=================
-
-This value contains a flag that enables memory overcommitment.
-
-When this flag is 0, the kernel attempts to estimate the amount
-of free memory left when userspace requests more memory.
-
-When this flag is 1, the kernel pretends there is always enough
-memory until it actually runs out.
-
-When this flag is 2, the kernel uses a "never overcommit"
-policy that attempts to prevent any overcommit of memory.
-Note that user_reserve_kbytes affects this policy.
-
-This feature can be very useful because there are a lot of
-programs that malloc() huge amounts of memory "just-in-case"
-and don't use much of it.
-
-The default value is 0.
-
-See Documentation/vm/overcommit-accounting.rst and
-mm/util.c::__vm_enough_memory() for more information.
-
-
-overcommit_ratio
-================
-
-When overcommit_memory is set to 2, the committed address
-space is not permitted to exceed swap plus this percentage
-of physical RAM. See above.
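
As a rough illustration of the limit described for overcommit_ratio and overcommit_kbytes, the following Python sketch computes "swap plus the permitted share of RAM". The function name and parameters are hypothetical, and the calculation follows only the wording above; any further adjustments the kernel may apply are deliberately left out.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=0, overcommit_kbytes=0):
    """Approximate the commit limit (in kB) when overcommit_memory is 2.

    Exactly one of overcommit_ratio (a percentage) or overcommit_kbytes
    should be non-zero, mirroring the rule that setting one disables the other.
    """
    if overcommit_ratio and overcommit_kbytes:
        raise ValueError("only one of overcommit_ratio/overcommit_kbytes may be set")

    if overcommit_kbytes:
        allowed_ram_kb = overcommit_kbytes                  # absolute amount of RAM
    else:
        allowed_ram_kb = ram_kb * overcommit_ratio // 100   # percentage of RAM

    return swap_kb + allowed_ram_kb


# Example inputs chosen for illustration: 8 GB of RAM and 2 GB of swap, in kB.
print(commit_limit_kb(8_000_000, 2_000_000, overcommit_ratio=50))
print(commit_limit_kb(8_000_000, 2_000_000, overcommit_kbytes=1_000_000))
```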
-
-
-page-cluster
-============
-
-page-cluster controls the number of pages up to which consecutive pages
-are read in from swap in a single attempt. This is the swap counterpart
-to page cache readahead.
-Here, "consecutive" does not refer to virtual or physical addresses,
-but to position in swap space - that means the pages were swapped out together.
-
-It is a logarithmic value - setting it to zero means "1 page", setting
-it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
-Zero disables swap readahead completely.
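
The logarithmic mapping can be checked with a few lines of Python. This is only a sketch of the arithmetic described above; the helper name ``swap_readahead_pages`` is made up for illustration.

```python
def swap_readahead_pages(page_cluster):
    """Return the number of pages read from swap per attempt: 2 ** page_cluster."""
    if page_cluster < 0:
        raise ValueError("page-cluster must not be negative")
    return 1 << page_cluster


# Print the mapping for the first few settings, including the default of 3.
for value in range(4):
    print(f"page-cluster={value}: {swap_readahead_pages(value)} page(s)")
```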
-
-The default value is three (eight pages at a time). There may be some
-small benefits in tuning this to a different value if your workload is
-swap-intensive.
-
-Lower values mean lower latencies for initial faults, but at the same time
-extra faults and I/O delays for subsequent faults on pages that a larger
-readahead would already have brought in.
-
-
-panic_on_oom
-============
-
-This enables or disables the panic on out-of-memory feature.
-
-If this is set to 0, the kernel will kill some rogue process using the
-OOM killer (oom_killer). Usually, oom_killer can kill rogue processes and
-the system will survive.
-
-If this is set to 1, the kernel panics when out-of-memory happens.
-However, if a process limits its node usage with mempolicy/cpusets
-and those nodes run out of memory, one process may be killed by the
-oom-killer. No panic occurs in this case, because other nodes' memory
-may still be free, so the system as a whole may not be in a fatal
-state yet.
-
-If this is set to 2, the kernel always panics, even in the case
-described above. Even if the OOM happens under a memory cgroup, the
-whole system panics.
-
-The default value is 0.
-
-Values 1 and 2 are intended for cluster failover. Select either one
-according to your failover policy.
-
-panic_on_oom=2 combined with kdump gives you a very strong tool to
-investigate why an OOM happens, since you can capture a snapshot.
-
-
-percpu_pagelist_fraction
-========================
-
-This is the fraction of pages in each zone that may at most (the high mark,
-pcp->high) be allocated to each per cpu page list. The minimum value is 8,
-which means that no more than 1/8th of the pages in each zone can be
-allocated to any single per_cpu_pagelist. This entry only changes the value
-of hot per cpu pagelists. A user can specify a number like 100 to allocate
-1/100th of each zone to each per cpu page list.
-
-The batch value of each per cpu pagelist is also updated as a result. It is
-set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8).
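
The relationship between the fraction, pcp->high and the batch value can be sketched as follows. This is an illustration of the arithmetic stated above, not kernel code: the function name is hypothetical, and the default ``page_shift=12`` assumes 4 KiB pages.

```python
def pcp_limits(zone_pages, fraction, page_shift=12):
    """Return (high, batch) for a per cpu page list, following the text above.

    page_shift=12 is an assumption corresponding to 4 KiB pages.
    """
    if fraction < 8:
        raise ValueError("the minimum value for percpu_pagelist_fraction is 8")

    high = zone_pages // fraction            # at most 1/fraction of the zone
    batch = min(high // 4, page_shift * 8)   # high/4, capped at PAGE_SHIFT * 8
    return high, batch


# A zone of one million pages with a fraction of 100 (illustrative numbers).
print(pcp_limits(1_000_000, 100))
```

For large zones the batch value is dominated by the PAGE_SHIFT * 8 cap rather than by high/4.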
-
-The initial value is zero. The kernel does not use this value at boot time to set
-the high water marks for each per cpu page list. If the user writes '0' to this
-sysctl, it will revert to this default behavior.
-
-
-stat_interval
-=============
-
-The time interval between which vm statistics are updated. The default
-is 1 second.
-
-
-stat_refresh
-============
-
-Any read or write (by root only) flushes all the per-cpu vm statistics
-into their global totals, for more accurate reports when testing
-e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
-
-As a side-effect, it also checks for negative totals (elsewhere reported
-as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
-(At time of writing, a few stats are known sometimes to be found negative,
-with no ill effects: errors and warnings on these stats are suppressed.)
-
-
-numa_stat
-=========
-
-This interface allows runtime configuration of numa statistics.
-
-When page allocation performance becomes a bottleneck and you can tolerate
-some possible tool breakage and decreased numa counter precision, you can
-do::
-
- echo 0 > /proc/sys/vm/numa_stat
-
-When page allocation performance is not a bottleneck and you want all
-tooling to work, you can do::
-
- echo 1 > /proc/sys/vm/numa_stat
-
-
-swappiness
-==========
-
-This control is used to define how aggressively the kernel will swap
-memory pages. Higher values will increase aggressiveness, lower values
-decrease the amount of swap. A value of 0 instructs the kernel not to
-initiate swap until the amount of free and file-backed pages is less
-than the high water mark in a zone.
-
-The default value is 60.
-
-
-unprivileged_userfaultfd
-========================
-
-This flag controls whether unprivileged users can use the userfaultfd
-system calls. Set this to 1 to allow unprivileged users to use the
-userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
-privileged users (with CAP_SYS_PTRACE capability).
-
-The default value is 1.
-
-
-user_reserve_kbytes
-===================
-
-When overcommit_memory is set to 2, "never overcommit" mode, reserve
-min(3% of current process size, user_reserve_kbytes) of free memory.
-This is intended to prevent a user from starting a single memory hogging
-process, such that they cannot recover (kill the hog).
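
A minimal sketch of the reserve calculation, following the min() expression above. The function name and the use of kB for the process size are assumptions for illustration; the kernel's internal units and rounding may differ.

```python
def user_reserve_kb(process_size_kb, user_reserve_kbytes):
    """Return min(3% of the current process size, user_reserve_kbytes) in kB."""
    three_percent_kb = process_size_kb * 3 // 100
    return min(three_percent_kb, user_reserve_kbytes)


# 128 MB expressed in kB, used here as the sysctl setting.
setting_kb = 128 * 1024

print(user_reserve_kb(1_000_000, setting_kb))    # small process: the 3% term is lower
print(user_reserve_kb(10_000_000, setting_kb))   # large process: the sysctl caps it
```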
-
-user_reserve_kbytes defaults to min(3% of the current process size, 128MB).
-
-If this is reduced to zero, then the user will be allowed to allocate
-all free memory with a single process, minus admin_reserve_kbytes.
-Any subsequent attempts to execute a command will result in
-"fork: Cannot allocate memory".
-
-Changing this takes effect whenever an application requests memory.
-
-
-vfs_cache_pressure
-==================
-
-This percentage value controls the tendency of the kernel to reclaim
-the memory which is used for caching of directory and inode objects.
-
-At the default value of vfs_cache_pressure=100 the kernel will attempt to
-reclaim dentries and inodes at a "fair" rate with respect to pagecache and
-swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
-to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
-never reclaim dentries and inodes due to memory pressure and this can easily
-lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
-causes the kernel to prefer to reclaim dentries and inodes.
-
-Increasing vfs_cache_pressure significantly beyond 100 may have negative
-performance impact. Reclaim code needs to take various locks to find freeable
-directory and inode objects. With vfs_cache_pressure=1000, it will look for
-ten times more freeable objects than there are.
-
-
-watermark_boost_factor
-======================
-
-This factor controls the level of reclaim when memory is being fragmented.
-It defines the percentage of the high watermark of a zone that will be
-reclaimed if pages of different mobility are being mixed within pageblocks.
-The intent is that compaction has less work to do in the future and to
-increase the success rate of future high-order allocations such as SLUB
-allocations, THP and hugetlbfs pages.
-
-To make it sensible with respect to the watermark_scale_factor
-parameter, the unit is in fractions of 10,000. The default value of
-15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
-watermark will be reclaimed in the event of a pageblock being mixed due
-to fragmentation. The level of reclaim is determined by the number of
-fragmentation events that occurred in the recent past. If this value is
-smaller than a pageblock then a pageblock's worth of pages will be reclaimed
-(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
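
The upper bound described above can be approximated with the sketch below. It only models the stated rules (fractions of 10,000, a floor of one pageblock, 0 disables the feature); the function name is hypothetical and ``pageblock_pages=512`` assumes 2MB pageblocks with 4 KiB pages.

```python
def max_boost_pages(high_wmark_pages, boost_factor, pageblock_pages=512):
    """Return the maximum number of pages reclaimed due to watermark boosting.

    pageblock_pages=512 is an assumption (2MB pageblock, 4 KiB pages).
    """
    if boost_factor == 0:
        return 0  # a boost factor of 0 disables the feature

    boost = high_wmark_pages * boost_factor // 10_000
    # Never reclaim less than one pageblock's worth of pages.
    return max(boost, pageblock_pages)


print(max_boost_pages(10_000, 15_000))  # 150% of the high watermark
print(max_boost_pages(100, 15_000))     # below one pageblock, so rounded up
```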
-
-
-watermark_scale_factor
-======================
-
-This factor controls the aggressiveness of kswapd. It defines the
-amount of memory left in a node/system before kswapd is woken up and
-how much memory needs to be free before kswapd goes back to sleep.
-
-The unit is in fractions of 10,000. The default value of 10 means the
-distances between watermarks are 0.1% of the available memory in the
-node/system. The maximum value is 1000, or 10% of memory.
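
To make the "fractions of 10,000" unit concrete, the following sketch converts a scale factor into the distance between watermarks. The function name is made up for illustration, and rejecting negative values is an assumption; only the upper bound of 1000 comes from the text.

```python
def watermark_gap_kb(available_kb, scale_factor):
    """Return the distance between watermarks in kB for a given scale factor."""
    if scale_factor < 0 or scale_factor > 1000:
        raise ValueError("watermark_scale_factor must be between 0 and 1000")
    # The unit is fractions of 10,000, so 10 means 0.1% and 1000 means 10%.
    return available_kb * scale_factor // 10_000


# 16 GB of available memory (in kB, illustrative) with the default and maximum values.
print(watermark_gap_kb(16_000_000, 10))
print(watermark_gap_kb(16_000_000, 1000))
```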
-
-A high rate of threads entering direct reclaim (allocstall) or kswapd
-going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
-that the number of free pages kswapd maintains for latency reasons is
-too small for the allocation bursts occurring in the system. This knob
-can then be used to tune kswapd aggressiveness accordingly.
-
-
-zone_reclaim_mode
-=================
-
-Zone_reclaim_mode allows someone to set more or less aggressive approaches to
-reclaim memory when a zone runs out of memory. If it is set to zero then no
-zone reclaim occurs. Allocations will be satisfied from other zones / nodes
-in the system.
-
-This value is the bitwise OR of:
-
-= ===================================
-1 Zone reclaim on
-2 Zone reclaim writes dirty pages out
-4 Zone reclaim swaps pages
-= ===================================
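
Since the value is a bitmask, it can be decoded as in the sketch below. The flag names in this example are invented for readability and are not the kernel's identifiers; only the numeric values 1, 2 and 4 come from the table above.

```python
from enum import IntFlag


class ZoneReclaim(IntFlag):
    """Bits of zone_reclaim_mode (names are illustrative, values from the table)."""
    RECLAIM_ON = 1
    WRITE_DIRTY_PAGES = 2
    SWAP_PAGES = 4


def describe_zone_reclaim_mode(mode):
    """Return the names of all bits set in mode; an empty list means disabled."""
    return [flag.name for flag in ZoneReclaim if mode & flag]


for mode in (0, 1, 3, 5, 7):
    print(mode, describe_zone_reclaim_mode(mode))
```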
-
-zone_reclaim_mode is disabled by default. For file servers or workloads
-that benefit from having their data cached, zone_reclaim_mode should be
-left disabled as the caching effect is likely to be more important than
-data locality.
-
-zone_reclaim may be enabled if it's known that the workload is partitioned
-such that each partition fits within a NUMA node and that accessing remote
-memory would cause a measurable performance reduction. The page allocator
-will then reclaim easily reusable pages (those page cache pages that are
-currently not used) before allocating off node pages.
-
-Allowing zone reclaim to write out pages stops processes that are
-writing large amounts of data from dirtying pages on other nodes. Zone
-reclaim will write out dirty pages if a zone fills up and so effectively
-throttle the process. This may decrease the performance of a single process
-since it cannot use all of system memory to buffer the outgoing writes
-anymore but it preserves the memory on other nodes so that the performance
-of other processes running on other nodes will not be affected.
-
-Allowing regular swap effectively restricts allocations to the local
-node unless explicitly overridden by memory policies or cpuset
-configurations.