Diffstat (limited to 'Documentation/sysctl')
-rw-r--r-- | Documentation/sysctl/README | 76
-rw-r--r-- | Documentation/sysctl/abi.txt | 54
-rw-r--r-- | Documentation/sysctl/fs.txt | 372
-rw-r--r-- | Documentation/sysctl/kernel.txt | 1139
-rw-r--r-- | Documentation/sysctl/net.txt | 407
-rw-r--r-- | Documentation/sysctl/sunrpc.txt | 20
-rw-r--r-- | Documentation/sysctl/user.txt | 66
-rw-r--r-- | Documentation/sysctl/vm.txt | 934
8 files changed, 0 insertions, 3068 deletions
diff --git a/Documentation/sysctl/README b/Documentation/sysctl/README deleted file mode 100644 index d5f24ab0ecc3..000000000000 --- a/Documentation/sysctl/README +++ /dev/null @@ -1,76 +0,0 @@ -Documentation for /proc/sys/ kernel version 2.2.10 - (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> - -'Why', I hear you ask, 'would anyone even _want_ documentation -for them sysctl files? If anybody really needs it, it's all in -the source...' - -Well, this documentation is written because some people either -don't know they need to tweak something, or because they don't -have the time or knowledge to read the source code. - -Furthermore, the programmers who built sysctl have built it to -be actually used, not just for the fun of programming it :-) - -============================================================== - -Legal blurb: - -As usual, there are two main things to consider: -1. you get what you pay for -2. it's free - -The consequences are that I won't guarantee the correctness of -this document, and if you come to me complaining about how you -screwed up your system because of wrong documentation, I won't -feel sorry for you. I might even laugh at you... - -But of course, if you _do_ manage to screw up your system using -only the sysctl options used in this file, I'd like to hear of -it. Not only to have a great laugh, but also to make sure that -you're the last RTFMing person to screw up. - -In short, e-mail your suggestions, corrections and / or horror -stories to: <riel@nl.linux.org> - -Rik van Riel. - -============================================================== - -Introduction: - -Sysctl is a means of configuring certain aspects of the kernel -at run-time, and the /proc/sys/ directory is there so that you -don't even need special tools to do it! 
-In fact, there are only four things needed to use these config -facilities: -- a running Linux system -- root access -- common sense (this is especially hard to come by these days) -- knowledge of what all those values mean - -As a quick 'ls /proc/sys' will show, the directory consists of -several (arch-dependent?) subdirs. Each subdir is mainly about -one part of the kernel, so you can do configuration on a piece -by piece basis, or just some 'thematic frobbing'. - -The subdirs are about: -abi/ execution domains & personalities -debug/ <empty> -dev/ device specific information (eg dev/cdrom/info) -fs/ specific filesystems - filehandle, inode, dentry and quota tuning - binfmt_misc <Documentation/admin-guide/binfmt-misc.rst> -kernel/ global kernel info / tuning - miscellaneous stuff -net/ networking stuff, for documentation look in: - <Documentation/networking/> -proc/ <empty> -sunrpc/ SUN Remote Procedure Call (NFS) -vm/ memory management tuning - buffer and cache management -user/ Per user per user namespace limits - -These are the subdirs I have on my system. There might be more -or other subdirs in another setup. If you see another dir, I'd -really like to hear about it :-) diff --git a/Documentation/sysctl/abi.txt b/Documentation/sysctl/abi.txt deleted file mode 100644 index 63f4ebcf652c..000000000000 --- a/Documentation/sysctl/abi.txt +++ /dev/null @@ -1,54 +0,0 @@ -Documentation for /proc/sys/abi/* kernel version 2.6.0.test2 - (c) 2003, Fabian Frederick <ffrederick@users.sourceforge.net> - -For general info : README. - -============================================================== - -This path is binary emulation relevant aka personality types aka abi. -When a process is executed, it's linked to an exec_domain whose -personality is defined using values available from /proc/sys/abi. -You can find further details about abi in include/linux/personality.h. 
- -Here are the files featuring in 2.6 kernel : - -- defhandler_coff -- defhandler_elf -- defhandler_lcall7 -- defhandler_libcso -- fake_utsname -- trace - -=========================================================== -defhandler_coff: -defined value : -PER_SCOSVR3 -0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE - -=========================================================== -defhandler_elf: -defined value : -PER_LINUX -0 - -=========================================================== -defhandler_lcall7: -defined value : -PER_SVR4 -0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, - -=========================================================== -defhandler_libsco: -defined value: -PER_SVR4 -0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, - -=========================================================== -fake_utsname: -Unused - -=========================================================== -trace: -Unused - -=========================================================== diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt deleted file mode 100644 index 58649bd4fcfc..000000000000 --- a/Documentation/sysctl/fs.txt +++ /dev/null @@ -1,372 +0,0 @@ -Documentation for /proc/sys/fs/* kernel version 2.2.10 - (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> - (c) 2009, Shen Feng<shen@cn.fujitsu.com> - -For general info and legal blurb, please look in README. - -============================================================== - -This file contains documentation for the sysctl files in -/proc/sys/fs/ and is valid for Linux kernel version 2.2. - -The files in this directory can be used to tune and monitor -miscellaneous and general things in the operation of the Linux -kernel. Since some of the files _can_ be used to screw up your -system, it is advisable to read both documentation and source -before actually making adjustments. - -1. 
/proc/sys/fs ----------------------------------------------------------- - -Currently, these files are in /proc/sys/fs: -- aio-max-nr -- aio-nr -- dentry-state -- dquot-max -- dquot-nr -- file-max -- file-nr -- inode-max -- inode-nr -- inode-state -- nr_open -- overflowuid -- overflowgid -- pipe-user-pages-hard -- pipe-user-pages-soft -- protected_fifos -- protected_hardlinks -- protected_regular -- protected_symlinks -- suid_dumpable -- super-max -- super-nr - -============================================================== - -aio-nr & aio-max-nr: - -aio-nr is the running total of the number of events specified on the -io_setup system call for all currently active aio contexts. If aio-nr -reaches aio-max-nr then io_setup will fail with EAGAIN. Note that -raising aio-max-nr does not result in the pre-allocation or re-sizing -of any kernel data structures. - -============================================================== - -dentry-state: - -From linux/include/linux/dcache.h: --------------------------------------------------------------- -struct dentry_stat_t dentry_stat { - int nr_dentry; - int nr_unused; - int age_limit; /* age in seconds */ - int want_pages; /* pages requested by system */ - int nr_negative; /* # of unused negative dentries */ - int dummy; /* Reserved for future use */ -}; --------------------------------------------------------------- - -Dentries are dynamically allocated and deallocated. - -nr_dentry shows the total number of dentries allocated (active -+ unused). nr_unused shows the number of dentries that are not -actively used, but are saved in the LRU list for future reuse. - -Age_limit is the age in seconds after which dcache entries -can be reclaimed when memory is short and want_pages is -nonzero when shrink_dcache_pages() has been called and the -dcache isn't pruned yet. - -nr_negative shows the number of unused dentries that are also -negative dentries which do not mapped to actual files. 
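As an illustration of the dentry-state description above, the following is a minimal sketch of how the six numbers in that file could be mapped onto the field names of struct dentry_stat_t. It assumes the file holds six whitespace-separated integers in struct order; the function name, the derived nr_active key, and the sample values are hypothetical and not part of the removed documentation.

```python
from typing import Dict

# Field order follows struct dentry_stat_t as quoted in the text above.
DENTRY_STATE_FIELDS = (
    "nr_dentry",
    "nr_unused",
    "age_limit",
    "want_pages",
    "nr_negative",
    "dummy",
)


def parse_dentry_state(text: str) -> Dict[str, int]:
    """Map the contents of /proc/sys/fs/dentry-state onto named fields.

    Assumption: the file contains exactly six whitespace-separated integers.
    """
    values = [int(token) for token in text.split()]
    if len(values) != len(DENTRY_STATE_FIELDS):
        raise ValueError(
            f"expected {len(DENTRY_STATE_FIELDS)} fields, got {len(values)}"
        )
    state = dict(zip(DENTRY_STATE_FIELDS, values))
    # nr_dentry counts active + unused dentries, so the active count is the
    # difference. "nr_active" is a name invented for this sketch.
    state["nr_active"] = state["nr_dentry"] - state["nr_unused"]
    return state


if __name__ == "__main__":
    # Made-up sample; on a live system one would read the real file instead.
    sample = "81876\t62120\t45\t0\t1200\t0"
    for key, value in parse_dentry_state(sample).items():
        print(f"{key:12} {value}")
```

On a running system the same function could be fed the text read from /proc/sys/fs/dentry-state; reading that file does not require root.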
- -============================================================== - -dquot-max & dquot-nr: - -The file dquot-max shows the maximum number of cached disk -quota entries. - -The file dquot-nr shows the number of allocated disk quota -entries and the number of free disk quota entries. - -If the number of free cached disk quotas is very low and -you have some awesome number of simultaneous system users, -you might want to raise the limit. - -============================================================== - -file-max & file-nr: - -The value in file-max denotes the maximum number of file- -handles that the Linux kernel will allocate. When you get lots -of error messages about running out of file handles, you might -want to increase this limit. - -Historically,the kernel was able to allocate file handles -dynamically, but not to free them again. The three values in -file-nr denote the number of allocated file handles, the number -of allocated but unused file handles, and the maximum number of -file handles. Linux 2.6 always reports 0 as the number of free -file handles -- this is not an error, it just means that the -number of allocated file handles exactly matches the number of -used file handles. - -Attempts to allocate more file descriptors than file-max are -reported with printk, look for "VFS: file-max limit <number> -reached". -============================================================== - -nr_open: - -This denotes the maximum number of file-handles a process can -allocate. Default value is 1024*1024 (1048576) which should be -enough for most machines. Actual limit depends on RLIMIT_NOFILE -resource limit. - -============================================================== - -inode-max, inode-nr & inode-state: - -As with file handles, the kernel allocates the inode structures -dynamically, but can't free them yet. - -The value in inode-max denotes the maximum number of inode -handlers. 
This value should be 3-4 times larger than the value -in file-max, since stdin, stdout and network sockets also -need an inode struct to handle them. When you regularly run -out of inodes, you need to increase this value. - -The file inode-nr contains the first two items from -inode-state, so we'll skip to that file... - -Inode-state contains three actual numbers and four dummies. -The actual numbers are, in order of appearance, nr_inodes, -nr_free_inodes and preshrink. - -Nr_inodes stands for the number of inodes the system has -allocated, this can be slightly more than inode-max because -Linux allocates them one pageful at a time. - -Nr_free_inodes represents the number of free inodes (?) and -preshrink is nonzero when the nr_inodes > inode-max and the -system needs to prune the inode list instead of allocating -more. - -============================================================== - -overflowgid & overflowuid: - -Some filesystems only support 16-bit UIDs and GIDs, although in Linux -UIDs and GIDs are 32 bits. When one of these filesystems is mounted -with writes enabled, any UID or GID that would exceed 65535 is translated -to a fixed value before being written to disk. - -These sysctls allow you to change the value of the fixed UID and GID. -The default is 65534. - -============================================================== - -pipe-user-pages-hard: - -Maximum total number of pages a non-privileged user may allocate for pipes. -Once this limit is reached, no new pipes may be allocated until usage goes -below the limit again. When set to 0, no limit is applied, which is the default -setting. - -============================================================== - -pipe-user-pages-soft: - -Maximum total number of pages a non-privileged user may allocate for pipes -before the pipe size gets limited to a single page. 
Once this limit is reached, -new pipes will be limited to a single page in size for this user in order to -limit total memory usage, and trying to increase them using fcntl() will be -denied until usage goes below the limit again. The default value allows to -allocate up to 1024 pipes at their default size. When set to 0, no limit is -applied. - -============================================================== - -protected_fifos: - -The intent of this protection is to avoid unintentional writes to -an attacker-controlled FIFO, where a program expected to create a regular -file. - -When set to "0", writing to FIFOs is unrestricted. - -When set to "1" don't allow O_CREAT open on FIFOs that we don't own -in world writable sticky directories, unless they are owned by the -owner of the directory. - -When set to "2" it also applies to group writable sticky directories. - -This protection is based on the restrictions in Openwall. - -============================================================== - -protected_hardlinks: - -A long-standing class of security issues is the hardlink-based -time-of-check-time-of-use race, most commonly seen in world-writable -directories like /tmp. The common method of exploitation of this flaw -is to cross privilege boundaries when following a given hardlink (i.e. a -root process follows a hardlink created by another user). Additionally, -on systems without separated partitions, this stops unauthorized users -from "pinning" vulnerable setuid/setgid files against being upgraded by -the administrator, or linking to special files. - -When set to "0", hardlink creation behavior is unrestricted. - -When set to "1" hardlinks cannot be created by users if they do not -already own the source file, or do not have read/write access to it. - -This protection is based on the restrictions in Openwall and grsecurity. 
- -============================================================== - -protected_regular: - -This protection is similar to protected_fifos, but it -avoids writes to an attacker-controlled regular file, where a program -expected to create one. - -When set to "0", writing to regular files is unrestricted. - -When set to "1" don't allow O_CREAT open on regular files that we -don't own in world writable sticky directories, unless they are -owned by the owner of the directory. - -When set to "2" it also applies to group writable sticky directories. - -============================================================== - -protected_symlinks: - -A long-standing class of security issues is the symlink-based -time-of-check-time-of-use race, most commonly seen in world-writable -directories like /tmp. The common method of exploitation of this flaw -is to cross privilege boundaries when following a given symlink (i.e. a -root process follows a symlink belonging to another user). For a likely -incomplete list of hundreds of examples across the years, please see: -http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp - -When set to "0", symlink following behavior is unrestricted. - -When set to "1" symlinks are permitted to be followed only when outside -a sticky world-writable directory, or when the uid of the symlink and -follower match, or when the directory owner matches the symlink's owner. - -This protection is based on the restrictions in Openwall and grsecurity. - -============================================================== - -suid_dumpable: - -This value can be used to query and set the core dump mode for setuid -or otherwise protected/tainted binaries. The modes are - -0 - (default) - traditional behaviour. Any process which has changed - privilege levels or is execute only will not be dumped. -1 - (debug) - all processes dump core when possible. The core dump is - owned by the current user and no security is applied. 
This is - intended for system debugging situations only. Ptrace is unchecked. - This is insecure as it allows regular users to examine the memory - contents of privileged processes. -2 - (suidsafe) - any binary which normally would not be dumped is dumped - anyway, but only if the "core_pattern" kernel sysctl is set to - either a pipe handler or a fully qualified path. (For more details - on this limitation, see CVE-2006-2451.) This mode is appropriate - when administrators are attempting to debug problems in a normal - environment, and either have a core dump pipe handler that knows - to treat privileged core dumps with care, or specific directory - defined for catching core dumps. If a core dump happens without - a pipe handler or fully qualifid path, a message will be emitted - to syslog warning about the lack of a correct setting. - -============================================================== - -super-max & super-nr: - -These numbers control the maximum number of superblocks, and -thus the maximum number of mounted filesystems the kernel -can have. You only need to increase super-max if you need to -mount more filesystems than the current value in super-max -allows you to. - -============================================================== - -aio-nr & aio-max-nr: - -aio-nr shows the current system-wide number of asynchronous io -requests. aio-max-nr allows you to change the maximum value -aio-nr can grow to. - -============================================================== - -mount-max: - -This denotes the maximum number of mounts that may exist -in a mount namespace. - -============================================================== - - -2. /proc/sys/fs/binfmt_misc ----------------------------------------------------------- - -Documentation for the files in /proc/sys/fs/binfmt_misc is -in Documentation/admin-guide/binfmt-misc.rst. - - -3. 
/proc/sys/fs/mqueue - POSIX message queues filesystem ----------------------------------------------------------- - -The "mqueue" filesystem provides the necessary kernel features to enable the -creation of a user space library that implements the POSIX message queues -API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System -Interfaces specification.) - -The "mqueue" filesystem contains values for determining/setting the amount of -resources used by the file system. - -/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the -maximum number of message queues allowed on the system. - -/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the -maximum number of messages in a queue value. In fact it is the limiting value -for another (user) limit which is set in mq_open invocation. This attribute of -a queue must be less or equal then msg_max. - -/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the -maximum message size value (it is every message queue's attribute set during -its creation). - -/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the -default number of messages in a queue value if attr parameter of mq_open(2) is -NULL. If it exceed msg_max, the default value is initialized msg_max. - -/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting -the default message size value if attr parameter of mq_open(2) is NULL. If it -exceed msgsize_max, the default value is initialized msgsize_max. - -4. /proc/sys/fs/epoll - Configuration options for the epoll interface --------------------------------------------------------- - -This directory contains configuration options for the epoll(7) interface. - -max_user_watches ----------------- - -Every epoll file descriptor can store a number of files to be monitored -for event readiness. Each one of these monitored files constitutes a "watch". 
-This configuration option sets the maximum number of "watches" that are -allowed for each user. -Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes -on a 64bit one. -The current default value for max_user_watches is the 1/32 of the available -low memory, divided for the "watch" cost in bytes. - diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt deleted file mode 100644 index c0527d8a468a..000000000000 --- a/Documentation/sysctl/kernel.txt +++ /dev/null @@ -1,1139 +0,0 @@ -Documentation for /proc/sys/kernel/* kernel version 2.2.10 - (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> - (c) 2009, Shen Feng<shen@cn.fujitsu.com> - -For general info and legal blurb, please look in README. - -============================================================== - -This file contains documentation for the sysctl files in -/proc/sys/kernel/ and is valid for Linux kernel version 2.2. - -The files in this directory can be used to tune and monitor -miscellaneous and general things in the operation of the Linux -kernel. Since some of the files _can_ be used to screw up your -system, it is advisable to read both documentation and source -before actually making adjustments. 
- -Currently, these files might (depending on your configuration) -show up in /proc/sys/kernel: - -- acct -- acpi_video_flags -- auto_msgmni -- bootloader_type [ X86 only ] -- bootloader_version [ X86 only ] -- callhome [ S390 only ] -- cap_last_cap -- core_pattern -- core_pipe_limit -- core_uses_pid -- ctrl-alt-del -- dmesg_restrict -- domainname -- hostname -- hotplug -- hardlockup_all_cpu_backtrace -- hardlockup_panic -- hung_task_panic -- hung_task_check_count -- hung_task_timeout_secs -- hung_task_check_interval_secs -- hung_task_warnings -- hyperv_record_panic_msg -- kexec_load_disabled -- kptr_restrict -- l2cr [ PPC only ] -- modprobe ==> Documentation/debugging-modules.txt -- modules_disabled -- msg_next_id [ sysv ipc ] -- msgmax -- msgmnb -- msgmni -- nmi_watchdog -- osrelease -- ostype -- overflowgid -- overflowuid -- panic -- panic_on_oops -- panic_on_stackoverflow -- panic_on_unrecovered_nmi -- panic_on_warn -- panic_print -- panic_on_rcu_stall -- perf_cpu_time_max_percent -- perf_event_paranoid -- perf_event_max_stack -- perf_event_mlock_kb -- perf_event_max_contexts_per_stack -- pid_max -- powersave-nap [ PPC only ] -- printk -- printk_delay -- printk_ratelimit -- printk_ratelimit_burst -- pty ==> Documentation/filesystems/devpts.txt -- randomize_va_space -- real-root-dev ==> Documentation/admin-guide/initrd.rst -- reboot-cmd [ SPARC only ] -- rtsig-max -- rtsig-nr -- seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst -- sem -- sem_next_id [ sysv ipc ] -- sg-big-buff [ generic SCSI device (sg) ] -- shm_next_id [ sysv ipc ] -- shm_rmid_forced -- shmall -- shmmax [ sysv ipc ] -- shmmni -- softlockup_all_cpu_backtrace -- soft_watchdog -- stack_erasing -- stop-a [ SPARC only ] -- sysrq ==> Documentation/admin-guide/sysrq.rst -- sysctl_writes_strict -- tainted -- threads-max -- unknown_nmi_panic -- watchdog -- watchdog_thresh -- version - -============================================================== - -acct: - -highwater lowwater frequency - -If 
BSD-style process accounting is enabled these values control -its behaviour. If free space on filesystem where the log lives -goes below <lowwater>% accounting suspends. If free space gets -above <highwater>% accounting resumes. <Frequency> determines -how often do we check the amount of free space (value is in -seconds). Default: -4 2 30 -That is, suspend accounting if there left <= 2% free; resume it -if we got >=4%; consider information about amount of free space -valid for 30 seconds. - -============================================================== - -acpi_video_flags: - -flags - -See Doc*/kernel/power/video.txt, it allows mode of video boot to be -set during run time. - -============================================================== - -auto_msgmni: - -This variable has no effect and may be removed in future kernel -releases. Reading it always returns 0. -Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni -upon memory add/remove or upon ipc namespace creation/removal. -Echoing "1" into this file enabled msgmni automatic recomputing. -Echoing "0" turned it off. auto_msgmni default value was 1. - - -============================================================== - -bootloader_type: - -x86 bootloader identification - -This gives the bootloader type number as indicated by the bootloader, -shifted left by 4, and OR'd with the low four bits of the bootloader -version. The reason for this encoding is that this used to match the -type_of_loader field in the kernel header; the encoding is kept for -backwards compatibility. That is, if the full bootloader type number -is 0x15 and the full version number is 0x234, this file will contain -the value 340 = 0x154. - -See the type_of_loader and ext_loader_type fields in -Documentation/x86/boot.txt for additional information. - -============================================================== - -bootloader_version: - -x86 bootloader version - -The complete bootloader version number. 
In the example above, this -file will contain the value 564 = 0x234. - -See the type_of_loader and ext_loader_ver fields in -Documentation/x86/boot.txt for additional information. - -============================================================== - -callhome: - -Controls the kernel's callhome behavior in case of a kernel panic. - -The s390 hardware allows an operating system to send a notification -to a service organization (callhome) in case of an operating system panic. - -When the value in this file is 0 (which is the default behavior) -nothing happens in case of a kernel panic. If this value is set to "1" -the complete kernel oops message is send to the IBM customer service -organization in case the mainframe the Linux operating system is running -on has a service contract with IBM. - -============================================================== - -cap_last_cap - -Highest valid capability of the running kernel. Exports -CAP_LAST_CAP from the kernel. - -============================================================== - -core_pattern: - -core_pattern is used to specify a core dumpfile pattern name. -. max length 128 characters; default value is "core" -. core_pattern is used as a pattern template for the output filename; - certain string patterns (beginning with '%') are substituted with - their actual values. -. backward compatibility with core_uses_pid: - If core_pattern does not include "%p" (default does not) - and core_uses_pid is set, then .PID will be appended to - the filename. -. corename format specifiers: - %<NUL> '%' is dropped - %% output one '%' - %p pid - %P global pid (init PID namespace) - %i tid - %I global tid (init PID namespace) - %u uid (in initial user namespace) - %g gid (in initial user namespace) - %d dump mode, matches PR_SET_DUMPABLE and - /proc/sys/fs/suid_dumpable - %s signal number - %t UNIX time of dump - %h hostname - %e executable filename (may be shortened) - %E executable path - %<OTHER> both are dropped -. 
If the first character of the pattern is a '|', the kernel will treat - the rest of the pattern as a command to run. The core dump will be - written to the standard input of that program instead of to a file. - -============================================================== - -core_pipe_limit: - -This sysctl is only applicable when core_pattern is configured to pipe -core files to a user space helper (when the first character of -core_pattern is a '|', see above). When collecting cores via a pipe -to an application, it is occasionally useful for the collecting -application to gather data about the crashing process from its -/proc/pid directory. In order to do this safely, the kernel must wait -for the collecting process to exit, so as not to remove the crashing -processes proc files prematurely. This in turn creates the -possibility that a misbehaving userspace collecting process can block -the reaping of a crashed process simply by never exiting. This sysctl -defends against that. It defines how many concurrent crashing -processes may be piped to user space applications in parallel. If -this value is exceeded, then those crashing processes above that value -are noted via the kernel log and their cores are skipped. 0 is a -special value, indicating that unlimited processes may be captured in -parallel, but that no waiting will take place (i.e. the collecting -process is not guaranteed access to /proc/<crashing pid>/). This -value defaults to 0. - -============================================================== - -core_uses_pid: - -The default coredump filename is "core". By setting -core_uses_pid to 1, the coredump filename becomes core.PID. -If core_pattern does not include "%p" (default does not) -and core_uses_pid is set, then .PID will be appended to -the filename. 
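To make the core_pattern substitution rules and the core_uses_pid fallback above concrete, here is a small user-space sketch. It is not the kernel's implementation: the function name and the values dictionary are hypothetical, only specifiers present in that dictionary are expanded, and the '|' pipe-handler case is deliberately left out.

```python
from typing import Dict, Union


def expand_core_pattern(
    pattern: str,
    values: Dict[str, Union[str, int]],
    core_uses_pid: bool = False,
) -> str:
    """Expand a core_pattern-style template following the documented rules.

    `values` maps a one-letter specifier (for example "p", "e", "s") to the
    value to substitute. The name and shape of this mapping are assumptions
    made for the sketch.
    """
    out = []
    pid_in_pattern = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        # A '%' at the very end of the pattern (%<NUL>) is dropped.
        if i + 1 == len(pattern):
            break
        spec = pattern[i + 1]
        if spec == "%":
            out.append("%")  # %% outputs a single '%'
        elif spec in values:
            out.append(str(values[spec]))
            if spec == "p":
                pid_in_pattern = True
        # %<OTHER>: both characters are dropped, so nothing is appended.
        i += 2
    name = "".join(out)
    # Backward compatibility: append .PID when the pattern has no %p
    # and core_uses_pid is set.
    if core_uses_pid and not pid_in_pattern:
        name += "." + str(values["p"])
    return name


if __name__ == "__main__":
    # Made-up values for a crashing process.
    vals = {"p": 1234, "e": "myprog", "s": 11, "h": "darkstar"}
    print(expand_core_pattern("core", vals, core_uses_pid=True))
    print(expand_core_pattern("core.%e.%p", vals, core_uses_pid=True))
```

Tracking whether %p was actually expanded, rather than searching the pattern for the substring "%p", keeps a literal "%%p" from being mistaken for a pid specifier.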
- -============================================================== - -ctrl-alt-del: - -When the value in this file is 0, ctrl-alt-del is trapped and -sent to the init(1) program to handle a graceful restart. -When, however, the value is > 0, Linux's reaction to a Vulcan -Nerve Pinch (tm) will be an immediate reboot, without even -syncing its dirty buffers. - -Note: when a program (like dosemu) has the keyboard in 'raw' -mode, the ctrl-alt-del is intercepted by the program before it -ever reaches the kernel tty layer, and it's up to the program -to decide what to do with it. - -============================================================== - -dmesg_restrict: - -This toggle indicates whether unprivileged users are prevented -from using dmesg(8) to view messages from the kernel's log buffer. -When dmesg_restrict is set to (0) there are no restrictions. When -dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use -dmesg(8). - -The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the -default value of dmesg_restrict. - -============================================================== - -domainname & hostname: - -These files can be used to set the NIS/YP domainname and the -hostname of your box in exactly the same way as the commands -domainname and hostname, i.e.: -# echo "darkstar" > /proc/sys/kernel/hostname -# echo "mydomain" > /proc/sys/kernel/domainname -has the same effect as -# hostname "darkstar" -# domainname "mydomain" - -Note, however, that the classic darkstar.frop.org has the -hostname "darkstar" and DNS (Internet Domain Name Server) -domainname "frop.org", not to be confused with the NIS (Network -Information Service) or YP (Yellow Pages) domainname. These two -domain names are in general different. For a detailed discussion -see the hostname(1) man page. 
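The echo commands in the section above are ordinary file writes, so the same thing can be done from any program that can open a file. The sketch below assumes hypothetical helper names and adds a root parameter (not part of any kernel interface) so that the helpers can be exercised against a scratch directory; writing under the real /proc/sys requires root.

```python
from pathlib import Path


def write_sysctl(name: str, value: str, root: str = "/proc/sys") -> None:
    """Write `value` to the sysctl file `name` (e.g. "kernel/hostname").

    `root` defaults to /proc/sys; pointing it elsewhere is only useful
    for trying the helper out without touching the running kernel.
    """
    (Path(root) / name).write_text(value + "\n")


def read_sysctl(name: str, root: str = "/proc/sys") -> str:
    """Return the current contents of the sysctl file, without the newline."""
    return (Path(root) / name).read_text().rstrip("\n")


if __name__ == "__main__":
    # Reading is harmless and needs no privileges on a Linux system.
    print(read_sysctl("kernel/hostname"))
```

With root privileges, write_sysctl("kernel/hostname", "darkstar") would correspond to the echo example in the text.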
- -============================================================== -hardlockup_all_cpu_backtrace: - -This value controls the hard lockup detector behavior when a hard -lockup condition is detected as to whether or not to gather further -debug information. If enabled, arch-specific all-CPU stack dumping -will be initiated. - -0: do nothing. This is the default behavior. - -1: on detection capture more debug information. -============================================================== - -hardlockup_panic: - -This parameter can be used to control whether the kernel panics -when a hard lockup is detected. - - 0 - don't panic on hard lockup - 1 - panic on hard lockup - -See Documentation/lockup-watchdogs.txt for more information. This can -also be set using the nmi_watchdog kernel parameter. - -============================================================== - -hotplug: - -Path for the hotplug policy agent. -Default value is "/sbin/hotplug". - -============================================================== - -hung_task_panic: - -Controls the kernel's behavior when a hung task is detected. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - -0: continue operation. This is the default behavior. - -1: panic immediately. - -============================================================== - -hung_task_check_count: - -The upper bound on the number of tasks that are checked. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - -============================================================== - -hung_task_timeout_secs: - -When a task in D state did not get scheduled -for more than this value report a warning. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - -0: means infinite timeout - no checking done. -Possible values to set are in range {0..LONG_MAX/HZ}. - -============================================================== - -hung_task_check_interval_secs: - -Hung task check interval. 
If hung task checking is enabled -(see hung_task_timeout_secs), the check is done every -hung_task_check_interval_secs seconds. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - -0 (default): means use hung_task_timeout_secs as checking interval. -Possible values to set are in range {0..LONG_MAX/HZ}. - -============================================================== - -hung_task_warnings: - -The maximum number of warnings to report. During a check interval -if a hung task is detected, this value is decreased by 1. -When this value reaches 0, no more warnings will be reported. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - --1: report an infinite number of warnings. - -============================================================== - -hyperv_record_panic_msg: - -Controls whether the panic kmsg data should be reported to Hyper-V. - -0: do not report panic kmsg data. - -1: report the panic kmsg data. This is the default behavior. - -============================================================== - -kexec_load_disabled: - -A toggle indicating if the kexec_load syscall has been disabled. This -value defaults to 0 (false: kexec_load enabled), but can be set to 1 -(true: kexec_load disabled). Once true, kexec can no longer be used, and -the toggle cannot be set back to false. This allows a kexec image to be -loaded before disabling the syscall, allowing a system to set up (and -later use) an image without it being altered. Generally used together -with the "modules_disabled" sysctl. - -============================================================== - -kptr_restrict: - -This toggle indicates whether restrictions are placed on -exposing kernel addresses via /proc and other interfaces. - -When kptr_restrict is set to 0 (the default) the address is hashed before -printing. (This is the equivalent to %p.) 
- -When kptr_restrict is set to (1), kernel pointers printed using the %pK -format specifier will be replaced with 0's unless the user has CAP_SYSLOG -and effective user and group ids are equal to the real ids. This is -because %pK checks are done at read() time rather than open() time, so -if permissions are elevated between the open() and the read() (e.g. via -a setuid binary) then %pK will not leak kernel pointers to unprivileged -users. Note, this is a temporary solution only. The correct long-term -solution is to do the permission checks at open() time. Consider removing -world read permissions from files that use %pK, and using dmesg_restrict -to protect against uses of %pK in dmesg(8) if leaking kernel pointer -values to unprivileged users is a concern. - -When kptr_restrict is set to (2), kernel pointers printed using -%pK will be replaced with 0's regardless of privileges. - -============================================================== - -l2cr: (PPC only) - -This flag controls the L2 cache of G3 processor boards. If -0, the cache is disabled. Enabled if nonzero. - -============================================================== - -modules_disabled: - -A toggle value indicating if modules are allowed to be loaded -in an otherwise modular kernel. This toggle defaults to off -(0), but can be set true (1). Once true, modules can be -neither loaded nor unloaded, and the toggle cannot be set back -to false. Generally used with the "kexec_load_disabled" toggle. - -============================================================== - -msg_next_id, sem_next_id, and shm_next_id: - -These three toggles allow specifying the desired id for the next allocated IPC -object: message, semaphore or shared memory respectively. - -By default they are equal to -1, which means generic allocation logic. -Possible values to set are in range {0..INT_MAX}. - -Notes: -1) The kernel doesn't guarantee that the new object will have the desired id.
So, -it's up to userspace, how to handle an object with "wrong" id. -2) Toggle with non-default value will be set back to -1 by kernel after -successful IPC object allocation. If an IPC object allocation syscall -fails, it is undefined if the value remains unmodified or is reset to -1. - -============================================================== - -nmi_watchdog: - -This parameter can be used to control the NMI watchdog -(i.e. the hard lockup detector) on x86 systems. - - 0 - disable the hard lockup detector - 1 - enable the hard lockup detector - -The hard lockup detector monitors each CPU for its ability to respond to -timer interrupts. The mechanism utilizes CPU performance counter registers -that are programmed to generate Non-Maskable Interrupts (NMIs) periodically -while a CPU is busy. Hence, the alternative name 'NMI watchdog'. - -The NMI watchdog is disabled by default if the kernel is running as a guest -in a KVM virtual machine. This default can be overridden by adding - - nmi_watchdog=1 - -to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst). - -============================================================== - -numa_balancing - -Enables/disables automatic page fault based NUMA memory -balancing. Memory is moved automatically to nodes -that access it often. - -Enables/disables automatic NUMA memory balancing. On NUMA machines, there -is a performance penalty if remote memory is accessed by a CPU. When this -feature is enabled the kernel samples what task thread is accessing memory -by periodically unmapping pages and later trapping a page fault. At the -time of the page fault, it is determined if the data being accessed should -be migrated to a local memory node. - -The unmapping of pages and trapping faults incur additional overhead that -ideally is offset by improved memory locality but there is no universal -guarantee. If the target workload is already bound to NUMA nodes then this -feature should be disabled. 
Otherwise, if the system overhead from the -feature is too high then the rate the kernel samples for NUMA hinting -faults may be controlled by the numa_balancing_scan_period_min_ms, -numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, -numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. - -============================================================== - -numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, -numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb - -Automatic NUMA balancing scans tasks address space and unmaps pages to -detect if pages are properly placed or if the data should be migrated to a -memory node local to where the task is running. Every "scan delay" the task -scans the next "scan size" number of pages in its address space. When the -end of the address space is reached the scanner restarts from the beginning. - -In combination, the "scan delay" and "scan size" determine the scan rate. -When "scan delay" decreases, the scan rate increases. The scan delay and -hence the scan rate of every task is adaptive and depends on historical -behaviour. If pages are properly placed then the scan delay increases, -otherwise the scan delay decreases. The "scan size" is not adaptive but -the higher the "scan size", the higher the scan rate. - -Higher scan rates incur higher system overhead as page faults must be -trapped and potentially data must be migrated. However, the higher the scan -rate, the more quickly a tasks memory is migrated to a local node if the -workload pattern changes and minimises performance impact due to remote -memory accesses. These sysctls control the thresholds for scan delays and -the number of pages scanned. - -numa_balancing_scan_period_min_ms is the minimum time in milliseconds to -scan a tasks virtual memory. It effectively controls the maximum scanning -rate for each task. - -numa_balancing_scan_delay_ms is the starting "scan delay" used for a task -when it initially forks. 
- -numa_balancing_scan_period_max_ms is the maximum time in milliseconds to -scan a tasks virtual memory. It effectively controls the minimum scanning -rate for each task. - -numa_balancing_scan_size_mb is how many megabytes worth of pages are -scanned for a given scan. - -============================================================== - -osrelease, ostype & version: - -# cat osrelease -2.1.88 -# cat ostype -Linux -# cat version -#5 Wed Feb 25 21:49:24 MET 1998 - -The files osrelease and ostype should be clear enough. Version -needs a little more clarification however. The '#5' means that -this is the fifth kernel built from this source base and the -date behind it indicates the time the kernel was built. -The only way to tune these values is to rebuild the kernel :-) - -============================================================== - -overflowgid & overflowuid: - -if your architecture did not always support 32-bit UIDs (i.e. arm, -i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to -applications that use the old 16-bit UID/GID system calls, if the -actual UID or GID would exceed 65535. - -These sysctls allow you to change the value of the fixed UID and GID. -The default is 65534. - -============================================================== - -panic: - -The value in this file represents the number of seconds the kernel -waits before rebooting on a panic. When you use the software watchdog, -the recommended setting is 60. - -============================================================== - -panic_on_io_nmi: - -Controls the kernel's behavior when a CPU receives an NMI caused by -an IO error. - -0: try to continue operation (default) - -1: panic immediately. The IO error triggered an NMI. This indicates a - serious system condition which could result in IO data corruption. - Rather than continuing, panicking might be a better choice. 
Some - servers issue this sort of NMI when the dump button is pushed, - and you can use this option to take a crash dump. - -============================================================== - -panic_on_oops: - -Controls the kernel's behaviour when an oops or BUG is encountered. - -0: try to continue operation - -1: panic immediately. If the `panic' sysctl is also non-zero then the - machine will be rebooted. - -============================================================== - -panic_on_stackoverflow: - -Controls the kernel's behavior when detecting the overflows of -kernel, IRQ and exception stacks except a user stack. -This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. - -0: try to continue operation. - -1: panic immediately. - -============================================================== - -panic_on_unrecovered_nmi: - -The default Linux behaviour on an NMI of either memory or unknown is -to continue operation. For many environments such as scientific -computing it is preferable that the box is taken out and the error -dealt with than an uncorrected parity/ECC error get propagated. - -A small number of systems do generate NMI's for bizarre random reasons -such as power management so the default is off. That sysctl works like -the existing panic controls already in that directory. - -============================================================== - -panic_on_warn: - -Calls panic() in the WARN() path when set to 1. This is useful to avoid -a kernel rebuild when attempting to kdump at the location of a WARN(). - -0: only WARN(), default behaviour. - -1: call panic() after printing out WARN() location. - -============================================================== - -panic_print: - -Bitmask for printing system info when panic happens. 
User can choose -combination of the following bits: - -bit 0: print all tasks info -bit 1: print system memory info -bit 2: print timer info -bit 3: print locks info if CONFIG_LOCKDEP is on -bit 4: print ftrace buffer - -So for example to print tasks and memory info on panic, user can: - echo 3 > /proc/sys/kernel/panic_print - -============================================================== - -panic_on_rcu_stall: - -When set to 1, calls panic() after RCU stall detection messages. This -is useful to define the root cause of RCU stalls using a vmcore. - -0: do not panic() when RCU stall takes place, default behavior. - -1: panic() after printing RCU stall messages. - -============================================================== - -perf_cpu_time_max_percent: - -Hints to the kernel how much CPU time it should be allowed to -use to handle perf sampling events. If the perf subsystem -is informed that its samples are exceeding this limit, it -will drop its sampling frequency to attempt to reduce its CPU -usage. - -Some perf sampling happens in NMIs. If these samples -unexpectedly take too long to execute, the NMIs can become -stacked up next to each other so much that nothing else is -allowed to execute. - -0: disable the mechanism. Do not monitor or correct perf's - sampling rate no matter how much CPU time it takes. - -1-100: attempt to throttle perf's sample rate to this - percentage of CPU. Note: the kernel calculates an - "expected" length of each sample event. 100 here means - 100% of that expected length. Even if this is set to - 100, you may still see sample throttling if this - length is exceeded. Set to 0 if you truly do not care - how much CPU is consumed. - -============================================================== - -perf_event_paranoid: - -Controls use of the performance events system by unprivileged -users (without CAP_SYS_ADMIN). The default value is 2.
- - -1: Allow use of (almost) all events by all users - Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK ->=0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN - Disallow raw tracepoint access by users without CAP_SYS_ADMIN ->=1: Disallow CPU event access by users without CAP_SYS_ADMIN ->=2: Disallow kernel profiling by users without CAP_SYS_ADMIN - -============================================================== - -perf_event_max_stack: - -Controls maximum number of stack frames to copy for (attr.sample_type & -PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using -'perf record -g' or 'perf trace --call-graph fp'. - -This can only be done when no events are in use that have callchains -enabled, otherwise writing to this file will return -EBUSY. - -The default value is 127. - -============================================================== - -perf_event_mlock_kb: - -Control size of per-cpu ring buffer not counted against mlock limit. - -The default value is 512 + 1 page - -============================================================== - -perf_event_max_contexts_per_stack: - -Controls maximum number of stack frame context entries for -(attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for -instance, when using 'perf record -g' or 'perf trace --call-graph fp'. - -This can only be done when no events are in use that have callchains -enabled, otherwise writing to this file will return -EBUSY. - -The default value is 8. - -============================================================== - -pid_max: - -PID allocation wrap value. When the kernel's next PID value -reaches this value, it wraps back to a minimum PID value. -PIDs of value pid_max or larger are not allocated. - -============================================================== - -ns_last_pid: - -The last pid allocated in the current (the one task using this sysctl -lives in) pid namespace.
When selecting a pid for a next task on fork -kernel tries to allocate a number starting from this one. - -============================================================== - -powersave-nap: (PPC only) - -If set, Linux-PPC will use the 'nap' mode of powersaving, -otherwise the 'doze' mode will be used. - -============================================================== - -printk: - -The four values in printk denote: console_loglevel, -default_message_loglevel, minimum_console_loglevel and -default_console_loglevel respectively. - -These values influence printk() behavior when printing or -logging error messages. See 'man 2 syslog' for more info on -the different loglevels. - -- console_loglevel: messages with a higher priority than - this will be printed to the console -- default_message_loglevel: messages without an explicit priority - will be printed with this priority -- minimum_console_loglevel: minimum (highest) value to which - console_loglevel can be set -- default_console_loglevel: default value for console_loglevel - -============================================================== - -printk_delay: - -Delay each printk message by printk_delay milliseconds - -Value from 0 - 10000 is allowed. - -============================================================== - -printk_ratelimit: - -Some warning messages are rate limited. printk_ratelimit specifies -the minimum length of time between these messages (in seconds); by -default we allow one every 5 seconds. - -A value of 0 will disable rate limiting. - -============================================================== - -printk_ratelimit_burst: - -While long term we enforce one message per printk_ratelimit -seconds, we do allow a burst of messages to pass through. -printk_ratelimit_burst specifies the number of messages we can -send before ratelimiting kicks in.
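The interplay of printk_ratelimit and printk_ratelimit_burst can be illustrated with a small toy model. This is only a sketch of one plausible reading of the two paragraphs above (up to `burst` messages are let through per `interval` seconds, and an interval of 0 disables limiting); it is not the kernel's code. The class name RateLimit and its attributes are hypothetical.

```python
class RateLimit:
    """Toy model: allow at most `burst` messages per `interval` seconds."""

    def __init__(self, interval: float, burst: int):
        self.interval = interval  # plays the role of printk_ratelimit
        self.burst = burst        # plays the role of printk_ratelimit_burst
        self.begin = None         # start time of the current window
        self.printed = 0          # messages let through in this window

    def allow(self, now: float) -> bool:
        """Return True if a message arriving at time `now` may be printed."""
        if self.interval == 0:
            return True  # a value of 0 disables rate limiting
        if self.begin is None or now - self.begin > self.interval:
            # The previous window has expired: start a fresh one.
            self.begin = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        return False  # burst used up; suppressed until the window expires
```

With interval=5 and burst=3, the first three messages in a window pass and later ones are dropped until more than five seconds have elapsed since the window began.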
- -============================================================== - -printk_devkmsg: - -Control the logging to /dev/kmsg from userspace: - -ratelimit: default, ratelimited -on: unlimited logging to /dev/kmsg from userspace -off: logging to /dev/kmsg disabled - -The kernel command line parameter printk.devkmsg= overrides this and is -a one-time setting until next reboot: once set, it cannot be changed by -this sysctl interface anymore. - -============================================================== - -randomize_va_space: - -This option can be used to select the type of process address -space randomization that is used in the system, for architectures -that support this feature. - -0 - Turn the process address space randomization off. This is the - default for architectures that do not support this feature anyways, - and kernels that are booted with the "norandmaps" parameter. - -1 - Make the addresses of mmap base, stack and VDSO page randomized. - This, among other things, implies that shared libraries will be - loaded to random addresses. Also for PIE-linked binaries, the - location of code start is randomized. This is the default if the - CONFIG_COMPAT_BRK option is enabled. - -2 - Additionally enable heap randomization. This is the default if - CONFIG_COMPAT_BRK is disabled. - - There are a few legacy applications out there (such as some ancient - versions of libc.so.5 from 1996) that assume that brk area starts - just after the end of the code+bss. These applications break when - start of the brk area is randomized. There are however no known - non-legacy applications that would be broken this way, so for most - systems it is safe to choose full randomization. - - Systems with ancient and/or broken binaries should be configured - with CONFIG_COMPAT_BRK enabled, which excludes the heap from process - address space randomization. - -============================================================== - -reboot-cmd: (Sparc only) - -??? 
This seems to be a way to give an argument to the Sparc -ROM/Flash boot loader. Maybe to tell it what to do after -rebooting. ??? - -============================================================== - -rtsig-max & rtsig-nr: - -The file rtsig-max can be used to tune the maximum number -of POSIX realtime (queued) signals that can be outstanding -in the system. - -rtsig-nr shows the number of RT signals currently queued. - -============================================================== - -sched_schedstats: - -Enables/disables scheduler statistics. Enabling this feature -incurs a small amount of overhead in the scheduler but is -useful for debugging and performance tuning. - -============================================================== - -sg-big-buff: - -This file shows the size of the generic SCSI (sg) buffer. -You can't tune it just yet, but you could change it on -compile time by editing include/scsi/sg.h and changing -the value of SG_BIG_BUFF. - -There shouldn't be any reason to change this value. If -you can come up with one, you probably know what you -are doing anyway :) - -============================================================== - -shmall: - -This parameter sets the total amount of shared memory pages that -can be used system wide. Hence, SHMALL should always be at least -ceil(shmmax/PAGE_SIZE). - -If you are not sure what the default PAGE_SIZE is on your Linux -system, you can run the following command: - -# getconf PAGE_SIZE - -============================================================== - -shmmax: - -This value can be used to query and set the run time limit -on the maximum shared memory segment size that can be created. -Shared memory segments up to 1Gb are now supported in the -kernel. This value defaults to SHMMAX. - -============================================================== - -shm_rmid_forced: - -Linux lets you set resource limits, including how much memory one -process can consume, via setrlimit(2). 
Unfortunately, shared memory -segments are allowed to exist without association with any process, and -thus might not be counted against any resource limits. If enabled, -shared memory segments are automatically destroyed when their attach -count becomes zero after a detach or a process termination. It will -also destroy segments that were created, but never attached to, on exit -from the process. The only use left for IPC_RMID is to immediately -destroy an unattached segment. Of course, this breaks the way things are -defined, so some applications might stop working. Note that this -feature will do you no good unless you also configure your resource -limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't -need this. - -Note that if you change this from 0 to 1, already created segments -without users and with a dead originative process will be destroyed. - -============================================================== - -sysctl_writes_strict: - -Control how file position affects the behavior of updating sysctl values -via the /proc/sys interface: - - -1 - Legacy per-write sysctl value handling, with no printk warnings. - Each write syscall must fully contain the sysctl value to be - written, and multiple writes on the same sysctl file descriptor - will rewrite the sysctl value, regardless of file position. - 0 - Same behavior as above, but warn about processes that perform writes - to a sysctl file descriptor when the file position is not 0. - 1 - (default) Respect file position when writing sysctl strings. Multiple - writes will append to the sysctl value buffer. Anything past the max - length of the sysctl value buffer will be ignored. Writes to numeric - sysctl entries must always be at file position 0 and the value must - be fully contained in the buffer sent in the write syscall. 
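The three sysctl_writes_strict modes for string entries can be pictured with a toy model. This sketch only mirrors the wording above and is not the kernel's implementation. The function name write_string and the parameters current, data, pos, mode and maxlen are hypothetical, maxlen is assumed to count characters kept, and numeric entries and writes positioned past the end of the current value are deliberately not modelled.

```python
def write_string(current: str, data: str, pos: int, mode: int, maxlen: int) -> str:
    """Return the new value of a string sysctl after one write at file position pos."""
    if mode in (-1, 0):
        # Legacy handling: every write replaces the value, whatever the
        # file position. (Mode 0 would additionally warn when pos != 0.)
        prefix = ""
    elif mode == 1:
        if pos > len(current):
            raise ValueError("write past the end of the value is not modelled")
        # Strict handling: keep what lies before pos and continue from there.
        prefix = current[:pos]
    else:
        raise ValueError("mode must be -1, 0 or 1")
    # Anything beyond the maximum length of the value buffer is ignored.
    return (prefix + data)[:maxlen]
```

For example, two successive writes of "hello" and "world" on one file descriptor leave "helloworld" in mode 1 but only "world" in modes -1 and 0.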
- -============================================================== - -softlockup_all_cpu_backtrace: - -This value controls the soft lockup detector thread's behavior -when a soft lockup condition is detected as to whether or not -to gather further debug information. If enabled, each cpu will -be issued an NMI and instructed to capture stack trace. - -This feature is only applicable for architectures which support -NMI. - -0: do nothing. This is the default behavior. - -1: on detection capture more debug information. - -============================================================== - -soft_watchdog - -This parameter can be used to control the soft lockup detector. - - 0 - disable the soft lockup detector - 1 - enable the soft lockup detector - -The soft lockup detector monitors CPUs for threads that are hogging the CPUs -without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads -from running. The mechanism depends on the CPUs ability to respond to timer -interrupts which are needed for the 'watchdog/N' threads to be woken up by -the watchdog timer function, otherwise the NMI watchdog - if enabled - can -detect a hard lockup condition. - -============================================================== - -stack_erasing - -This parameter can be used to control kernel stack erasing at the end -of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. - -That erasing reduces the information which kernel stack leak bugs -can reveal and blocks some uninitialized stack variable attacks. -The tradeoff is the performance impact: on a single CPU system kernel -compilation sees a 1% slowdown, other systems and workloads may vary. - - 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. - - 1: kernel stack erasing is enabled (default), it is performed before - returning to the userspace at the end of syscalls. - -============================================================== - -tainted: - -Non-zero if the kernel has been tainted. 
Numeric values, which can be -ORed together. The letters are seen in the "Tainted" line of Oops reports. - - 1 (P): A module with a non-GPL license has been loaded, this - includes modules with no license. - Set by modutils >= 2.4.9 and module-init-tools. - 2 (F): A module was force loaded by insmod -f. - Set by modutils >= 2.4.9 and module-init-tools. - 4 (S): Unsafe SMP processors: SMP with CPUs not designed for SMP. - 8 (R): A module was forcibly unloaded from the system by rmmod -f. - 16 (M): A hardware machine check error occurred on the system. - 32 (B): A bad page was discovered on the system. - 64 (U): The user has asked that the system be marked "tainted". This - could be because they are running software that directly modifies - the hardware, or for other reasons. - 128 (D): The system has died. - 256 (A): The ACPI DSDT has been overridden with one supplied by the user - instead of using the one provided by the hardware. - 512 (W): A kernel warning has occurred. - 1024 (C): A module from drivers/staging was loaded. - 2048 (I): The system is working around a severe firmware bug. - 4096 (O): An out-of-tree module has been loaded. - 8192 (E): An unsigned module has been loaded in a kernel supporting module - signature. - 16384 (L): A soft lockup has previously occurred on the system. - 32768 (K): The kernel has been live patched. - 65536 (X): Auxiliary taint, defined for and used by distros. -131072 (T): The kernel was built with the struct randomization plugin. - -============================================================== - -threads-max - -This value controls the maximum number of threads that can be created -using fork(). - -During initialization the kernel sets this value such that even if the -maximum number of threads is created, the thread structures occupy only -a part (1/8th) of the available RAM pages. - -The minimum value that can be written to threads-max is 20.
-The maximum value that can be written to threads-max is given by the -constant FUTEX_TID_MASK (0x3fffffff). -If a value outside of this range is written to threads-max an error -EINVAL occurs. - -The value written is checked against the available RAM pages. If the -thread structures would occupy too much (more than 1/8th) of the -available RAM pages threads-max is reduced accordingly. - -============================================================== - -unknown_nmi_panic: - -The value in this file affects behavior of handling NMI. When the -value is non-zero, unknown NMI is trapped and then panic occurs. At -that time, kernel debugging information is displayed on console. - -NMI switch that most IA32 servers have fires unknown NMI up, for -example. If a system hangs up, try pressing the NMI switch. - -============================================================== - -watchdog: - -This parameter can be used to disable or enable the soft lockup detector -_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time. - - 0 - disable both lockup detectors - 1 - enable both lockup detectors - -The soft lockup detector and the NMI watchdog can also be disabled or -enabled individually, using the soft_watchdog and nmi_watchdog parameters. -If the watchdog parameter is read, for example by executing - - cat /proc/sys/kernel/watchdog - -the output of this command (0 or 1) shows the logical OR of soft_watchdog -and nmi_watchdog. - -============================================================== - -watchdog_cpumask: - -This value can be used to control on which cpus the watchdog may run. -The default cpumask is all possible cores, but if NO_HZ_FULL is -enabled in the kernel config, and cores are specified with the -nohz_full= boot argument, those cores are excluded by default. -Offline cores can be included in this mask, and if the core is later -brought online, the watchdog will be started based on the mask value. 
- -Typically this value would only be touched in the nohz_full case -to re-enable cores that by default were not running the watchdog, -if a kernel lockup was suspected on those cores. - -The argument value is the standard cpulist format for cpumasks, -so for example to enable the watchdog on cores 0, 2, 3, and 4 you -might say: - - echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask - -============================================================== - -watchdog_thresh: - -This value can be used to control the frequency of hrtimer and NMI -events and the soft and hard lockup thresholds. The default threshold -is 10 seconds. - -The softlockup threshold is (2 * watchdog_thresh). Setting this -tunable to zero will disable lockup detection altogether. - -============================================================== diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt deleted file mode 100644 index 2793d4eac55f..000000000000 --- a/Documentation/sysctl/net.txt +++ /dev/null @@ -1,407 +0,0 @@ -Documentation for /proc/sys/net/* - (c) 1999 Terrehon Bowden <terrehon@pacbell.net> - Bodo Bauer <bb@ricochet.net> - (c) 2000 Jorge Nerin <comandante@zaralinux.com> - (c) 2009 Shen Feng <shen@cn.fujitsu.com> - -For general info and legal blurb, please look in README. - -============================================================== - -This file contains the documentation for the sysctl files in -/proc/sys/net - -The interface to the networking parts of the kernel is located in -/proc/sys/net. The following table shows all possible subdirectories. You may -see only some of them, depending on your kernel's configuration. - - -Table : Subdirectories in /proc/sys/net -.............................................................................. 
- Directory Content Directory Content - core General parameter appletalk Appletalk protocol - unix Unix domain sockets netrom NET/ROM - 802 E802 protocol ax25 AX25 - ethernet Ethernet protocol rose X.25 PLP layer - ipv4 IP version 4 x25 X.25 protocol - ipx IPX token-ring IBM token ring - bridge Bridging decnet DEC net - ipv6 IP version 6 tipc TIPC -.............................................................................. - -1. /proc/sys/net/core - Network core options -------------------------------------------------------- - -bpf_jit_enable --------------- - -This enables the BPF Just in Time (JIT) compiler. BPF is a flexible -and efficient infrastructure allowing to execute bytecode at various -hook points. It is used in a number of Linux kernel subsystems such -as networking (e.g. XDP, tc), tracing (e.g. kprobes, uprobes, tracepoints) -and security (e.g. seccomp). LLVM has a BPF back end that can compile -restricted C into a sequence of BPF instructions. After program load -through bpf(2) and passing a verifier in the kernel, a JIT will then -translate these BPF proglets into native CPU instructions. There are -two flavors of JITs, the newer eBPF JIT currently supported on: - - x86_64 - - x86_32 - - arm64 - - arm32 - - ppc64 - - sparc64 - - mips64 - - s390x - -And the older cBPF JIT supported on the following archs: - - mips - - ppc - - sparc - -eBPF JITs are a superset of cBPF JITs, meaning the kernel will -migrate cBPF instructions into eBPF instructions and then JIT -compile them transparently. Older cBPF JITs can only translate -tcpdump filters, seccomp rules, etc, but not mentioned eBPF -programs loaded through bpf(2). - -Values : - 0 - disable the JIT (default value) - 1 - enable the JIT - 2 - enable the JIT and ask the compiler to emit traces on kernel log. - -bpf_jit_harden --------------- - -This enables hardening for the BPF JIT compiler. Supported are eBPF -JIT backends. Enabling hardening trades off performance, but can -mitigate JIT spraying. 
-Values : - 0 - disable JIT hardening (default value) - 1 - enable JIT hardening for unprivileged users only - 2 - enable JIT hardening for all users - -bpf_jit_kallsyms ----------------- - -When BPF JIT compiler is enabled, then compiled images are unknown -addresses to the kernel, meaning they neither show up in traces nor -in /proc/kallsyms. This enables export of these addresses, which can -be used for debugging/tracing. If bpf_jit_harden is enabled, this -feature is disabled. -Values : - 0 - disable JIT kallsyms export (default value) - 1 - enable JIT kallsyms export for privileged users only - -bpf_jit_limit -------------- - -This enforces a global limit for memory allocations to the BPF JIT -compiler in order to reject unprivileged JIT requests once it has -been surpassed. bpf_jit_limit contains the value of the global limit -in bytes. - -dev_weight --------------- - -The maximum number of packets that kernel can handle on a NAPI interrupt, -it's a Per-CPU variable. For drivers that support LRO or GRO_HW, a hardware -aggregated packet is counted as one packet in this context. - -Default: 64 - -dev_weight_rx_bias --------------- - -RPS (e.g. RFS, aRFS) processing is competing with the registered NAPI poll function -of the driver for the per softirq cycle netdev_budget. This parameter influences -the proportion of the configured netdev_budget that is spent on RPS based packet -processing during RX softirq cycles. It is further meant for making current -dev_weight adaptable for asymmetric CPU needs on RX/TX side of the network stack. -(see dev_weight_tx_bias) It is effective on a per CPU basis. Determination is based -on dev_weight and is calculated multiplicative (dev_weight * dev_weight_rx_bias). -Default: 1 - -dev_weight_tx_bias --------------- - -Scales the maximum number of packets that can be processed during a TX softirq cycle. -Effective on a per CPU basis. Allows scaling of current dev_weight for asymmetric -net stack processing needs. 
Be careful to avoid making TX softirq processing a CPU hog. -Calculation is based on dev_weight (dev_weight * dev_weight_tx_bias). -Default: 1 - -default_qdisc --------------- - -The default queuing discipline to use for network devices. This allows -overriding the default of pfifo_fast with an alternative. Since the default -queuing discipline is created without additional parameters so is best suited -to queuing disciplines that work well without configuration like stochastic -fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Don't use -queuing disciplines like Hierarchical Token Bucket or Deficit Round Robin -which require setting up classes and bandwidths. Note that physical multiqueue -interfaces still use mq as root qdisc, which in turn uses this default for its -leaves. Virtual devices (like e.g. lo or veth) ignore this setting and instead -default to noqueue. -Default: pfifo_fast - -busy_read ----------------- -Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL) -Approximate time in us to busy loop waiting for packets on the device queue. -This sets the default value of the SO_BUSY_POLL socket option. -Can be set or overridden per socket by setting socket option SO_BUSY_POLL, -which is the preferred method of enabling. If you need to enable the feature -globally via sysctl, a value of 50 is recommended. -Will increase power usage. -Default: 0 (off) - -busy_poll ----------------- -Low latency busy poll timeout for poll and select. (needs CONFIG_NET_RX_BUSY_POLL) -Approximate time in us to busy loop waiting for events. -Recommended value depends on the number of sockets you poll on. -For several sockets 50, for several hundreds 100. -For more than that you probably want to use epoll. -Note that only sockets with SO_BUSY_POLL set will be busy polled, -so you want to either selectively set SO_BUSY_POLL on those sockets or set -sysctl.net.busy_read globally. -Will increase power usage. 
-Default: 0 (off) - -rmem_default ------------- - -The default setting of the socket receive buffer in bytes. - -rmem_max --------- - -The maximum receive socket buffer size in bytes. - -tstamp_allow_data ------------------ -Allow processes to receive tx timestamps looped together with the original -packet contents. If disabled, transmit timestamp requests from unprivileged -processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set. -Default: 1 (on) - - -wmem_default ------------- - -The default setting (in bytes) of the socket send buffer. - -wmem_max --------- - -The maximum send socket buffer size in bytes. - -message_burst and message_cost ------------------------------- - -These parameters are used to limit the warning messages written to the kernel -log from the networking code. They enforce a rate limit to make a -denial-of-service attack impossible. A higher message_cost factor, results in -fewer messages that will be written. Message_burst controls when messages will -be dropped. The default settings limit warning messages to one every five -seconds. - -warnings --------- - -This sysctl is now unused. - -This was used to control console messages from the networking stack that -occur because of problems on the network like duplicate address or bad -checksums. - -These messages are now emitted at KERN_DEBUG and can generally be enabled -and controlled by the dynamic_debug facility. - -netdev_budget -------------- - -Maximum number of packets taken from all interfaces in one polling cycle (NAPI -poll). In one polling cycle interfaces which are registered to polling are -probed in a round-robin manner. Also, a polling cycle may not exceed -netdev_budget_usecs microseconds, even if netdev_budget has not been -exhausted. - -netdev_budget_usecs ---------------------- - -Maximum number of microseconds in one NAPI polling cycle. 
Polling -will exit when either netdev_budget_usecs have elapsed during the -poll cycle or the number of packets processed reaches netdev_budget. - -netdev_max_backlog ------------------- - -Maximum number of packets, queued on the INPUT side, when the interface -receives packets faster than kernel can process them. - -netdev_rss_key --------------- - -RSS (Receive Side Scaling) enabled drivers use a 40 bytes host key that is -randomly generated. -Some user space might need to gather its content even if drivers do not -provide ethtool -x support yet. - -myhost:~# cat /proc/sys/net/core/netdev_rss_key -84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total) - -File contains nul bytes if no driver ever called netdev_rss_key_fill() function. -Note: -/proc/sys/net/core/netdev_rss_key contains 52 bytes of key, -but most drivers only use 40 bytes of it. - -myhost:~# ethtool -x eth0 -RX flow hash indirection table for eth0 with 8 RX ring(s): - 0: 0 1 2 3 4 5 6 7 -RSS hash key: -84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89 - -netdev_tstamp_prequeue ----------------------- - -If set to 0, RX packet timestamps can be sampled after RPS processing, when -the target CPU processes packets. It might give some delay on timestamps, but -permit to distribute the load on several cpus. - -If set to 1 (default), timestamps are sampled as soon as possible, before -queueing. - -optmem_max ----------- - -Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence -of struct cmsghdr structures with appended data. - -fb_tunnels_only_for_init_net ----------------------------- - -Controls if fallback tunnels (like tunl0, gre0, gretap0, erspan0, -sit0, ip6tnl0, ip6gre0) are automatically created when a new -network namespace is created, if corresponding tunnel is present -in initial network namespace. 
-If set to 1, these devices are not automatically created, and -user space is responsible for creating them if needed. - -Default : 0 (for compatibility reasons) - -2. /proc/sys/net/unix - Parameters for Unix domain sockets -------------------------------------------------------- - -There is only one file in this directory. -unix_dgram_qlen limits the max number of datagrams queued in a Unix domain -socket's buffer. It will not take effect unless the PF_UNIX flag is specified. - - -3. /proc/sys/net/ipv4 - IPV4 settings -------------------------------------------------------- -Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for -descriptions of these entries. - - -4. Appletalk -------------------------------------------------------- - -The /proc/sys/net/appletalk directory holds the Appletalk configuration data -when Appletalk is loaded. The configurable parameters are: - -aarp-expiry-time ----------------- - -The amount of time we keep an ARP entry before expiring it. Used to age out -old hosts. - -aarp-resolve-time ------------------ - -The amount of time we will spend trying to resolve an Appletalk address. - -aarp-retransmit-limit ---------------------- - -The number of times we will retransmit a query before giving up. - -aarp-tick-time --------------- - -Controls the rate at which expires are checked. - -The directory /proc/net/appletalk holds the list of active Appletalk sockets -on a machine. - -The fields indicate the DDP type, the local address (in network:node format), -the remote address, the size of the transmit pending queue, the size of the -received queue (bytes waiting for applications to read), the state and the uid -owning the socket. - -/proc/net/atalk_iface lists all the interfaces configured for appletalk. It -shows the name of the interface, its Appletalk address, the network range on -that address (or network number for phase 1 networks), and the status of the -interface. - -/proc/net/atalk_route lists each known network route.
It lists the target -(network) that the route leads to, the router (may be directly connected), the -route flags, and the device the route is using. - - -5. IPX -------------------------------------------------------- - -The IPX protocol has no tunable values in proc/sys/net. - -The IPX protocol does, however, provide proc/net/ipx. This lists each IPX -socket giving the local and remote addresses in Novell format (that is -network:node:port). In accordance with the strange Novell tradition, -everything but the port is in hex. Not_Connected is displayed for sockets that -are not tied to a specific remote address. The Tx and Rx queue sizes indicate -the number of bytes pending for transmission and reception. The state -indicates the state the socket is in and the uid is the owning uid of the -socket. - -The /proc/net/ipx_interface file lists all IPX interfaces. For each interface -it gives the network number, the node number, and indicates if the network is -the primary network. It also indicates which device it is bound to (or -Internal for internal networks) and the Frame Type if appropriate. Linux -supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for -IPX. - -The /proc/net/ipx_route table holds a list of IPX routes. For each route it -gives the destination network, the router node (or Directly) and the network -address of the router (or Connected) for internal networks. - -6. TIPC -------------------------------------------------------- - -tipc_rmem ----------- - -The TIPC protocol now has a tunable for the receive memory, similar to the -tcp_rmem - i.e. a vector of 3 INTEGERs: (min, default, max) - - # cat /proc/sys/net/tipc/tipc_rmem - 4252725 34021800 68043600 - # - -The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values -are scaled (shifted) versions of that same value. 
Note that the min value -is not at this point in time used in any meaningful way, but the triplet is -preserved in order to be consistent with things like tcp_rmem. - -named_timeout --------------- - -TIPC name table updates are distributed asynchronously in a cluster, without -any form of transaction handling. This means that different race scenarios are -possible. One such is that a name withdrawal sent out by one node and received -by another node may arrive after a second, overlapping name publication already -has been accepted from a third node, although the conflicting updates -originally may have been issued in the correct sequential order. -If named_timeout is nonzero, failed topology updates will be placed on a defer -queue until another event arrives that clears the error, or until the timeout -expires. Value is in milliseconds. diff --git a/Documentation/sysctl/sunrpc.txt b/Documentation/sysctl/sunrpc.txt deleted file mode 100644 index ae1ecac6f85a..000000000000 --- a/Documentation/sysctl/sunrpc.txt +++ /dev/null @@ -1,20 +0,0 @@ -Documentation for /proc/sys/sunrpc/* kernel version 2.2.10 - (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> - -For general info and legal blurb, please look in README. - -============================================================== - -This file contains the documentation for the sysctl files in -/proc/sys/sunrpc and is valid for Linux kernel version 2.2. - -The files in this directory can be used to (re)set the debug -flags of the SUN Remote Procedure Call (RPC) subsystem in -the Linux kernel. This stuff is used for NFS, KNFSD and -maybe a few other things as well. - -The files in there are used to control the debugging flags: -rpc_debug, nfs_debug, nfsd_debug and nlm_debug. - -These flags are for kernel hackers only. You should read the -source code in net/sunrpc/ for more information. 
diff --git a/Documentation/sysctl/user.txt b/Documentation/sysctl/user.txt deleted file mode 100644 index a5882865836e..000000000000 --- a/Documentation/sysctl/user.txt +++ /dev/null @@ -1,66 +0,0 @@ -Documentation for /proc/sys/user/* kernel version 4.9.0 - (c) 2016 Eric Biederman <ebiederm@xmission.com> - -============================================================== - -This file contains the documentation for the sysctl files in -/proc/sys/user. - -The files in this directory can be used to override the default -limits on the number of namespaces and other objects that have -per user per user namespace limits. - -The primary purpose of these limits is to stop programs that -malfunction and attempt to create a ridiculous number of objects, -before the malfunction becomes a system wide problem. It is the -intention that the defaults of these limits are set high enough that -no program in normal operation should run into these limits. - -The creation of per user per user namespace objects are charged to -the user in the user namespace who created the object and -verified to be below the per user limit in that user namespace. - -The creation of objects is also charged to all of the users -who created user namespaces the creation of the object happens -in (user namespaces can be nested) and verified to be below the per user -limits in the user namespaces of those users. - -This recursive counting of created objects ensures that creating a -user namespace does not allow a user to escape their current limits. - -Currently, these files are in /proc/sys/user: - -- max_cgroup_namespaces - - The maximum number of cgroup namespaces that any user in the current - user namespace may create. - -- max_ipc_namespaces - - The maximum number of ipc namespaces that any user in the current - user namespace may create. - -- max_mnt_namespaces - - The maximum number of mount namespaces that any user in the current - user namespace may create. 
- -- max_net_namespaces - - The maximum number of network namespaces that any user in the - current user namespace may create. - -- max_pid_namespaces - - The maximum number of pid namespaces that any user in the current - user namespace may create. - -- max_user_namespaces - - The maximum number of user namespaces that any user in the current - user namespace may create. - -- max_uts_namespaces - - The maximum number of uts namespaces that any user in the current - user namespace may create. diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt deleted file mode 100644 index 187ce4f599a2..000000000000 --- a/Documentation/sysctl/vm.txt +++ /dev/null @@ -1,934 +0,0 @@ -Documentation for /proc/sys/vm/* kernel version 2.6.29 - (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> - (c) 2008 Peter W. Morreale <pmorreale@novell.com> - -For general info and legal blurb, please look in README. - -============================================================== - -This file contains the documentation for the sysctl files in -/proc/sys/vm and is valid for Linux kernel version 2.6.29. - -The files in this directory can be used to tune the operation -of the virtual memory (VM) subsystem of the Linux kernel and -the writeout of dirty data to disk. - -Default values and initialization routines for most of these -files can be found in mm/swap.c.
- -Currently, these files are in /proc/sys/vm: - -- admin_reserve_kbytes -- block_dump -- compact_memory -- compact_unevictable_allowed -- dirty_background_bytes -- dirty_background_ratio -- dirty_bytes -- dirty_expire_centisecs -- dirty_ratio -- dirtytime_expire_seconds -- dirty_writeback_centisecs -- drop_caches -- extfrag_threshold -- hugetlb_shm_group -- laptop_mode -- legacy_va_layout -- lowmem_reserve_ratio -- max_map_count -- memory_failure_early_kill -- memory_failure_recovery -- min_free_kbytes -- min_slab_ratio -- min_unmapped_ratio -- mmap_min_addr -- mmap_rnd_bits -- mmap_rnd_compat_bits -- nr_hugepages -- nr_hugepages_mempolicy -- nr_overcommit_hugepages -- nr_trim_pages (only if CONFIG_MMU=n) -- numa_zonelist_order -- oom_dump_tasks -- oom_kill_allocating_task -- overcommit_kbytes -- overcommit_memory -- overcommit_ratio -- page-cluster -- panic_on_oom -- percpu_pagelist_fraction -- stat_interval -- stat_refresh -- numa_stat -- swappiness -- user_reserve_kbytes -- vfs_cache_pressure -- watermark_boost_factor -- watermark_scale_factor -- zone_reclaim_mode - -============================================================== - -admin_reserve_kbytes - -The amount of free memory in the system that should be reserved for users -with the capability cap_sys_admin. - -admin_reserve_kbytes defaults to min(3% of free pages, 8MB) - -That should provide enough for the admin to log in and kill a process, -if necessary, under the default overcommit 'guess' mode. - -Systems running under overcommit 'never' should increase this to account -for the full Virtual Memory Size of programs used to recover. Otherwise, -root may not be able to log in to recover the system. - -How do you calculate a minimum useful reserve? - -sshd or login + bash (or some other shell) + top (or ps, kill, etc.) - -For overcommit 'guess', we can sum resident set sizes (RSS). -On x86_64 this is about 8MB. 
- -For overcommit 'never', we can take the max of their virtual sizes (VSZ) -and add the sum of their RSS. -On x86_64 this is about 128MB. - -Changing this takes effect whenever an application requests memory. - -============================================================== - -block_dump - -block_dump enables block I/O debugging when set to a nonzero value. More -information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. - -============================================================== - -compact_memory - -Available only when CONFIG_COMPACTION is set. When 1 is written to the file, -all zones are compacted such that free memory is available in contiguous -blocks where possible. This can be important for example in the allocation of -huge pages although processes will also directly compact memory as required. - -============================================================== - -compact_unevictable_allowed - -Available only when CONFIG_COMPACTION is set. When set to 1, compaction is -allowed to examine the unevictable lru (mlocked pages) for pages to compact. -This should be used on systems where stalls for minor page faults are an -acceptable trade for large contiguous free memory. Set to 0 to prevent -compaction from moving pages that are unevictable. Default value is 1. - -============================================================== - -dirty_background_bytes - -Contains the amount of dirty memory at which the background kernel -flusher threads will start writeback. - -Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only -one of them may be specified at a time. When one sysctl is written it is -immediately taken into account to evaluate the dirty memory limits and the -other appears as 0 when read. 
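As a rough illustration of the note above, the following Python sketch models how dirty_background_bytes and dirty_background_ratio exclude each other. It is a toy model and not kernel code: the class name, the method names and the starting ratio of 10 are assumptions made for this example.

```python
class DirtyBackgroundKnobs:
    """Toy model of the dirty_background_bytes / dirty_background_ratio pair."""

    def __init__(self, ratio=10):
        # Assumed starting state for the example: the ratio form is active.
        self.bytes = 0
        self.ratio = ratio

    def write_bytes(self, value):
        # Writing the byte limit makes the ratio read back as 0.
        self.bytes = value
        self.ratio = 0

    def write_ratio(self, value):
        # Writing the ratio makes the byte limit read back as 0.
        self.ratio = value
        self.bytes = 0

    def active(self):
        # Report which of the two knobs is currently in effect.
        if self.bytes:
            return ("bytes", self.bytes)
        return ("ratio", self.ratio)


knobs = DirtyBackgroundKnobs()
knobs.write_bytes(64 * 1024 * 1024)
print(knobs.active(), "ratio reads as", knobs.ratio)
```

Whichever knob was written last is the one that applies; the other one simply reads as 0 until it is written again.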
- -============================================================== - -dirty_background_ratio - -Contains, as a percentage of total available memory that contains free pages -and reclaimable pages, the number of pages at which the background kernel -flusher threads will start writing out dirty data. - -The total available memory is not equal to total system memory. - -============================================================== - -dirty_bytes - -Contains the amount of dirty memory at which a process generating disk writes -will itself start writeback. - -Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be -specified at a time. When one sysctl is written it is immediately taken into -account to evaluate the dirty memory limits and the other appears as 0 when -read. - -Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any -value lower than this limit will be ignored and the old configuration will be -retained. - -============================================================== - -dirty_expire_centisecs - -This tunable is used to define when dirty data is old enough to be eligible -for writeout by the kernel flusher threads. It is expressed in 100'ths -of a second. Data which has been dirty in-memory for longer than this -interval will be written out next time a flusher thread wakes up. - -============================================================== - -dirty_ratio - -Contains, as a percentage of total available memory that contains free pages -and reclaimable pages, the number of pages at which a process which is -generating disk writes will itself start writing out dirty data. - -The total available memory is not equal to total system memory. - -============================================================== - -dirtytime_expire_seconds - -When a lazytime inode is constantly having its pages dirtied, the inode with -an updated timestamp will never get chance to be written out. 
And, if the -only thing that has happened on the file system is a dirtytime inode caused -by an atime update, a worker will be scheduled to make sure that inode -eventually gets pushed out to disk. This tunable is used to define when dirty -inode is old enough to be eligible for writeback by the kernel flusher threads. -And, it is also used as the interval to wakeup dirtytime_writeback thread. - -============================================================== - -dirty_writeback_centisecs - -The kernel flusher threads will periodically wake up and write `old' data -out to disk. This tunable expresses the interval between those wakeups, in -100'ths of a second. - -Setting this to zero disables periodic writeback altogether. - -============================================================== - -drop_caches - -Writing to this will cause the kernel to drop clean caches, as well as -reclaimable slab objects like dentries and inodes. Once dropped, their -memory becomes free. - -To free pagecache: - echo 1 > /proc/sys/vm/drop_caches -To free reclaimable slab objects (includes dentries and inodes): - echo 2 > /proc/sys/vm/drop_caches -To free slab objects and pagecache: - echo 3 > /proc/sys/vm/drop_caches - -This is a non-destructive operation and will not free any dirty objects. -To increase the number of objects freed by this operation, the user may run -`sync' prior to writing to /proc/sys/vm/drop_caches. This will minimize the -number of dirty objects on the system and create more candidates to be -dropped. - -This file is not a means to control the growth of the various kernel caches -(inodes, dentries, pagecache, etc...) These objects are automatically -reclaimed by the kernel when memory is needed elsewhere on the system. - -Use of this file can cause performance problems. Since it discards cached -objects, it may cost a significant amount of I/O and CPU to recreate the -dropped objects, especially if they were under heavy use. 
Because of this, -use outside of a testing or debugging environment is not recommended. - -You may see informational messages in your kernel log when this file is -used: - - cat (1234): drop_caches: 3 - -These are informational only. They do not mean that anything is wrong -with your system. To disable them, echo 4 (bit 2) into drop_caches. - -============================================================== - -extfrag_threshold - -This parameter affects whether the kernel will compact memory or direct -reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in -debugfs shows what the fragmentation index for each order is in each zone in -the system. Values tending towards 0 imply allocations would fail due to lack -of memory, values towards 1000 imply failures are due to fragmentation and -1 -implies that the allocation will succeed as long as watermarks are met. - -The kernel will not compact memory in a zone if the -fragmentation index is <= extfrag_threshold. The default value is 500. - -============================================================== - -highmem_is_dirtyable - -Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems). - -This parameter controls whether the high memory is considered for dirty -writers throttling. This is not the case by default, which means that -only the amount of memory directly visible/usable by the kernel can -be dirtied. As a result, on systems with a large amount of memory and -lowmem basically depleted, writers might be throttled too early and -streaming writes can get very slow. - -Changing the value to nonzero would allow more memory to be dirtied -and thus allow writers to write more data which can be flushed to the -storage more effectively. Note this also comes with a risk of premature -OOM kills because some writers (e.g. direct block device writes) can -only use the low memory and they can fill it up with dirty data without -any throttling.
- -============================================================== - -hugetlb_shm_group - -hugetlb_shm_group contains group id that is allowed to create SysV -shared memory segment using hugetlb page. - -============================================================== - -laptop_mode - -laptop_mode is a knob that controls "laptop mode". All the things that are -controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. - -============================================================== - -legacy_va_layout - -If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel -will use the legacy (2.4) layout for all processes. - -============================================================== - -lowmem_reserve_ratio - -For some specialised workloads on highmem machines it is dangerous for -the kernel to allow process memory to be allocated from the "lowmem" -zone. This is because that memory could then be pinned via the mlock() -system call, or by unavailability of swapspace. - -And on large highmem machines this lack of reclaimable lowmem memory -can be fatal. - -So the Linux page allocator has a mechanism which prevents allocations -which _could_ use highmem from using too much lowmem. This means that -a certain amount of lowmem is defended from the possibility of being -captured into pinned user memory. - -(The same argument applies to the old 16 megabyte ISA DMA region. This -mechanism will also defend that region from allocations which could use -highmem or lowmem). - -The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is -in defending these lower zones. - -If you have a machine which uses highmem or ISA DMA and your -applications are using mlock(), or if you are running with no swap then -you probably should change the lowmem_reserve_ratio setting. - -The lowmem_reserve_ratio is an array. You can see them by reading this file. 
-- -% cat /proc/sys/vm/lowmem_reserve_ratio -256 256 32 -- - -But, these values are not used directly. The kernel calculates the number of protection -pages for each zone from them. These are shown as an array of protection pages -in /proc/zoneinfo as follows. (This is an example from an x86-64 box). -Each zone has an array of protection pages like this. - -- -Node 0, zone DMA - pages free 1355 - min 3 - low 3 - high 4 - : - : - numa_other 0 - protection: (0, 2004, 2004, 2004) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - pagesets - cpu: 0 pcp: 0 - : -- -These protections are added to the score used to judge whether this zone should be used -for page allocation or should be reclaimed. - -In this example, if normal pages (index=2) are requested from this DMA zone and -watermark[WMARK_HIGH] is used as the watermark, the kernel judges that this zone should -not be used because pages_free(1355) is smaller than watermark + protection[2] -(4 + 2004 = 2008). If this protection value were 0, this zone would be used for -the normal page request. If the request is for the DMA zone (index=0), protection[0] -(=0) is used. - -zone[i]'s protection[j] is calculated by the following expression. - -(i < j): - zone[i]->protection[j] - = (total sums of managed_pages from zone[i+1] to zone[j] on the node) - / lowmem_reserve_ratio[i]; -(i = j): - (should not be protected) = 0; -(i > j): - (not necessary, but looks 0) - -The default values of lowmem_reserve_ratio[i] are - 256 (if zone[i] means DMA or DMA32 zone) - 32 (others). -As the expression above shows, they are the reciprocals of the ratios. -256 means 1/256. The number of protection pages becomes about "0.39%" of total managed -pages of higher zones on the node. - -If you would like to protect more pages, smaller values are effective. -The minimum value is 1 (1/1 -> 100%). A value less than 1 completely -disables protection of the pages. - -============================================================== - -max_map_count: - -This file contains the maximum number of memory map areas a process -may have.
Memory map areas are used as a side-effect of calling -malloc, directly by mmap, mprotect, and madvise, and also when loading -shared libraries. - -While most applications need less than a thousand maps, certain -programs, particularly malloc debuggers, may consume lots of them, -e.g., up to one or two maps per allocation. - -The default value is 65536. - -============================================================= - -memory_failure_early_kill: - -Control how to kill processes when uncorrected memory error (typically -a 2bit error in a memory module) is detected in the background by hardware -that cannot be handled by the kernel. In some cases (like the page -still having a valid copy on disk) the kernel will handle the failure -transparently without affecting any applications. But if there is -no other uptodate copy of the data it will kill to prevent any data -corruptions from propagating. - -1: Kill all processes that have the corrupted and not reloadable page mapped -as soon as the corruption is detected. Note this is not supported -for a few types of pages, like kernel internally allocated data or -the swap cache, but works for the majority of user pages. - -0: Only unmap the corrupted page from all processes and only kill a process -who tries to access it. - -The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can -handle this if they want to. - -This is only active on architectures/platforms with advanced machine -check handling and depends on the hardware capabilities. - -Applications can override this setting individually with the PR_MCE_KILL prctl - -============================================================== - -memory_failure_recovery - -Enable memory failure recovery (when supported by the platform) - -1: Attempt recovery. - -0: Always panic on a memory failure. 
- -============================================================== - -min_free_kbytes: - -This is used to force the Linux VM to keep a minimum number -of kilobytes free. The VM uses this number to compute a -watermark[WMARK_MIN] value for each lowmem zone in the system. -Each lowmem zone gets a number of reserved free pages based -proportionally on its size. - -Some minimal amount of memory is needed to satisfy PF_MEMALLOC -allocations; if you set this lower than 1024KB, your system will -become subtly broken, and prone to deadlock under high loads. - -Setting this too high will OOM your machine instantly. - -============================================================= - -min_slab_ratio: - -This is available only on NUMA kernels. - -This is a percentage of the total pages in each zone. On zone reclaim -(when fallback from the local zone occurs) slabs will be reclaimed if more -than this percentage of pages in a zone are reclaimable slab pages. -This ensures that the slab growth stays under control even in NUMA -systems that rarely perform global reclaim. - -The default is 5 percent. - -Note that slab reclaim is triggered in a per zone / node fashion. -The process of reclaiming slab memory is currently not node specific -and may not be fast. - -============================================================= - -min_unmapped_ratio: - -This is available only on NUMA kernels. - -This is a percentage of the total pages in each zone. Zone reclaim will -only occur if more than this percentage of pages are in a state that -zone_reclaim_mode allows to be reclaimed. - -If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared -against all file-backed unmapped pages including swapcache pages and tmpfs -files. Otherwise, only unmapped pages backed by normal files but not tmpfs -files and similar are considered. - -The default is 1 percent.
- -============================================================== - -mmap_min_addr - -This file indicates the amount of address space which a user process will -be restricted from mmapping. Since kernel null dereference bugs could -accidentally operate based on the information in the first couple of pages -of memory userspace processes should not be allowed to write to them. By -default this value is set to 0 and no protections will be enforced by the -security module. Setting this value to something like 64k will allow the -vast majority of applications to work correctly and provide defense in depth -against future potential kernel bugs. - -============================================================== - -mmap_rnd_bits: - -This value can be used to select the number of bits to use to -determine the random offset to the base address of vma regions -resulting from mmap allocations on architectures which support -tuning address space randomization. This value will be bounded -by the architecture's minimum and maximum supported values. - -This value can be changed after boot using the -/proc/sys/vm/mmap_rnd_bits tunable - -============================================================== - -mmap_rnd_compat_bits: - -This value can be used to select the number of bits to use to -determine the random offset to the base address of vma regions -resulting from mmap allocations for applications run in -compatibility mode on architectures which support tuning address -space randomization. This value will be bounded by the -architecture's minimum and maximum supported values. - -This value can be changed after boot using the -/proc/sys/vm/mmap_rnd_compat_bits tunable - -============================================================== - -nr_hugepages - -Change the minimum size of the hugepage pool. 
- -See Documentation/admin-guide/mm/hugetlbpage.rst - -============================================================== - -nr_hugepages_mempolicy - -Change the size of the hugepage pool at run-time on a specific -set of NUMA nodes. - -See Documentation/admin-guide/mm/hugetlbpage.rst - -============================================================== - -nr_overcommit_hugepages - -Change the maximum size of the hugepage pool. The maximum is -nr_hugepages + nr_overcommit_hugepages. - -See Documentation/admin-guide/mm/hugetlbpage.rst - -============================================================== - -nr_trim_pages - -This is available only on NOMMU kernels. - -This value adjusts the excess page trimming behaviour of power-of-2 aligned -NOMMU mmap allocations. - -A value of 0 disables trimming of allocations entirely, while a value of 1 -trims excess pages aggressively. Any value >= 1 acts as the watermark where -trimming of allocations is initiated. - -The default value is 1. - -See Documentation/nommu-mmap.txt for more information. - -============================================================== - -numa_zonelist_order - -This sysctl is only for NUMA and it is deprecated. Anything but -Node order will fail! - -'where the memory is allocated from' is controlled by zonelists. -(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. - you may be able to read ZONE_DMA as ZONE_DMA32...) - -In non-NUMA case, a zonelist for GFP_KERNEL is ordered as following. -ZONE_NORMAL -> ZONE_DMA -This means that a memory allocation request for GFP_KERNEL will -get memory from ZONE_DMA only when ZONE_NORMAL is not available. - -In NUMA case, you can think of following 2 types of order. -Assume 2 node NUMA and below is zonelist of Node(0)'s GFP_KERNEL - -(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL -(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. 
- -Type(A) offers the best locality for processes on Node(0), but ZONE_DMA -will be used before ZONE_NORMAL exhaustion. This increases the possibility of -out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small. - -Type(B) cannot offer the best locality but is more robust against OOM of -the DMA zone. - -Type(A) is called "Node" order. Type(B) is called "Zone" order. - -"Node order" orders the zonelists by node, then by zone within each node. -Specify "[Nn]ode" for node order. - -"Zone Order" orders the zonelists by zone type, then by node within each -zone. Specify "[Zz]one" for zone order. - -Specify "[Dd]efault" to request automatic configuration. - -On 32-bit, the Normal zone needs to be preserved for allocations accessible -by the kernel, so "zone" order will be selected. - -On 64-bit, devices that require DMA32/DMA are relatively rare, so "node" -order will be selected. - -Default order is recommended unless this is causing problems for your -system/application. - -============================================================== - -oom_dump_tasks - -Enables a system-wide task dump (excluding kernel threads) to be produced -when the kernel performs an OOM kill. The dump includes such information as -pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj -score, and name. This is helpful to determine why the OOM killer was -invoked, to identify the rogue task that caused it, and to determine why -the OOM killer chose the task it did to kill. - -If this is set to zero, this information is suppressed. On very -large systems with thousands of tasks it may not be feasible to dump -the memory state information for each one. Such systems should not -be forced to incur a performance penalty in OOM conditions when the -information may not be desired. - -If this is set to non-zero, this information is shown whenever the -OOM killer actually kills a memory-hogging task. - -The default value is 1 (enabled).
- -============================================================== - -oom_kill_allocating_task - -This enables or disables killing the OOM-triggering task in -out-of-memory situations. - -If this is set to zero, the OOM killer will scan through the entire -tasklist and select a task based on heuristics to kill. This normally -selects a rogue memory-hogging task that frees up a large amount of -memory when killed. - -If this is set to non-zero, the OOM killer simply kills the task that -triggered the out-of-memory condition. This avoids the expensive -tasklist scan. - -If panic_on_oom is selected, it takes precedence over whatever value -is used in oom_kill_allocating_task. - -The default value is 0. - -============================================================== - -overcommit_kbytes: - -When overcommit_memory is set to 2, the committed address space is not -permitted to exceed swap plus this amount of physical RAM. See below. - -Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one -of them may be specified at a time. Setting one disables the other (which -then appears as 0 when read). - -============================================================== - -overcommit_memory: - -This value contains a flag that enables memory overcommitment. - -When this flag is 0, the kernel attempts to estimate the amount -of free memory left when userspace requests more memory. - -When this flag is 1, the kernel pretends there is always enough -memory until it actually runs out. - -When this flag is 2, the kernel uses a "never overcommit" -policy that attempts to prevent any overcommit of memory. -Note that user_reserve_kbytes affects this policy. - -This feature can be very useful because there are a lot of -programs that malloc() huge amounts of memory "just-in-case" -and don't use much of it. - -The default value is 0. - -See Documentation/vm/overcommit-accounting.rst and -mm/util.c::__vm_enough_memory() for more information. 
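The mode-2 limit described for overcommit_kbytes above ("swap plus this amount of physical RAM") can be sketched as simple arithmetic. This is only an illustration under stated assumptions: the function name commit_limit_kb is hypothetical, overcommit_ratio is taken as a percentage of physical RAM, and the sketch ignores hugepages and the reserve sysctls that the real accounting in mm/util.c::__vm_enough_memory() also considers.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_kbytes=0, overcommit_ratio=0):
    """Approximate overcommit_memory=2 limit in kilobytes.

    Hypothetical helper: swap plus the allowed share of physical RAM.
    Only one of overcommit_kbytes / overcommit_ratio is in effect at a
    time; the other one reads as 0.
    """
    if overcommit_kbytes:
        # Absolute amount of RAM that may be committed.
        allowed_ram_kb = overcommit_kbytes
    else:
        # Percentage of physical RAM that may be committed.
        allowed_ram_kb = ram_kb * overcommit_ratio // 100
    return swap_kb + allowed_ram_kb


if __name__ == "__main__":
    # 8 GB of RAM (decimal kB for readability), 2 GB of swap, ratio of 50.
    print(commit_limit_kb(8_000_000, 2_000_000, overcommit_ratio=50))
```

The two knobs are kept as separate arguments to mirror the note that setting one disables the other.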
- -============================================================== - -overcommit_ratio: - -When overcommit_memory is set to 2, the committed address -space is not permitted to exceed swap plus this percentage -of physical RAM. See above. - -============================================================== - -page-cluster - -page-cluster controls the number of pages up to which consecutive pages -are read in from swap in a single attempt. This is the swap counterpart -to page cache readahead. -The mentioned consecutivity is not in terms of virtual/physical addresses, -but consecutive on swap space - that means they were swapped out together. - -It is a logarithmic value - setting it to zero means "1 page", setting -it to 1 means "2 pages", setting it to 2 means "4 pages", etc. -Zero disables swap readahead completely. - -The default value is three (eight pages at a time). There may be some -small benefits in tuning this to a different value if your workload is -swap-intensive. - -Lower values mean lower latencies for initial faults, but at the same time -extra faults and I/O delays for following faults if they would have been part of -that consecutive pages readahead would have brought in. - -============================================================= - -panic_on_oom - -This enables or disables panic on out-of-memory feature. - -If this is set to 0, the kernel will kill some rogue process, -called oom_killer. Usually, oom_killer can kill rogue processes and -system will survive. - -If this is set to 1, the kernel panics when out-of-memory happens. -However, if a process limits using nodes by mempolicy/cpusets, -and those nodes become memory exhaustion status, one process -may be killed by oom-killer. No panic occurs in this case. -Because other nodes' memory may be free. This means system total status -may be not fatal yet. - -If this is set to 2, the kernel panics compulsorily even on the -above-mentioned. 
Even if OOM happens under a memory cgroup, the whole -system panics. - -The default value is 0. -1 and 2 are for failover of clustering. Please select either -according to your policy of failover. -panic_on_oom=2+kdump gives you a very strong tool to investigate -why oom happens. You can get a snapshot. - -============================================================= - -percpu_pagelist_fraction - -This is the fraction of pages at most (high mark pcp->high) in each zone that -are allocated for each per cpu page list. The min value for this is 8. It -means that we don't allow more than 1/8th of pages in each zone to be -allocated in any single per_cpu_pagelist. This entry only changes the value -of hot per cpu pagelists. A user can specify a number like 100 to allocate -1/100th of each zone to each per cpu page list. - -The batch value of each per cpu pagelist is also updated as a result. It is -set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8). - -The initial value is zero. The kernel does not use this value at boot time to set -the high water marks for each per cpu page list. If the user writes '0' to this -sysctl, it will revert to this default behavior. - -============================================================== - -stat_interval - -The time interval between which vm statistics are updated. The default -is 1 second. - -============================================================== - -stat_refresh - -Any read or write (by root only) flushes all the per-cpu vm statistics -into their global totals, for more accurate reports when testing -e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo - -As a side-effect, it also checks for negative totals (elsewhere reported -as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. -(At time of writing, a few stats are sometimes known to be found negative, -with no ill effects: errors and warnings on these stats are suppressed.)
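The percpu_pagelist_fraction arithmetic described earlier (pcp->high as a fraction of the zone, batch as pcp->high/4 capped at PAGE_SHIFT * 8) can be sketched as follows. This is an illustration only: the function name pcp_limits is hypothetical, PAGE_SHIFT = 12 assumes 4 KiB pages, and the floor of 1 on the batch value is an assumption that the text does not state.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def pcp_limits(zone_pages, fraction):
    """Return (high, batch) for one per-cpu page list, or None.

    Hypothetical helper following the description of
    percpu_pagelist_fraction; not the kernel's implementation.
    """
    if fraction == 0:
        # Zero means the boot-time defaults stay in place.
        return None
    if fraction < 8:
        raise ValueError("the minimum value for this sysctl is 8")
    high = zone_pages // fraction
    # batch is high/4, at least 1 (assumed), at most PAGE_SHIFT * 8.
    batch = min(max(1, high // 4), PAGE_SHIFT * 8)
    return high, batch


if __name__ == "__main__":
    # A zone of one million pages with the sysctl set to 100.
    print(pcp_limits(1_000_000, 100))
```

With large zones the cap on batch dominates, so raising the fraction mostly changes the high mark.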
- -============================================================== - -numa_stat - -This interface allows runtime configuration of numa statistics. - -When page allocation performance becomes a bottleneck and you can tolerate -some possible tool breakage and decreased numa counter precision, you can -do: - echo 0 > /proc/sys/vm/numa_stat - -When page allocation performance is not a bottleneck and you want all -tooling to work, you can do: - echo 1 > /proc/sys/vm/numa_stat - -============================================================== - -swappiness - -This control is used to define how aggressive the kernel will swap -memory pages. Higher values will increase aggressiveness, lower values -decrease the amount of swap. A value of 0 instructs the kernel not to -initiate swap until the amount of free and file-backed pages is less -than the high water mark in a zone. - -The default value is 60. - -============================================================== - -- user_reserve_kbytes - -When overcommit_memory is set to 2, "never overcommit" mode, reserve -min(3% of current process size, user_reserve_kbytes) of free memory. -This is intended to prevent a user from starting a single memory hogging -process, such that they cannot recover (kill the hog). - -user_reserve_kbytes defaults to min(3% of the current process size, 128MB). - -If this is reduced to zero, then the user will be allowed to allocate -all free memory with a single process, minus admin_reserve_kbytes. -Any subsequent attempts to execute a command will result in -"fork: Cannot allocate memory". - -Changing this takes effect whenever an application requests memory. - -============================================================== - -vfs_cache_pressure ------------------- - -This percentage value controls the tendency of the kernel to reclaim -the memory which is used for caching of directory and inode objects. 
- -At the default value of vfs_cache_pressure=100 the kernel will attempt to -reclaim dentries and inodes at a "fair" rate with respect to pagecache and -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will -never reclaim dentries and inodes due to memory pressure and this can easily -lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 -causes the kernel to prefer to reclaim dentries and inodes. - -Increasing vfs_cache_pressure significantly beyond 100 may have negative -performance impact. Reclaim code needs to take various locks to find freeable -directory and inode objects. With vfs_cache_pressure=1000, it will look for -ten times more freeable objects than there are. - -============================================================= - -watermark_boost_factor: - -This factor controls the level of reclaim when memory is being fragmented. -It defines the percentage of the high watermark of a zone that will be -reclaimed if pages of different mobility are being mixed within pageblocks. -The intent is that compaction has less work to do in the future and to -increase the success rate of future high-order allocations such as SLUB -allocations, THP and hugetlbfs pages. - -To make it sensible with respect to the watermark_scale_factor parameter, -the unit is in fractions of 10,000. The default value of 15,000 means -that up to 150% of the high watermark will be reclaimed in the event of -a pageblock being mixed due to fragmentation. The level of reclaim is -determined by the number of fragmentation events that occurred in the -recent past. If this value is smaller than a pageblock then a pageblocks -worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor -of 0 will disable the feature. - -============================================================= - -watermark_scale_factor: - -This factor controls the aggressiveness of kswapd. 
It defines the -amount of memory left in a node/system before kswapd is woken up and -how much memory needs to be free before kswapd goes back to sleep. - -The unit is in fractions of 10,000. The default value of 10 means the -distances between watermarks are 0.1% of the available memory in the -node/system. The maximum value is 1000, or 10% of memory. - -A high rate of threads entering direct reclaim (allocstall) or kswapd -going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate -that the number of free pages kswapd maintains for latency reasons is -too small for the allocation bursts occurring in the system. This knob -can then be used to tune kswapd aggressiveness accordingly. - -============================================================== - -zone_reclaim_mode: - -Zone_reclaim_mode allows someone to set more or less aggressive approaches to -reclaim memory when a zone runs out of memory. If it is set to zero then no -zone reclaim occurs. Allocations will be satisfied from other zones / nodes -in the system. - -This is value ORed together of - -1 = Zone reclaim on -2 = Zone reclaim writes dirty pages out -4 = Zone reclaim swaps pages - -zone_reclaim_mode is disabled by default. For file servers or workloads -that benefit from having their data cached, zone_reclaim_mode should be -left disabled as the caching effect is likely to be more important than -data locality. - -zone_reclaim may be enabled if it's known that the workload is partitioned -such that each partition fits within a NUMA node and that accessing remote -memory would cause a measurable performance reduction. The page allocator -will then reclaim easily reusable pages (those page cache pages that are -currently not used) before allocating off node pages. - -Allowing zone reclaim to write out pages stops processes that are -writing large amounts of data from dirtying pages on other nodes. 
Zone -reclaim will write out dirty pages if a zone fills up and so effectively -throttle the process. This may decrease the performance of a single process -since it cannot use all of system memory to buffer the outgoing writes -anymore, but it preserves the memory on other nodes so that the performance -of other processes running on other nodes will not be affected. - -Allowing regular swap effectively restricts allocations to the local -node unless explicitly overridden by memory policies or cpuset -configurations. - -============ End of Document =================================