diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/DocBook/device-drivers.tmpl | 5 | ||||
-rw-r--r-- | Documentation/DocBook/kernel-api.tmpl | 6 | ||||
-rw-r--r-- | Documentation/feature-removal-schedule.txt | 10 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/00-INDEX | 4 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/idmapper.txt | 67 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/nfsroot.txt | 22 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/pnfs.txt | 48 | ||||
-rw-r--r-- | Documentation/filesystems/proc.txt | 14 | ||||
-rw-r--r-- | Documentation/hwmon/ltc4261 | 63 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 5 | ||||
-rw-r--r-- | Documentation/misc-devices/apds990x.txt | 111 | ||||
-rw-r--r-- | Documentation/misc-devices/bh1770glc.txt | 116 | ||||
-rw-r--r-- | Documentation/sysrq.txt | 7 | ||||
-rw-r--r-- | Documentation/timers/hpet_example.c | 27 | ||||
-rw-r--r-- | Documentation/trace/postprocess/trace-vmscan-postprocess.pl | 39 | ||||
-rw-r--r-- | Documentation/vm/highmem.txt | 162 |
16 files changed, 686 insertions, 20 deletions
diff --git a/Documentation/DocBook/device-drivers.tmpl b/Documentation/DocBook/device-drivers.tmpl index feca0758391e..22edcbb9ddaf 100644 --- a/Documentation/DocBook/device-drivers.tmpl +++ b/Documentation/DocBook/device-drivers.tmpl @@ -51,8 +51,13 @@ <sect1><title>Delaying, scheduling, and timer routines</title> !Iinclude/linux/sched.h !Ekernel/sched.c +!Iinclude/linux/completion.h !Ekernel/timer.c </sect1> + <sect1><title>Wait queues and Wake events</title> +!Iinclude/linux/wait.h +!Ekernel/wait.c + </sect1> <sect1><title>High-resolution timers</title> !Iinclude/linux/ktime.h !Iinclude/linux/hrtimer.h diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index 6b4e07f28b69..7160652a8736 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -93,6 +93,12 @@ X!Ilib/string.c !Elib/crc32.c !Elib/crc-ccitt.c </sect1> + + <sect1 id="idr"><title>idr/ida Functions</title> +!Pinclude/linux/idr.h idr sync +!Plib/idr.c IDA description +!Elib/idr.c + </sect1> </chapter> <chapter id="mm"> diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index e833c8c81e69..d2af87ba96e1 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -535,3 +535,13 @@ Why: Hareware scan is the prefer method for iwlwifi devices for Who: Wey-Yi Guy <wey-yi.w.guy@intel.com> ---------------------------- + +What: access to nfsd auth cache through sys_nfsservctl or '.' files + in the 'nfsd' filesystem. +When: 2.6.40 +Why: This is a legacy interface which have been replaced by a more + dynamic cache. Continuing to maintain this interface is an + unnecessary burden. +Who: NeilBrown <neilb@suse.de> + +---------------------------- diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX index 2f68cd688769..a57e12411d2a 100644 --- a/Documentation/filesystems/nfs/00-INDEX +++ b/Documentation/filesystems/nfs/00-INDEX @@ -12,5 +12,9 @@ nfs-rdma.txt - how to install and setup the Linux NFS/RDMA client and server software nfsroot.txt - short guide on setting up a diskless box with NFS root filesystem. +pnfs.txt + - short explanation of some of the internals of the pnfs client code rpc-cache.txt - introduction to the caching mechanisms in the sunrpc layer. +idmapper.txt + - information for configuring request-keys to be used by idmapper diff --git a/Documentation/filesystems/nfs/idmapper.txt b/Documentation/filesystems/nfs/idmapper.txt new file mode 100644 index 000000000000..b9b4192ea8b5 --- /dev/null +++ b/Documentation/filesystems/nfs/idmapper.txt @@ -0,0 +1,67 @@ + +========= +ID Mapper +========= +Id mapper is used by NFS to translate user and group ids into names, and to +translate user and group names into ids. Part of this translation involves +performing an upcall to userspace to request the information. Id mapper will +user request-key to perform this upcall and cache the result. The program +/usr/sbin/nfs.idmap should be called by request-key, and will perform the +translation and initialize a key with the resulting information. + + NFS_USE_NEW_IDMAPPER must be selected when configuring the kernel to use this + feature. + +=========== +Configuring +=========== +The file /etc/request-key.conf will need to be modified so /sbin/request-key can +direct the upcall. The following line should be added: + +#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ... +#====== ======= =============== =============== =============================== +create id_resolver * * /usr/sbin/nfs.idmap %k %d 600 + +This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap. +The last parameter, 600, defines how many seconds into the future the key will +expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout +is not specified, nfs.idmap will default to 600 seconds. + +id mapper uses for key descriptions: + uid: Find the UID for the given user + gid: Find the GID for the given group + user: Find the user name for the given UID + group: Find the group name for the given GID + +You can handle any of these individually, rather than using the generic upcall +program. If you would like to use your own program for a uid lookup then you +would edit your request-key.conf so it look similar to this: + +#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ... +#====== ======= =============== =============== =============================== +create id_resolver uid:* * /some/other/program %k %d 600 +create id_resolver * * /usr/sbin/nfs.idmap %k %d 600 + +Notice that the new line was added above the line for the generic program. +request-key will find the first matching line and corresponding program. In +this case, /some/other/program will handle all uid lookups and +/usr/sbin/nfs.idmap will handle gid, user, and group lookups. + +See <file:Documentation/keys-request-keys.txt> for more information about the +request-key function. + + +========= +nfs.idmap +========= +nfs.idmap is designed to be called by request-key, and should not be run "by +hand". This program takes two arguments, a serialized key and a key +description. The serialized key is first converted into a key_serial_t, and +then passed as an argument to keyctl_instantiate (both are part of keyutils.h). + +The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap +determines the correct function to call by looking at the first part of the +description string. For example, a uid lookup description will appear as +"uid:user@domain". + +nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise. diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt index f2430a7974e1..90c71c6f0d00 100644 --- a/Documentation/filesystems/nfs/nfsroot.txt +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -159,6 +159,28 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf> Default: any +nfsrootdebug + + This parameter enables debugging messages to appear in the kernel + log at boot time so that administrators can verify that the correct + NFS mount options, server address, and root path are passed to the + NFS client. + + +rdinit=<executable file> + + To specify which file contains the program that starts system + initialization, administrators can use this command line parameter. + The default value of this parameter is "/init". If the specified + file exists and the kernel can execute it, root filesystem related + kernel command line parameters, including `nfsroot=', are ignored. + + A description of the process of mounting the root file system can be + found in: + + Documentation/early-userspace/README + + 3.) Boot Loader diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt new file mode 100644 index 000000000000..bc0b9cfe095b --- /dev/null +++ b/Documentation/filesystems/nfs/pnfs.txt @@ -0,0 +1,48 @@ +Reference counting in pnfs: +========================== + +The are several inter-related caches. We have layouts which can +reference multiple devices, each of which can reference multiple data servers. +Each data server can be referenced by multiple devices. Each device +can be referenced by multiple layouts. To keep all of this straight, +we need to reference count. + + +struct pnfs_layout_hdr +---------------------- +The on-the-wire command LAYOUTGET corresponds to struct +pnfs_layout_segment, usually referred to by the variable name lseg. +Each nfs_inode may hold a pointer to a cache of of these layout +segments in nfsi->layout, of type struct pnfs_layout_hdr. + +We reference the header for the inode pointing to it, across each +outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN, +LAYOUTCOMMIT), and for each lseg held within. + +Each header is also (when non-empty) put on a list associated with +struct nfs_client (cl_layouts). Being put on this list does not bump +the reference count, as the layout is kept around by the lseg that +keeps it in the list. + +deviceid_cache +-------------- +lsegs reference device ids, which are resolved per nfs_client and +layout driver type. The device ids are held in a RCU cache (struct +nfs4_deviceid_cache). The cache itself is referenced across each +mount. The entries (struct nfs4_deviceid) themselves are held across +the lifetime of each lseg referencing them. + +RCU is used because the deviceid is basically a write once, read many +data structure. The hlist size of 32 buckets needs better +justification, but seems reasonable given that we can have multiple +deviceid's per filesystem, and multiple filesystems per nfs_client. + +The hash code is copied from the nfsd code base. A discussion of +hashing and variations of this algorithm can be found at: +http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809 + +data server cache +----------------- +file driver devices refer to data servers, which are kept in a module +level cache. Its reference is held over the lifetime of the deviceid +pointing to it. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index a6aca8740883..a563b74c7aef 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -374,13 +374,13 @@ Swap: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB -The first of these lines shows the same information as is displayed for the -mapping in /proc/PID/maps. The remaining lines show the size of the mapping, -the amount of the mapping that is currently resident in RAM, the "proportional -set size” (divide each shared page by the number of processes sharing it), the -number of clean and dirty shared pages in the mapping, and the number of clean -and dirty private pages in the mapping. The "Referenced" indicates the amount -of memory currently marked as referenced or accessed. +The first of these lines shows the same information as is displayed for the +mapping in /proc/PID/maps. The remaining lines show the size of the mapping +(size), the amount of the mapping that is currently resident in RAM (RSS), the +process' proportional share of this mapping (PSS), the number of clean and +dirty shared pages in the mapping, and the number of clean and dirty private +pages in the mapping. The "Referenced" indicates the amount of memory +currently marked as referenced or accessed. This file is only present if the CONFIG_MMU kernel configuration option is enabled. diff --git a/Documentation/hwmon/ltc4261 b/Documentation/hwmon/ltc4261 new file mode 100644 index 000000000000..eba2e2c4b94d --- /dev/null +++ b/Documentation/hwmon/ltc4261 @@ -0,0 +1,63 @@ +Kernel driver ltc4261 +===================== + +Supported chips: + * Linear Technology LTC4261 + Prefix: 'ltc4261' + Addresses scanned: - + Datasheet: + http://cds.linear.com/docs/Datasheet/42612fb.pdf + +Author: Guenter Roeck <guenter.roeck@ericsson.com> + + +Description +----------- + +The LTC4261/LTC4261-2 negative voltage Hot Swap controllers allow a board +to be safely inserted and removed from a live backplane. + + +Usage Notes +----------- + +This driver does not probe for LTC4261 devices, since there is no register +which can be safely used to identify the chip. You will have to instantiate +the devices explicitly. + +Example: the following will load the driver for an LTC4261 at address 0x10 +on I2C bus #1: +$ modprobe ltc4261 +$ echo ltc4261 0x10 > /sys/bus/i2c/devices/i2c-1/new_device + + +Sysfs entries +------------- + +Voltage readings provided by this driver are reported as obtained from the ADC +registers. If a set of voltage divider resistors is installed, calculate the +real voltage by multiplying the reported value with (R1+R2)/R2, where R1 is the +value of the divider resistor against the measured voltage and R2 is the value +of the divider resistor against Ground. + +Current reading provided by this driver is reported as obtained from the ADC +Current Sense register. The reported value assumes that a 1 mOhm sense resistor +is installed. If a different sense resistor is installed, calculate the real +current by dividing the reported value by the sense resistor value in mOhm. + +The chip has two voltage sensors, but only one set of voltage alarm status bits. +In many many designs, those alarms are associated with the ADIN2 sensor, due to +the proximity of the ADIN2 pin to the OV pin. ADIN2 is, however, not available +on all chip variants. To ensure that the alarm condition is reported to the user, +report it with both voltage sensors. + +in1_input ADIN2 voltage (mV) +in1_min_alarm ADIN/ADIN2 Undervoltage alarm +in1_max_alarm ADIN/ADIN2 Overvoltage alarm + +in2_input ADIN voltage (mV) +in2_min_alarm ADIN/ADIN2 Undervoltage alarm +in2_max_alarm ADIN/ADIN2 Overvoltage alarm + +curr1_input SENSE current (mA) +curr1_alarm SENSE overcurrent alarm diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index b660085dcc69..4bc2f3c3da5b 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1541,12 +1541,15 @@ and is between 256 and 4096 characters. It is defined in the file 1 to enable accounting Default value is 0. - nfsaddrs= [NFS] + nfsaddrs= [NFS] Deprecated. Use ip= instead. See Documentation/filesystems/nfs/nfsroot.txt. nfsroot= [NFS] nfs root filesystem for disk-less boxes. See Documentation/filesystems/nfs/nfsroot.txt. + nfsrootdebug [NFS] enable nfsroot debugging messages. + See Documentation/filesystems/nfs/nfsroot.txt. + nfs.callback_tcpport= [NFS] set the TCP port on which the NFSv4 callback channel should listen. diff --git a/Documentation/misc-devices/apds990x.txt b/Documentation/misc-devices/apds990x.txt new file mode 100644 index 000000000000..d5408cade32f --- /dev/null +++ b/Documentation/misc-devices/apds990x.txt @@ -0,0 +1,111 @@ +Kernel driver apds990x +====================== + +Supported chips: +Avago APDS990X + +Data sheet: +Not freely available + +Author: +Samu Onkalo <samu.p.onkalo@nokia.com> + +Description +----------- + +APDS990x is a combined ambient light and proximity sensor. ALS and proximity +functionality are highly connected. ALS measurement path must be running +while the proximity functionality is enabled. + +ALS produces raw measurement values for two channels: Clear channel +(infrared + visible light) and IR only. However, threshold comparisons happen +using clear channel only. Lux value and the threshold level on the HW +might vary quite much depending the spectrum of the light source. + +Driver makes necessary conversions to both directions so that user handles +only lux values. Lux value is calculated using information from the both +channels. HW threshold level is calculated from the given lux value to match +with current type of the lightning. Sometimes inaccuracy of the estimations +lead to false interrupt, but that doesn't harm. + +ALS contains 4 different gain steps. Driver automatically +selects suitable gain step. After each measurement, reliability of the results +is estimated and new measurement is trigged if necessary. + +Platform data can provide tuned values to the conversion formulas if +values are known. Otherwise plain sensor default values are used. + +Proximity side is little bit simpler. There is no need for complex conversions. +It produces directly usable values. + +Driver controls chip operational state using pm_runtime framework. +Voltage regulators are controlled based on chip operational state. + +SYSFS +----- + + +chip_id + RO - shows detected chip type and version + +power_state + RW - enable / disable chip. Uses counting logic + 1 enables the chip + 0 disables the chip +lux0_input + RO - measured lux value + sysfs_notify called when threshold interrupt occurs + +lux0_sensor_range + RO - lux0_input max value. Actually never reaches since sensor tends + to saturate much before that. Real max value varies depending + on the light spectrum etc. + +lux0_rate + RW - measurement rate in Hz + +lux0_rate_avail + RO - supported measurement rates + +lux0_calibscale + RW - calibration value. Set to neutral value by default. + Output results are multiplied with calibscale / calibscale_default + value. + +lux0_calibscale_default + RO - neutral calibration value + +lux0_thresh_above_value + RW - HI level threshold value. All results above the value + trigs an interrupt. 65535 (i.e. sensor_range) disables the above + interrupt. + +lux0_thresh_below_value + RW - LO level threshold value. All results below the value + trigs an interrupt. 0 disables the below interrupt. + +prox0_raw + RO - measured proximity value + sysfs_notify called when threshold interrupt occurs + +prox0_sensor_range + RO - prox0_raw max value (1023) + +prox0_raw_en + RW - enable / disable proximity - uses counting logic + 1 enables the proximity + 0 disables the proximity + +prox0_reporting_mode + RW - trigger / periodic. In "trigger" mode the driver tells two possible + values: 0 or prox0_sensor_range value. 0 means no proximity, + 1023 means proximity. This causes minimal number of interrupts. + In "periodic" mode the driver reports all values above + prox0_thresh_above. This causes more interrupts, but it can give + _rough_ estimate about the distance. + +prox0_reporting_mode_avail + RO - accepted values to prox0_reporting_mode (trigger, periodic) + +prox0_thresh_above_value + RW - threshold level which trigs proximity events. diff --git a/Documentation/misc-devices/bh1770glc.txt b/Documentation/misc-devices/bh1770glc.txt new file mode 100644 index 000000000000..7d64c014dc70 --- /dev/null +++ b/Documentation/misc-devices/bh1770glc.txt @@ -0,0 +1,116 @@ +Kernel driver bh1770glc +======================= + +Supported chips: +ROHM BH1770GLC +OSRAM SFH7770 + +Data sheet: +Not freely available + +Author: +Samu Onkalo <samu.p.onkalo@nokia.com> + +Description +----------- +BH1770GLC and SFH7770 are combined ambient light and proximity sensors. +ALS and proximity parts operates on their own, but they shares common I2C +interface and interrupt logic. In principle they can run on their own, +but ALS side results are used to estimate reliability of the proximity sensor. + +ALS produces 16 bit lux values. The chip contains interrupt logic to produce +low and high threshold interrupts. + +Proximity part contains IR-led driver up to 3 IR leds. The chip measures +amount of reflected IR light and produces proximity result. Resolution is +8 bit. Driver supports only one channel. Driver uses ALS results to estimate +reliability of the proximity results. Thus ALS is always running while +proximity detection is needed. + +Driver uses threshold interrupts to avoid need for polling the values. +Proximity low interrupt doesn't exists in the chip. This is simulated +by using a delayed work. As long as there is proximity threshold above +interrupts the delayed work is pushed forward. So, when proximity level goes +below the threshold value, there is no interrupt and the delayed work will +finally run. This is handled as no proximity indication. + +Chip state is controlled via runtime pm framework when enabled in config. + +Calibscale factor is used to hide differences between the chips. By default +value set to neutral state meaning factor of 1.00. To get proper values, +calibrated source of light is needed as a reference. Calibscale factor is set +so that measurement produces about the expected lux value. + +SYSFS +----- + +chip_id + RO - shows detected chip type and version + +power_state + RW - enable / disable chip. Uses counting logic + 1 enables the chip + 0 disables the chip + +lux0_input + RO - measured lux value + sysfs_notify called when threshold interrupt occurs + +lux0_sensor_range + RO - lux0_input max value + +lux0_rate + RW - measurement rate in Hz + +lux0_rate_avail + RO - supported measurement rates + +lux0_thresh_above_value + RW - HI level threshold value. All results above the value + trigs an interrupt. 65535 (i.e. sensor_range) disables the above + interrupt. + +lux0_thresh_below_value + RW - LO level threshold value. All results below the value + trigs an interrupt. 0 disables the below interrupt. + +lux0_calibscale + RW - calibration value. Set to neutral value by default. + Output results are multiplied with calibscale / calibscale_default + value. + +lux0_calibscale_default + RO - neutral calibration value + +prox0_raw + RO - measured proximity value + sysfs_notify called when threshold interrupt occurs + +prox0_sensor_range + RO - prox0_raw max value + +prox0_raw_en + RW - enable / disable proximity - uses counting logic + 1 enables the proximity + 0 disables the proximity + +prox0_thresh_above_count + RW - number of proximity interrupts needed before triggering the event + +prox0_rate_above + RW - Measurement rate (in Hz) when the level is above threshold + i.e. when proximity on has been reported. + +prox0_rate_below + RW - Measurement rate (in Hz) when the level is below threshold + i.e. when proximity off has been reported. + +prox0_rate_avail + RO - Supported proximity measurement rates in Hz + +prox0_thresh_above0_value + RW - threshold level which trigs proximity events. + Filtered by persistence filter (prox0_thresh_above_count) + +prox0_thresh_above1_value + RW - threshold level which trigs event immediately diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index 5c17196c8fe9..312e3754e8c5 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt @@ -75,7 +75,7 @@ On all - write a character to /proc/sysrq-trigger. e.g.: 'f' - Will call oom_kill to kill a memory hog process. -'g' - Used by kgdb on ppc and sh platforms. +'g' - Used by kgdb (kernel debugger) 'h' - Will display help (actually any other key than those listed here will display help. but 'h' is easy to remember :-) @@ -110,12 +110,15 @@ On all - write a character to /proc/sysrq-trigger. e.g.: 'u' - Will attempt to remount all mounted filesystems read-only. -'v' - Dumps Voyager SMP processor info to your console. +'v' - Forcefully restores framebuffer console +'v' - Causes ETM buffer dump [ARM-specific] 'w' - Dumps tasks that are in uninterruptable (blocked) state. 'x' - Used by xmon interface on ppc/powerpc platforms. +'y' - Show global CPU Registers [SPARC-64 specific] + 'z' - Dump the ftrace buffer '0'-'9' - Sets the console log level, controlling which kernel messages diff --git a/Documentation/timers/hpet_example.c b/Documentation/timers/hpet_example.c index 4bfafb7bc4c5..9a3e7012c190 100644 --- a/Documentation/timers/hpet_example.c +++ b/Documentation/timers/hpet_example.c @@ -97,6 +97,33 @@ hpet_open_close(int argc, const char **argv) void hpet_info(int argc, const char **argv) { + struct hpet_info info; + int fd; + + if (argc != 1) { + fprintf(stderr, "hpet_info: device-name\n"); + return; + } + + fd = open(argv[0], O_RDONLY); + if (fd < 0) { + fprintf(stderr, "hpet_info: open of %s failed\n", argv[0]); + return; + } + + if (ioctl(fd, HPET_INFO, &info) < 0) { + fprintf(stderr, "hpet_info: failed to get info\n"); + goto out; + } + + fprintf(stderr, "hpet_info: hi_irqfreq 0x%lx hi_flags 0x%lx ", + info.hi_ireqfreq, info.hi_flags); + fprintf(stderr, "hi_hpet %d hi_timer %d\n", + info.hi_hpet, info.hi_timer); + +out: + close(fd); + return; } void diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl index 1b55146d1c8d..b3e73ddb1567 100644 --- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl +++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl @@ -46,7 +46,7 @@ use constant HIGH_KSWAPD_LATENCY => 20; use constant HIGH_KSWAPD_REWAKEUP => 21; use constant HIGH_NR_SCANNED => 22; use constant HIGH_NR_TAKEN => 23; -use constant HIGH_NR_RECLAIM => 24; +use constant HIGH_NR_RECLAIMED => 24; use constant HIGH_NR_CONTIG_DIRTY => 25; my %perprocesspid; @@ -58,11 +58,13 @@ my $opt_read_procstat; my $total_wakeup_kswapd; my ($total_direct_reclaim, $total_direct_nr_scanned); my ($total_direct_latency, $total_kswapd_latency); +my ($total_direct_nr_reclaimed); my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async); my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async); my ($total_kswapd_nr_scanned, $total_kswapd_wake); my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async); my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async); +my ($total_kswapd_nr_reclaimed); # Catch sigint and exit on request my $sigint_report = 0; @@ -104,7 +106,7 @@ my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)'; my $regex_kswapd_sleep_default = 'nid=([0-9]*)'; my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)'; my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)'; -my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)'; +my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)'; my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)'; my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)'; @@ -203,8 +205,8 @@ $regex_lru_shrink_inactive = generate_traceevent_regex( "vmscan/mm_vmscan_lru_shrink_inactive", $regex_lru_shrink_inactive_default, "nid", "zid", - "lru", - "nr_scanned", "nr_reclaimed", "priority"); + "nr_scanned", "nr_reclaimed", "priority", + "flags"); $regex_lru_shrink_active = generate_traceevent_regex( "vmscan/mm_vmscan_lru_shrink_active", $regex_lru_shrink_active_default, @@ -375,6 +377,16 @@ EVENT_PROCESS: my $nr_contig_dirty = $7; $perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned; $perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty; + } elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") { + $details = $5; + if ($details !~ /$regex_lru_shrink_inactive/o) { + print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n"; + print " $details\n"; + print " $regex_lru_shrink_inactive/o\n"; + next; + } + my $nr_reclaimed = $4; + $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed; } elsif ($tracepoint eq "mm_vmscan_writepage") { $details = $5; if ($details !~ /$regex_writepage/o) { @@ -464,8 +476,8 @@ sub dump_stats { # Print out process activity printf("\n"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Time"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Sync-IO", "ASync-IO", "Stalled"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Pages", "Time"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Rclmed", "Sync-IO", "ASync-IO", "Stalled"); foreach $process_pid (keys %stats) { if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { @@ -475,6 +487,7 @@ sub dump_stats { $total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; $total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; $total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; + $total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED}; $total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; $total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; $total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; @@ -489,11 +502,12 @@ sub dump_stats { $index++; } - printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8.3f", + printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8u %8.3f", $process_pid, $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}, $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}, $stats{$process_pid}->{HIGH_NR_SCANNED}, + $stats{$process_pid}->{HIGH_NR_RECLAIMED}, $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}, $this_reclaim_delay / 1000); @@ -529,8 +543,8 @@ sub dump_stats { # Print out kswapd activity printf("\n"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages", "Pages"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed", "Sync-IO", "ASync-IO"); foreach $process_pid (keys %stats) { if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { @@ -539,16 +553,18 @@ sub dump_stats { $total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; $total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; + $total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED}; $total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; $total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; $total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; $total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; - printf("%-" . $max_strlen . "s %8d %10d %8u %8i %8u", + printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8i %8u", $process_pid, $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}, $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}, $stats{$process_pid}->{HIGH_NR_SCANNED}, + $stats{$process_pid}->{HIGH_NR_RECLAIMED}, $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}); @@ -579,6 +595,7 @@ sub dump_stats { print "\nSummary\n"; print "Direct reclaims: $total_direct_reclaim\n"; print "Direct reclaim pages scanned: $total_direct_nr_scanned\n"; + print "Direct reclaim pages reclaimed: $total_direct_nr_reclaimed\n"; print "Direct reclaim write file sync I/O: $total_direct_writepage_file_sync\n"; print "Direct reclaim write anon sync I/O: $total_direct_writepage_anon_sync\n"; print "Direct reclaim write file async I/O: $total_direct_writepage_file_async\n"; @@ -588,6 +605,7 @@ sub dump_stats { print "\n"; print "Kswapd wakeups: $total_kswapd_wake\n"; print "Kswapd pages scanned: $total_kswapd_nr_scanned\n"; + print "Kswapd pages reclaimed: $total_kswapd_nr_reclaimed\n"; print "Kswapd reclaim write file sync I/O: $total_kswapd_writepage_file_sync\n"; print "Kswapd reclaim write anon sync I/O: $total_kswapd_writepage_anon_sync\n"; print "Kswapd reclaim write file async I/O: $total_kswapd_writepage_file_async\n"; @@ -612,6 +630,7 @@ sub aggregate_perprocesspid() { $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; $perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}; $perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED}; + $perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED}; $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.txt new file mode 100644 index 000000000000..4324d24ffacd --- /dev/null +++ b/Documentation/vm/highmem.txt @@ -0,0 +1,162 @@ + + ==================== + HIGH MEMORY HANDLING + ==================== + +By: Peter Zijlstra <a.p.zijlstra@chello.nl> + +Contents: + + (*) What is high memory? + + (*) Temporary virtual mappings. + + (*) Using kmap_atomic. + + (*) Cost of temporary mappings. + + (*) i386 PAE. + + +==================== +WHAT IS HIGH MEMORY? +==================== + +High memory (highmem) is used when the size of physical memory approaches or +exceeds the maximum size of virtual memory. At that point it becomes +impossible for the kernel to keep all of the available physical memory mapped +at all times. This means the kernel needs to start using temporary mappings of +the pieces of physical memory that it wants to access. + +The part of (physical) memory not covered by a permanent mapping is what we +refer to as 'highmem'. There are various architecture dependent constraints on +where exactly that border lies. + +In the i386 arch, for example, we choose to map the kernel into every process's +VM space so that we don't have to pay the full TLB invalidation costs for +kernel entry/exit. This means the available virtual memory space (4GiB on +i386) has to be divided between user and kernel space. + +The traditional split for architectures using this approach is 3:1, 3GiB for +userspace and the top 1GiB for kernel space: + + +--------+ 0xffffffff + | Kernel | + +--------+ 0xc0000000 + | | + | User | + | | + +--------+ 0x00000000 + +This means that the kernel can at most map 1GiB of physical memory at any one +time, but because we need virtual address space for other things - including +temporary maps to access the rest of the physical memory - the actual direct +map will typically be less (usually around ~896MiB). + +Other architectures that have mm context tagged TLBs can have separate kernel +and user maps. Some hardware (like some ARMs), however, have limited virtual +space when they use mm context tags. + + +========================== +TEMPORARY VIRTUAL MAPPINGS +========================== + +The kernel contains several ways of creating temporary mappings: + + (*) vmap(). This can be used to make a long duration mapping of multiple + physical pages into a contiguous virtual space. It needs global + synchronization to unmap. + + (*) kmap(). This permits a short duration mapping of a single page. It needs + global synchronization, but is amortized somewhat. It is also prone to + deadlocks when using in a nested fashion, and so it is not recommended for + new code. + + (*) kmap_atomic(). This permits a very short duration mapping of a single + page. Since the mapping is restricted to the CPU that issued it, it + performs well, but the issuing task is therefore required to stay on that + CPU until it has finished, lest some other task displace its mappings. + + kmap_atomic() may also be used by interrupt contexts, since it is does not + sleep and the caller may not sleep until after kunmap_atomic() is called. + + It may be assumed that k[un]map_atomic() won't fail. + + +================= +USING KMAP_ATOMIC +================= + +When and where to use kmap_atomic() is straightforward. It is used when code +wants to access the contents of a page that might be allocated from high memory +(see __GFP_HIGHMEM), for example a page in the pagecache. The API has two +functions, and they can be used in a manner similar to the following: + + /* Find the page of interest. */ + struct page *page = find_get_page(mapping, offset); + + /* Gain access to the contents of that page. */ + void *vaddr = kmap_atomic(page); + + /* Do something to the contents of that page. */ + memset(vaddr, 0, PAGE_SIZE); + + /* Unmap that page. */ + kunmap_atomic(vaddr); + +Note that the kunmap_atomic() call takes the result of the kmap_atomic() call +not the argument. + +If you need to map two pages because you want to copy from one page to +another you need to keep the kmap_atomic calls strictly nested, like: + + vaddr1 = kmap_atomic(page1); + vaddr2 = kmap_atomic(page2); + + memcpy(vaddr1, vaddr2, PAGE_SIZE); + + kunmap_atomic(vaddr2); + kunmap_atomic(vaddr1); + + +========================== +COST OF TEMPORARY MAPPINGS +========================== + +The cost of creating temporary mappings can be quite high. The arch has to +manipulate the kernel's page tables, the data TLB and/or the MMU's registers. + +If CONFIG_HIGHMEM is not set, then the kernel will try and create a mapping +simply with a bit of arithmetic that will convert the page struct address into +a pointer to the page contents rather than juggling mappings about. In such a +case, the unmap operation may be a null operation. + +If CONFIG_MMU is not set, then there can be no temporary mappings and no +highmem. In such a case, the arithmetic approach will also be used. + + +======== +i386 PAE +======== + +The i386 arch, under some circumstances, will permit you to stick up to 64GiB +of RAM into your 32-bit machine. This has a number of consequences: + + (*) Linux needs a page-frame structure for each page in the system and the + pageframes need to live in the permanent mapping, which means: + + (*) you can have 896M/sizeof(struct page) page-frames at most; with struct + page being 32-bytes that would end up being something in the order of 112G + worth of pages; the kernel, however, needs to store more than just + page-frames in that memory... + + (*) PAE makes your page tables larger - which slows the system down as more + data has to be accessed to traverse in TLB fills and the like. One + advantage is that PAE has more PTE bits and can provide advanced features + like NX and PAT. + +The general recommendation is that you don't use more than 8GiB on a 32-bit +machine - although more might work for you and your workload, you're pretty +much on your own - don't expect kernel developers to really care much if things +come apart. |