diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2010-10-27 04:15:20 +0400 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2010-10-27 04:15:20 +0400 |
commit | 31453a9764f7e2a72a6e2c502ace586e2663a68c (patch) | |
tree | 5d4db63de5b4b85d1ffdab4e95a75175a784a10a /Documentation | |
parent | f9ba5375a8aae4aeea6be15df77e24707a429812 (diff) | |
parent | 93ed0e2d07b25aff4db1d61bfbcd1e82074c0ad5 (diff) | |
download | linux-31453a9764f7e2a72a6e2c502ace586e2663a68c.tar.xz |
Merge branch 'akpm-incoming-1'
* akpm-incoming-1: (176 commits)
scripts/checkpatch.pl: add check for declaration of pci_device_id
scripts/checkpatch.pl: add warnings for static char that could be static const char
checkpatch: version 0.31
checkpatch: statement/block context analyser should look at sanitised lines
checkpatch: handle EXPORT_SYMBOL for DEVICE_ATTR and similar
checkpatch: clean up structure definition macro handline
checkpatch: update copyright dates
checkpatch: Add additional attribute #defines
checkpatch: check for incorrect permissions
checkpatch: ensure kconfig help checks only apply when we are adding help
checkpatch: simplify and consolidate "missing space after" checks
checkpatch: add check for space after struct, union, and enum
checkpatch: returning errno typically should be negative
checkpatch: handle casts better fixing false categorisation of : as binary
checkpatch: ensure we do not collapse bracketed sections into constants
checkpatch: suggest cleanpatch and cleanfile when appropriate
checkpatch: types may sit on a line on their own
checkpatch: fix regressions in "fix handling of leading spaces"
div64_u64(): improve precision on 32bit platforms
lib/parser: cleanup match_number()
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/filesystems/proc.txt | 14 | ||||
-rw-r--r-- | Documentation/misc-devices/apds990x.txt | 111 | ||||
-rw-r--r-- | Documentation/misc-devices/bh1770glc.txt | 116 | ||||
-rw-r--r-- | Documentation/timers/hpet_example.c | 27 | ||||
-rw-r--r-- | Documentation/trace/postprocess/trace-vmscan-postprocess.pl | 39 | ||||
-rw-r--r-- | Documentation/vm/highmem.txt | 162 |
6 files changed, 452 insertions, 17 deletions
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index a6aca8740883..a563b74c7aef 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -374,13 +374,13 @@ Swap: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB -The first of these lines shows the same information as is displayed for the -mapping in /proc/PID/maps. The remaining lines show the size of the mapping, -the amount of the mapping that is currently resident in RAM, the "proportional -set size” (divide each shared page by the number of processes sharing it), the -number of clean and dirty shared pages in the mapping, and the number of clean -and dirty private pages in the mapping. The "Referenced" indicates the amount -of memory currently marked as referenced or accessed. +The first of these lines shows the same information as is displayed for the +mapping in /proc/PID/maps. The remaining lines show the size of the mapping +(size), the amount of the mapping that is currently resident in RAM (RSS), the +process' proportional share of this mapping (PSS), the number of clean and +dirty shared pages in the mapping, and the number of clean and dirty private +pages in the mapping. The "Referenced" indicates the amount of memory +currently marked as referenced or accessed. This file is only present if the CONFIG_MMU kernel configuration option is enabled. diff --git a/Documentation/misc-devices/apds990x.txt b/Documentation/misc-devices/apds990x.txt new file mode 100644 index 000000000000..d5408cade32f --- /dev/null +++ b/Documentation/misc-devices/apds990x.txt @@ -0,0 +1,111 @@ +Kernel driver apds990x +====================== + +Supported chips: +Avago APDS990X + +Data sheet: +Not freely available + +Author: +Samu Onkalo <samu.p.onkalo@nokia.com> + +Description +----------- + +APDS990x is a combined ambient light and proximity sensor. ALS and proximity +functionality are highly connected. ALS measurement path must be running +while the proximity functionality is enabled. + +ALS produces raw measurement values for two channels: Clear channel +(infrared + visible light) and IR only. However, threshold comparisons happen +using clear channel only. Lux value and the threshold level on the HW +might vary quite much depending the spectrum of the light source. + +Driver makes necessary conversions to both directions so that user handles +only lux values. Lux value is calculated using information from the both +channels. HW threshold level is calculated from the given lux value to match +with current type of the lightning. Sometimes inaccuracy of the estimations +lead to false interrupt, but that doesn't harm. + +ALS contains 4 different gain steps. Driver automatically +selects suitable gain step. After each measurement, reliability of the results +is estimated and new measurement is trigged if necessary. + +Platform data can provide tuned values to the conversion formulas if +values are known. Otherwise plain sensor default values are used. + +Proximity side is little bit simpler. There is no need for complex conversions. +It produces directly usable values. + +Driver controls chip operational state using pm_runtime framework. +Voltage regulators are controlled based on chip operational state. + +SYSFS +----- + + +chip_id + RO - shows detected chip type and version + +power_state + RW - enable / disable chip. Uses counting logic + 1 enables the chip + 0 disables the chip +lux0_input + RO - measured lux value + sysfs_notify called when threshold interrupt occurs + +lux0_sensor_range + RO - lux0_input max value. Actually never reaches since sensor tends + to saturate much before that. Real max value varies depending + on the light spectrum etc. + +lux0_rate + RW - measurement rate in Hz + +lux0_rate_avail + RO - supported measurement rates + +lux0_calibscale + RW - calibration value. Set to neutral value by default. + Output results are multiplied with calibscale / calibscale_default + value. + +lux0_calibscale_default + RO - neutral calibration value + +lux0_thresh_above_value + RW - HI level threshold value. All results above the value + trigs an interrupt. 65535 (i.e. sensor_range) disables the above + interrupt. + +lux0_thresh_below_value + RW - LO level threshold value. All results below the value + trigs an interrupt. 0 disables the below interrupt. + +prox0_raw + RO - measured proximity value + sysfs_notify called when threshold interrupt occurs + +prox0_sensor_range + RO - prox0_raw max value (1023) + +prox0_raw_en + RW - enable / disable proximity - uses counting logic + 1 enables the proximity + 0 disables the proximity + +prox0_reporting_mode + RW - trigger / periodic. In "trigger" mode the driver tells two possible + values: 0 or prox0_sensor_range value. 0 means no proximity, + 1023 means proximity. This causes minimal number of interrupts. + In "periodic" mode the driver reports all values above + prox0_thresh_above. This causes more interrupts, but it can give + _rough_ estimate about the distance. + +prox0_reporting_mode_avail + RO - accepted values to prox0_reporting_mode (trigger, periodic) + +prox0_thresh_above_value + RW - threshold level which trigs proximity events. diff --git a/Documentation/misc-devices/bh1770glc.txt b/Documentation/misc-devices/bh1770glc.txt new file mode 100644 index 000000000000..7d64c014dc70 --- /dev/null +++ b/Documentation/misc-devices/bh1770glc.txt @@ -0,0 +1,116 @@ +Kernel driver bh1770glc +======================= + +Supported chips: +ROHM BH1770GLC +OSRAM SFH7770 + +Data sheet: +Not freely available + +Author: +Samu Onkalo <samu.p.onkalo@nokia.com> + +Description +----------- +BH1770GLC and SFH7770 are combined ambient light and proximity sensors. +ALS and proximity parts operates on their own, but they shares common I2C +interface and interrupt logic. In principle they can run on their own, +but ALS side results are used to estimate reliability of the proximity sensor. + +ALS produces 16 bit lux values. The chip contains interrupt logic to produce +low and high threshold interrupts. + +Proximity part contains IR-led driver up to 3 IR leds. The chip measures +amount of reflected IR light and produces proximity result. Resolution is +8 bit. Driver supports only one channel. Driver uses ALS results to estimate +reliability of the proximity results. Thus ALS is always running while +proximity detection is needed. + +Driver uses threshold interrupts to avoid need for polling the values. +Proximity low interrupt doesn't exists in the chip. This is simulated +by using a delayed work. As long as there is proximity threshold above +interrupts the delayed work is pushed forward. So, when proximity level goes +below the threshold value, there is no interrupt and the delayed work will +finally run. This is handled as no proximity indication. + +Chip state is controlled via runtime pm framework when enabled in config. + +Calibscale factor is used to hide differences between the chips. By default +value set to neutral state meaning factor of 1.00. To get proper values, +calibrated source of light is needed as a reference. Calibscale factor is set +so that measurement produces about the expected lux value. + +SYSFS +----- + +chip_id + RO - shows detected chip type and version + +power_state + RW - enable / disable chip. Uses counting logic + 1 enables the chip + 0 disables the chip + +lux0_input + RO - measured lux value + sysfs_notify called when threshold interrupt occurs + +lux0_sensor_range + RO - lux0_input max value + +lux0_rate + RW - measurement rate in Hz + +lux0_rate_avail + RO - supported measurement rates + +lux0_thresh_above_value + RW - HI level threshold value. All results above the value + trigs an interrupt. 65535 (i.e. sensor_range) disables the above + interrupt. + +lux0_thresh_below_value + RW - LO level threshold value. All results below the value + trigs an interrupt. 0 disables the below interrupt. + +lux0_calibscale + RW - calibration value. Set to neutral value by default. + Output results are multiplied with calibscale / calibscale_default + value. + +lux0_calibscale_default + RO - neutral calibration value + +prox0_raw + RO - measured proximity value + sysfs_notify called when threshold interrupt occurs + +prox0_sensor_range + RO - prox0_raw max value + +prox0_raw_en + RW - enable / disable proximity - uses counting logic + 1 enables the proximity + 0 disables the proximity + +prox0_thresh_above_count + RW - number of proximity interrupts needed before triggering the event + +prox0_rate_above + RW - Measurement rate (in Hz) when the level is above threshold + i.e. when proximity on has been reported. + +prox0_rate_below + RW - Measurement rate (in Hz) when the level is below threshold + i.e. when proximity off has been reported. + +prox0_rate_avail + RO - Supported proximity measurement rates in Hz + +prox0_thresh_above0_value + RW - threshold level which trigs proximity events. + Filtered by persistence filter (prox0_thresh_above_count) + +prox0_thresh_above1_value + RW - threshold level which trigs event immediately diff --git a/Documentation/timers/hpet_example.c b/Documentation/timers/hpet_example.c index 4bfafb7bc4c5..9a3e7012c190 100644 --- a/Documentation/timers/hpet_example.c +++ b/Documentation/timers/hpet_example.c @@ -97,6 +97,33 @@ hpet_open_close(int argc, const char **argv) void hpet_info(int argc, const char **argv) { + struct hpet_info info; + int fd; + + if (argc != 1) { + fprintf(stderr, "hpet_info: device-name\n"); + return; + } + + fd = open(argv[0], O_RDONLY); + if (fd < 0) { + fprintf(stderr, "hpet_info: open of %s failed\n", argv[0]); + return; + } + + if (ioctl(fd, HPET_INFO, &info) < 0) { + fprintf(stderr, "hpet_info: failed to get info\n"); + goto out; + } + + fprintf(stderr, "hpet_info: hi_irqfreq 0x%lx hi_flags 0x%lx ", + info.hi_ireqfreq, info.hi_flags); + fprintf(stderr, "hi_hpet %d hi_timer %d\n", + info.hi_hpet, info.hi_timer); + +out: + close(fd); + return; } void diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl index 1b55146d1c8d..b3e73ddb1567 100644 --- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl +++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl @@ -46,7 +46,7 @@ use constant HIGH_KSWAPD_LATENCY => 20; use constant HIGH_KSWAPD_REWAKEUP => 21; use constant HIGH_NR_SCANNED => 22; use constant HIGH_NR_TAKEN => 23; -use constant HIGH_NR_RECLAIM => 24; +use constant HIGH_NR_RECLAIMED => 24; use constant HIGH_NR_CONTIG_DIRTY => 25; my %perprocesspid; @@ -58,11 +58,13 @@ my $opt_read_procstat; my $total_wakeup_kswapd; my ($total_direct_reclaim, $total_direct_nr_scanned); my ($total_direct_latency, $total_kswapd_latency); +my ($total_direct_nr_reclaimed); my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async); my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async); my ($total_kswapd_nr_scanned, $total_kswapd_wake); my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async); my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async); +my ($total_kswapd_nr_reclaimed); # Catch sigint and exit on request my $sigint_report = 0; @@ -104,7 +106,7 @@ my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)'; my $regex_kswapd_sleep_default = 'nid=([0-9]*)'; my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)'; my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)'; -my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)'; +my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)'; my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)'; my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)'; @@ -203,8 +205,8 @@ $regex_lru_shrink_inactive = generate_traceevent_regex( "vmscan/mm_vmscan_lru_shrink_inactive", $regex_lru_shrink_inactive_default, "nid", "zid", - "lru", - "nr_scanned", "nr_reclaimed", "priority"); + "nr_scanned", "nr_reclaimed", "priority", + "flags"); $regex_lru_shrink_active = generate_traceevent_regex( "vmscan/mm_vmscan_lru_shrink_active", $regex_lru_shrink_active_default, @@ -375,6 +377,16 @@ EVENT_PROCESS: my $nr_contig_dirty = $7; $perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned; $perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty; + } elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") { + $details = $5; + if ($details !~ /$regex_lru_shrink_inactive/o) { + print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n"; + print " $details\n"; + print " $regex_lru_shrink_inactive/o\n"; + next; + } + my $nr_reclaimed = $4; + $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed; } elsif ($tracepoint eq "mm_vmscan_writepage") { $details = $5; if ($details !~ /$regex_writepage/o) { @@ -464,8 +476,8 @@ sub dump_stats { # Print out process activity printf("\n"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Time"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Sync-IO", "ASync-IO", "Stalled"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Pages", "Time"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Rclmed", "Sync-IO", "ASync-IO", "Stalled"); foreach $process_pid (keys %stats) { if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { @@ -475,6 +487,7 @@ sub dump_stats { $total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; $total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; $total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; + $total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED}; $total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; $total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; $total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; @@ -489,11 +502,12 @@ sub dump_stats { $index++; } - printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8.3f", + printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8u %8.3f", $process_pid, $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}, $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}, $stats{$process_pid}->{HIGH_NR_SCANNED}, + $stats{$process_pid}->{HIGH_NR_RECLAIMED}, $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}, $this_reclaim_delay / 1000); @@ -529,8 +543,8 @@ sub dump_stats { # Print out kswapd activity printf("\n"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages", "Pages"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed", "Sync-IO", "ASync-IO"); foreach $process_pid (keys %stats) { if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { @@ -539,16 +553,18 @@ sub dump_stats { $total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; $total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; + $total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED}; $total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; $total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; $total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; $total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; - printf("%-" . $max_strlen . "s %8d %10d %8u %8i %8u", + printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8i %8u", $process_pid, $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}, $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}, $stats{$process_pid}->{HIGH_NR_SCANNED}, + $stats{$process_pid}->{HIGH_NR_RECLAIMED}, $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}); @@ -579,6 +595,7 @@ sub dump_stats { print "\nSummary\n"; print "Direct reclaims: $total_direct_reclaim\n"; print "Direct reclaim pages scanned: $total_direct_nr_scanned\n"; + print "Direct reclaim pages reclaimed: $total_direct_nr_reclaimed\n"; print "Direct reclaim write file sync I/O: $total_direct_writepage_file_sync\n"; print "Direct reclaim write anon sync I/O: $total_direct_writepage_anon_sync\n"; print "Direct reclaim write file async I/O: $total_direct_writepage_file_async\n"; @@ -588,6 +605,7 @@ sub dump_stats { print "\n"; print "Kswapd wakeups: $total_kswapd_wake\n"; print "Kswapd pages scanned: $total_kswapd_nr_scanned\n"; + print "Kswapd pages reclaimed: $total_kswapd_nr_reclaimed\n"; print "Kswapd reclaim write file sync I/O: $total_kswapd_writepage_file_sync\n"; print "Kswapd reclaim write anon sync I/O: $total_kswapd_writepage_anon_sync\n"; print "Kswapd reclaim write file async I/O: $total_kswapd_writepage_file_async\n"; @@ -612,6 +630,7 @@ sub aggregate_perprocesspid() { $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; $perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}; $perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED}; + $perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED}; $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.txt new file mode 100644 index 000000000000..4324d24ffacd --- /dev/null +++ b/Documentation/vm/highmem.txt @@ -0,0 +1,162 @@ + + ==================== + HIGH MEMORY HANDLING + ==================== + +By: Peter Zijlstra <a.p.zijlstra@chello.nl> + +Contents: + + (*) What is high memory? + + (*) Temporary virtual mappings. + + (*) Using kmap_atomic. + + (*) Cost of temporary mappings. + + (*) i386 PAE. + + +==================== +WHAT IS HIGH MEMORY? +==================== + +High memory (highmem) is used when the size of physical memory approaches or +exceeds the maximum size of virtual memory. At that point it becomes +impossible for the kernel to keep all of the available physical memory mapped +at all times. This means the kernel needs to start using temporary mappings of +the pieces of physical memory that it wants to access. + +The part of (physical) memory not covered by a permanent mapping is what we +refer to as 'highmem'. There are various architecture dependent constraints on +where exactly that border lies. + +In the i386 arch, for example, we choose to map the kernel into every process's +VM space so that we don't have to pay the full TLB invalidation costs for +kernel entry/exit. This means the available virtual memory space (4GiB on +i386) has to be divided between user and kernel space. + +The traditional split for architectures using this approach is 3:1, 3GiB for +userspace and the top 1GiB for kernel space: + + +--------+ 0xffffffff + | Kernel | + +--------+ 0xc0000000 + | | + | User | + | | + +--------+ 0x00000000 + +This means that the kernel can at most map 1GiB of physical memory at any one +time, but because we need virtual address space for other things - including +temporary maps to access the rest of the physical memory - the actual direct +map will typically be less (usually around ~896MiB). + +Other architectures that have mm context tagged TLBs can have separate kernel +and user maps. Some hardware (like some ARMs), however, have limited virtual +space when they use mm context tags. + + +========================== +TEMPORARY VIRTUAL MAPPINGS +========================== + +The kernel contains several ways of creating temporary mappings: + + (*) vmap(). This can be used to make a long duration mapping of multiple + physical pages into a contiguous virtual space. It needs global + synchronization to unmap. + + (*) kmap(). This permits a short duration mapping of a single page. It needs + global synchronization, but is amortized somewhat. It is also prone to + deadlocks when using in a nested fashion, and so it is not recommended for + new code. + + (*) kmap_atomic(). This permits a very short duration mapping of a single + page. Since the mapping is restricted to the CPU that issued it, it + performs well, but the issuing task is therefore required to stay on that + CPU until it has finished, lest some other task displace its mappings. + + kmap_atomic() may also be used by interrupt contexts, since it is does not + sleep and the caller may not sleep until after kunmap_atomic() is called. + + It may be assumed that k[un]map_atomic() won't fail. + + +================= +USING KMAP_ATOMIC +================= + +When and where to use kmap_atomic() is straightforward. It is used when code +wants to access the contents of a page that might be allocated from high memory +(see __GFP_HIGHMEM), for example a page in the pagecache. The API has two +functions, and they can be used in a manner similar to the following: + + /* Find the page of interest. */ + struct page *page = find_get_page(mapping, offset); + + /* Gain access to the contents of that page. */ + void *vaddr = kmap_atomic(page); + + /* Do something to the contents of that page. */ + memset(vaddr, 0, PAGE_SIZE); + + /* Unmap that page. */ + kunmap_atomic(vaddr); + +Note that the kunmap_atomic() call takes the result of the kmap_atomic() call +not the argument. + +If you need to map two pages because you want to copy from one page to +another you need to keep the kmap_atomic calls strictly nested, like: + + vaddr1 = kmap_atomic(page1); + vaddr2 = kmap_atomic(page2); + + memcpy(vaddr1, vaddr2, PAGE_SIZE); + + kunmap_atomic(vaddr2); + kunmap_atomic(vaddr1); + + +========================== +COST OF TEMPORARY MAPPINGS +========================== + +The cost of creating temporary mappings can be quite high. The arch has to +manipulate the kernel's page tables, the data TLB and/or the MMU's registers. + +If CONFIG_HIGHMEM is not set, then the kernel will try and create a mapping +simply with a bit of arithmetic that will convert the page struct address into +a pointer to the page contents rather than juggling mappings about. In such a +case, the unmap operation may be a null operation. + +If CONFIG_MMU is not set, then there can be no temporary mappings and no +highmem. In such a case, the arithmetic approach will also be used. + + +======== +i386 PAE +======== + +The i386 arch, under some circumstances, will permit you to stick up to 64GiB +of RAM into your 32-bit machine. This has a number of consequences: + + (*) Linux needs a page-frame structure for each page in the system and the + pageframes need to live in the permanent mapping, which means: + + (*) you can have 896M/sizeof(struct page) page-frames at most; with struct + page being 32-bytes that would end up being something in the order of 112G + worth of pages; the kernel, however, needs to store more than just + page-frames in that memory... + + (*) PAE makes your page tables larger - which slows the system down as more + data has to be accessed to traverse in TLB fills and the like. One + advantage is that PAE has more PTE bits and can provide advanced features + like NX and PAT. + +The general recommendation is that you don't use more than 8GiB on a 32-bit +machine - although more might work for you and your workload, you're pretty +much on your own - don't expect kernel developers to really care much if things +come apart. |