summaryrefslogtreecommitdiff
path: root/include/trace
AgeCommit message (Collapse)AuthorFilesLines
2010-10-27writeback: do not sleep on the congestion queue if there are no congested ↵Mel Gorman1-0/+7
BDIs or if significant congestion is not being encountered in the current zone If congestion_wait() is called with no BDI congested, the caller will sleep for the full timeout and this may be an unnecessary sleep. This patch adds a wait_iff_congested() that checks congestion and only sleeps if a BDI is congested else, it calls cond_resched() to ensure the caller is not hogging the CPU longer than its quota but otherwise will not sleep. This is aimed at reducing some of the major desktop stalls reported during IO. For example, while kswapd is operating, it calls congestion_wait() but it could just have been reclaiming clean page cache pages with no congestion. Without this patch, it would sleep for a full timeout but after this patch, it'll just call schedule() if it has been on the CPU too long. Similar logic applies to direct reclaimers that are not making enough progress. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27vmscan: narrow the scenarios in whcih lumpy reclaim uses synchrounous reclaimKOSAKI Motohiro1-3/+3
shrink_page_list() can decide to give up reclaiming a page under a number of conditions such as 1. trylock_page() failure 2. page is unevictable 3. zone reclaim and page is mapped 4. PageWriteback() is true 5. page is swapbacked and swap is full 6. add_to_swap() failure 7. page is dirty and gfpmask don't have GFP_IO, GFP_FS 8. page is pinned 9. IO queue is congested 10. pageout() start IO, but not finished With lumpy reclaim, failures result in entering synchronous lumpy reclaim but this can be unnecessary. In cases (2), (3), (5), (6), (7) and (8), there is no point retrying. This patch causes lumpy reclaim to abort when it is known it will fail. Case (9) is more interesting. current behavior is, 1. start shrink_page_list(async) 2. found queue_congested() 3. skip pageout write 4. still start shrink_page_list(sync) 5. wait on a lot of pages 6. again, found queue_congested() 7. give up pageout write again So, it's useless time wasting. However, just skipping page reclaim is also notgood as x86 allocating a huge page needs 512 pages for example. It can have more dirty pages than queue congestion threshold (~=128). After this patch, pageout() behaves as follows; - If order > PAGE_ALLOC_COSTLY_ORDER Ignore queue congestion always. - If order <= PAGE_ALLOC_COSTLY_ORDER skip write page and disable lumpy reclaim. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27writeback: account for time spent congestion_waitedMel Gorman1-0/+28
There is strong evidence to indicate a lot of time is being spent in congestion_wait(), some of it unnecessarily. This patch adds a tracepoint for congestion_wait to record when congestion_wait() was called, how long the timeout was for and how long it actually slept. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27tracing, vmscan: add trace events for LRU list shrinkingMel Gorman1-0/+42
There have been numerous reports of stalls that pointed at the problem being somewhere in the VM. There are multiple roots to the problems which means dealing with any of the root problems in isolation is tricky to justify on their own and they would still need integration testing. This patch series puts together two different patch sets which in combination should tackle some of the root causes of latency problems being reported. Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the most important results is being able to calculate the scanning/reclaim ratio as a measure of the amount of work being done by page reclaim. Patch 2 accounts for time spent in congestion_wait. Patches 3-6 were originally developed by Kosaki Motohiro but reworked for this series. It has been noted that lumpy reclaim is far too aggressive and trashes the system somewhat. As SLUB uses high-order allocations, a large cost incurred by lumpy reclaim will be noticeable. It was also reported during transparent hugepage support testing that lumpy reclaim was trashing the system and these patches should mitigate that problem without disabling lumpy reclaim. Patch 7 adds wait_iff_congested() and replaces some callers of congestion_wait(). wait_iff_congested() only sleeps if there is a BDI that is currently congested. Patch 8 notes that any BDI being congested is not necessarily a problem because there could be multiple BDIs of varying speeds and numberous zones. It attempts to track when a zone being reclaimed contains many pages backed by a congested BDI and if so, reclaimers wait on the congestion queue. I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each machine had 3G of RAM and the CPUs were X86: Intel P4 2-core X86-64: AMD Phenom 4-core PPC64: PPC970MP Each used a single disk and the onboard IO controller. Dirty ratio was left at 20. I'm just going to report for X86-64 and PPC64 in a vague attempt to keep this report short. Four kernels were tested each based on v2.6.36-rc4 traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better waitcongest-v2r3: Patches 1-7 to only wait on congestion waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO The tests run were as follows kernbench compile-based benchmark. Smoke test performance sysbench OLTP read-only benchmark. Will be re-run in the future as read-write micro-mapped-file-stream This is a micro-benchmark from Johannes Weiner that accesses a large sparse-file through mmap(). It was configured to run in only single-CPU mode but can be indicative of how well page reclaim identifies suitable pages. stress-highalloc Tries to allocate huge pages under heavy load. kernbench, iozone and sysbench did not report any performance regression on any machine. sysbench did pressure the system lightly and there was reclaim activity but there were no difference of major interest between the kernels. X86-64 micro-mapped-file-stream traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4 pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%) pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%) pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%) pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%) pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%) pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%) pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%) pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%) pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%) allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%) These are based on the raw figures taken from /proc/vmstat. It's a rough measure of reclaim activity. Note that allocstall counts are higher because we are entering direct reclaim more often as a result of not sleeping in congestion. In itself, it's not necessarily a bad thing. It's easier to get a view of what happened from the vmscan tracepoint report. FTrace Reclaim Statistics: vmscan traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4 Direct reclaims 443 273 513 1568 Direct reclaim pages scanned 305968 280402 600825 957933 Direct reclaim pages reclaimed 43503 19005 30327 117191 Direct reclaim write file async I/O 0 0 0 0 Direct reclaim write anon async I/O 0 3 4 12 Direct reclaim write file sync I/O 0 0 0 0 Direct reclaim write anon sync I/O 0 0 0 0 Wake kswapd requests 187649 132338 191695 267701 Kswapd wakeups 3 1 4 1 Kswapd pages scanned 4599269 4454162 4296815 3891906 Kswapd pages reclaimed 2295947 2428434 2399818 2319706 Kswapd reclaim write file async I/O 1 0 1 1 Kswapd reclaim write anon async I/O 59 187 41 222 Kswapd reclaim write file sync I/O 0 0 0 0 Kswapd reclaim write anon sync I/O 0 0 0 0 Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96 Time kswapd awake (seconds) 11.15 10.25 11.01 10.19 Total pages scanned 4905237 4734564 4897640 4849839 Total pages reclaimed 2339450 2447439 2430145 2436897 %age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25% %age total pages scanned/written 0.00% 0.00% 0.00% 0.00% %age file pages scanned/written 0.00% 0.00% 0.00% 0.00% Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25% Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86% What is interesting here for nocongest in particular is that while direct reclaim scans more pages, the overall number of pages scanned remains the same and the ratio of pages scanned to pages reclaimed is more or less the same. In other words, while we are sleeping less, reclaim is not doing more work and as direct reclaim and kswapd is awake for less time, it would appear to be doing less work. FTrace Reclaim Statistics: congestion_wait Direct number congest waited 87 196 64 0 Direct time congest waited 4604ms 4732ms 5420ms 0ms Direct full congest waited 72 145 53 0 Direct number conditional waited 0 0 324 1315 Direct time conditional waited 0ms 0ms 0ms 0ms Direct full conditional waited 0 0 0 0 KSwapd number congest waited 20 10 15 7 KSwapd time congest waited 1264ms 536ms 884ms 284ms KSwapd full congest waited 10 4 6 2 KSwapd number conditional waited 0 0 0 0 KSwapd time conditional waited 0ms 0ms 0ms 0ms KSwapd full conditional waited 0 0 0 0 The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at all asleep with the patches. MMTests Statistics: duration User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66 Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76 Overall, the tests completed faster. It is interesting to note that backing off further when a zone is congested and not just a BDI was more efficient overall. PPC64 micro-mapped-file-stream pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%) pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%) pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%) pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%) pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%) allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%) Similar trends to x86-64. allocstalls are up but it's not necessarily bad. FTrace Reclaim Statistics: vmscan Direct reclaims 977 2709 2098 5136 Direct reclaim pages scanned 629825 963814 1063938 1711935 Direct reclaim pages reclaimed 75550 242538 150904 387647 Direct reclaim write file async I/O 0 0 0 2 Direct reclaim write anon async I/O 0 10 0 4 Direct reclaim write file sync I/O 0 0 0 0 Direct reclaim write anon sync I/O 0 0 0 0 Wake kswapd requests 392119 1201712 571935 571921 Kswapd wakeups 3 2 3 3 Kswapd pages scanned 4601307 4128076 3912317 3377165 Kswapd pages reclaimed 2432523 2318797 2312673 2144616 Kswapd reclaim write file async I/O 20 1 1 1 Kswapd reclaim write anon async I/O 57 132 11 121 Kswapd reclaim write file sync I/O 0 0 0 0 Kswapd reclaim write anon sync I/O 0 0 0 0 Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88 Time kswapd awake (seconds) 21.73 26.51 25.55 23.90 Total pages scanned 5231132 5091890 4976255 5089100 Total pages reclaimed 2508073 2561335 2463577 2532263 %age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76% %age total pages scanned/written 0.00% 0.00% 0.00% 0.00% %age file pages scanned/written 0.00% 0.00% 0.00% 0.00% Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65% Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40% Again, a similar trend that the congestion_wait changes mean that direct reclaim scans more pages but the overall number of pages scanned while slightly reduced, are very similar. The ratio of scanning/reclaimed remains roughly similar. The downside is that kswapd and direct reclaim was awake longer and for a larger percentage of the overall workload. It's possible there were big differences in the amount of time spent reclaiming slab pages between the different kernels which is plausible considering that the micro tests runs after fsmark and sysbench. Trace Reclaim Statistics: congestion_wait Direct number congest waited 845 1312 104 0 Direct time congest waited 19416ms 26560ms 7544ms 0ms Direct full congest waited 745 1105 72 0 Direct number conditional waited 0 0 1322 2935 Direct time conditional waited 0ms 0ms 12ms 312ms Direct full conditional waited 0 0 0 3 KSwapd number congest waited 39 102 75 63 KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms KSwapd full congest waited 20 48 46 25 KSwapd number conditional waited 0 0 0 0 KSwapd time conditional waited 0ms 0ms 0ms 0ms KSwapd full conditional waited 0 0 0 0 The vanilla kernel spent 20 seconds asleep in direct reclaim and only 312ms asleep with the patches. The time kswapd spent congest waited was also reduced by a large factor. MMTests Statistics: duration ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47 Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88 With all patches applies, the completion times are very similar. X86-64 STRESS-HIGHALLOC traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4 Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%) Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%) At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%) Success figures across the board are broadly similar. traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4 Direct reclaims 1045 944 886 887 Direct reclaim pages scanned 135091 119604 109382 101019 Direct reclaim pages reclaimed 88599 47535 47863 46671 Direct reclaim write file async I/O 494 283 465 280 Direct reclaim write anon async I/O 29357 13710 16656 13462 Direct reclaim write file sync I/O 154 2 2 3 Direct reclaim write anon sync I/O 14594 571 509 561 Wake kswapd requests 7491 933 872 892 Kswapd wakeups 814 778 731 780 Kswapd pages scanned 7290822 15341158 11916436 13703442 Kswapd pages reclaimed 3587336 3142496 3094392 3187151 Kswapd reclaim write file async I/O 91975 32317 28022 29628 Kswapd reclaim write anon async I/O 1992022 789307 829745 849769 Kswapd reclaim write file sync I/O 0 0 0 0 Kswapd reclaim write anon sync I/O 0 0 0 0 Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07 Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82 Total pages scanned 7425913 15460762 12025818 13804461 Total pages reclaimed 3675935 3190031 3142255 3233822 %age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43% %age total pages scanned/written 28.66% 5.41% 7.28% 6.47% %age file pages scanned/written 1.25% 0.21% 0.24% 0.22% Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99% Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25% Scanned/reclaimed ratios again look good with big improvements in efficiency. The Scanned/written ratios also look much improved. With a better scanned/written ration, there is an expectation that IO would be more efficient and indeed, the time spent in direct reclaim is much reduced by the full series and kswapd spends a little less time awake. Overall, indications here are that allocations were happening much faster and this can be seen with a graph of the latency figures as the allocations were taking place http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps FTrace Reclaim Statistics: congestion_wait Direct number congest waited 1333 204 169 4 Direct time congest waited 78896ms 8288ms 7260ms 200ms Direct full congest waited 756 92 69 2 Direct number conditional waited 0 0 26 186 Direct time conditional waited 0ms 0ms 0ms 2504ms Direct full conditional waited 0 0 0 25 KSwapd number congest waited 4 395 227 282 KSwapd time congest waited 384ms 25136ms 10508ms 18380ms KSwapd full congest waited 3 232 98 176 KSwapd number conditional waited 0 0 0 0 KSwapd time conditional waited 0ms 0ms 0ms 0ms KSwapd full conditional waited 0 0 0 0 KSwapd full conditional waited 318 0 312 9 Overall, the time spent speeping is reduced. kswapd is still hitting congestion_wait() but that is because there are callers remaining where it wasn't clear in advance if they should be changed to wait_iff_congested() or not. Overall the sleep imes are reduced though - from 79ish seconds to about 19. MMTests Statistics: duration User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5 Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39 With the full series, the time to complete the tests are reduced by 30% PPC64 STRESS-HIGHALLOC traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4 Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%) Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%) At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%) Success rates there are *way* up particularly considering that the 16MB huge pages on PPC64 mean that it's always much harder to allocate them. FTrace Reclaim Statistics: vmscan stress-highalloc stress-highalloc stress-highalloc stress-highalloc traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4 Direct reclaims 499 505 564 509 Direct reclaim pages scanned 223478 41898 51818 45605 Direct reclaim pages reclaimed 137730 21148 27161 23455 Direct reclaim write file async I/O 399 136 162 136 Direct reclaim write anon async I/O 46977 2865 4686 3998 Direct reclaim write file sync I/O 29 0 1 3 Direct reclaim write anon sync I/O 31023 159 237 239 Wake kswapd requests 420 351 360 326 Kswapd wakeups 185 294 249 277 Kswapd pages scanned 15703488 16392500 17821724 17598737 Kswapd pages reclaimed 5808466 2908858 3139386 3145435 Kswapd reclaim write file async I/O 159938 18400 18717 13473 Kswapd reclaim write anon async I/O 3467554 228957 322799 234278 Kswapd reclaim write file sync I/O 0 0 0 0 Kswapd reclaim write anon sync I/O 0 0 0 0 Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23 Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88 Total pages scanned 15926966 16434398 17873542 17644342 Total pages reclaimed 5946196 2930006 3166547 3168890 %age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96% %age total pages scanned/written 23.27% 1.52% 1.94% 1.43% %age file pages scanned/written 1.01% 0.11% 0.11% 0.08% Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91% Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14% While the scanning rates are slightly up, the scanned/reclaimed and scanned/written figures are much improved. The time spent in direct reclaim and with kswapd are massively reduced, mostly by the lowlumpy patches. FTrace Reclaim Statistics: congestion_wait Direct number congest waited 725 303 126 3 Direct time congest waited 45524ms 9180ms 5936ms 300ms Direct full congest waited 487 190 52 3 Direct number conditional waited 0 0 200 301 Direct time conditional waited 0ms 0ms 0ms 1904ms Direct full conditional waited 0 0 0 19 KSwapd number congest waited 0 2 23 4 KSwapd time congest waited 0ms 200ms 420ms 404ms KSwapd full congest waited 0 2 2 4 KSwapd number conditional waited 0 0 0 0 KSwapd time conditional waited 0ms 0ms 0ms 0ms KSwapd full conditional waited 0 0 0 0 Not as dramatic a story here but the time spent asleep is reduced and we can still see what wait_iff_congested is going to sleep when necessary. MMTests Statistics: duration User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16 Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85 The time to complete this test goes way down. With the full series, we are allocating over twice the number of huge pages in 30% of the time and there is a corresponding impact on the allocation latency graph available at. http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps This patch: Add a trace event for shrink_inactive_list() and updates the sample postprocessing script appropriately. It can be used to determine how many pages were reclaimed and for non-lumpy reclaim where exactly the pages were reclaimed from. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27writeback: remove nonblocking/encountered_congestion referencesWu Fengguang2-5/+5
This removes more dead code that was somehow missed by commit 0d99519efef (writeback: remove unused nonblocking and congestion checks). There are no behavior change except for the removal of two entries from one of the ext4 tracing interface. The nonblocking checks in ->writepages are no longer used because the flusher now prefer to block on get_request_wait() than to skip inodes on IO congestion. The latter will lead to more seeky IO. The nonblocking checks in ->writepage are no longer used because it's redundant with the WB_SYNC_NONE check. We no long set ->nonblocking in VM page out and page migration, because a) it's effectively redundant with WB_SYNC_NONE in current code b) it's old semantic of "Don't get stuck on request queues" is mis-behavior: that would skip some dirty inodes on congestion and page out others, which is unfair in terms of LRU age. Inspired by Christoph Hellwig. Thanks! Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: David Howells <dhowells@redhat.com> Cc: Sage Weil <sage@newdream.net> Cc: Steve French <sfrench@samba.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-23Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-9/+68
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: remove in_workqueue_context() workqueue: Clarify that schedule_on_each_cpu is synchronous memory_hotplug: drop spurious calls to flush_scheduled_work() shpchp: update workqueue usage pciehp: update workqueue usage isdn/eicon: don't call flush_scheduled_work() from diva_os_remove_soft_isr() workqueue: add and use WQ_MEM_RECLAIM flag workqueue: fix HIGHPRI handling in keep_working() workqueue: add queue_work and activate_work trace points workqueue: prepare for more tracepoints workqueue: implement flush[_delayed]_work_sync() workqueue: factor out start_flush_work() workqueue: cleanup flush/cancel functions workqueue: implement alloc_ordered_workqueue() Fix up trivial conflict in fs/gfs2/main.c as per Tejun
2010-10-21Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-0/+29
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits) sched: Export account_system_vtime() sched: Call tick_check_idle before __irq_enter sched: Remove irq time from available CPU power sched: Do not account irq time to current task x86: Add IRQ_TIME_ACCOUNTING sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time sched: Add a PF flag for ksoftirqd identification sched: Consolidate account_system_vtime extern declaration sched: Fix softirq time accounting sched: Drop group_capacity to 1 only if local group has extra capacity sched: Force balancing on newidle balance if local group has capacity sched: Set group_imb only a task can be pulled from the busiest cpu sched: Do not consider SCHED_IDLE tasks to be cache hot sched: Drop all load weight manipulation for RT tasks sched: Create special class for stop/migrate work sched: Unindent labels sched: Comment updates: fix default latency and granularity numbers tracing/sched: Add sched_pi_setprio tracepoint sched: Give CPU bound RT tasks preference sched: Try not to migrate higher priority RT tasks ...
2010-10-05workqueue: add queue_work and activate_work trace pointsTejun Heo1-0/+53
These two tracepoints allow tracking when and how a work is queued and activated. This patch is based on Frederic's patch to add queue_work trace point. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>
2010-10-05workqueue: prepare for more tracepointsTejun Heo1-13/+19
Define workqueue_work event class and use it for workqueue_execute_end trace point. Also, move trace/events/workqueue.h include downwards such that all struct definitions are visible to it. This is to prepare for more tracepoints and doesn't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>
2010-09-21tracing/sched: Add sched_pi_setprio tracepointSteven Rostedt1-0/+29
Add a tracepoint that shows the priority of a task being boosted via priority inheritance. Cc: Gregory Haskins <ghaskins@novell.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-09-17tracing, perf: Add more power related eventsJean Pihet1-3/+87
This patch adds new generic events for dynamic power management tracing: - clock events class: used for clock enable/disable and for clock rate change, - power_domain events class: used for power domains transitions. The OMAP architecture will be using the new events for PM debugging, however the new events are made generic enough to be used by all platforms. Signed-off-by: Jean Pihet <j-pihet@ti.com> Acked-by: Thomas Renninger <trenn@suse.de> Cc: discuss@lesswatts.org Cc: linux-pm@lists.linux-foundation.org Cc: Thomas Renninger <trenn@suse.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Len Brown <len.brown@intel.com> Cc: Kevin Hilman <khilman@deeprootsystems.com> LKML-Reference: <AANLkTinUmbSUUuxUzc8++pcb9gd1CZFdyTQFrveTBXyV@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-07skb: Add tracepoints to freeing skbKoki Sanagi1-0/+17
This patch adds tracepoint to consume_skb and add trace_kfree_skb before __kfree_skb in skb_free_datagram_locked and net_tx_action. Combinating with tracepoint on dev_hard_start_xmit, we can check how long it takes to free transmitted packets. And using it, we can calculate how many packets driver had at that time. It is useful when a drop of transmitted packet is a problem. sshd-6828 [000] 112689.258154: consume_skb: skbaddr=f2d99bb8 Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com> Cc: Izumo Taku <izumi.taku@jp.fujitsu.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Scott Mcmillan <scott.a.mcmillan@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <4C724364.50903@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-09-07netdev: Add tracepoints to netdev layerKoki Sanagi1-0/+82
This patch adds tracepoint to dev_queue_xmit, dev_hard_start_xmit, netif_rx and netif_receive_skb. These tracepoints help you to monitor network driver's input/output. <idle>-0 [001] 112447.902030: netif_rx: dev=eth1 skbaddr=f3ef0900 len=84 <idle>-0 [001] 112447.902039: netif_receive_skb: dev=eth1 skbaddr=f3ef0900 len=84 sshd-6828 [000] 112447.903257: net_dev_queue: dev=eth4 skbaddr=f3fca538 len=226 sshd-6828 [000] 112447.903260: net_dev_xmit: dev=eth4 skbaddr=f3fca538 len=226 rc=0 Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com> Cc: Izumo Taku <izumi.taku@jp.fujitsu.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Scott Mcmillan <scott.a.mcmillan@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <4C72431E.3000901@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-09-07napi: Convert trace_napi_poll to TRACE_EVENTNeil Horman1-2/+23
This patch converts trace_napi_poll from DECLARE_EVENT to TRACE_EVENT to improve the usability of napi_poll tracepoint. <idle>-0 [001] 241302.750777: napi_poll: napi poll on napi struct f6acc480 for device eth3 <idle>-0 [000] 241302.852389: napi_poll: napi poll on napi struct f5d0d70c for device eth1 The original patch is below: http://marc.info/?l=linux-kernel&m=126021713809450&w=2 [ sanagi.koki@jp.fujitsu.com: And add a fix by Steven Rostedt: http://marc.info/?l=linux-kernel&m=126150506519173&w=2 ] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com> Cc: Izumo Taku <izumi.taku@jp.fujitsu.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Scott Mcmillan <scott.a.mcmillan@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <4C7242D7.4050009@jp.fujitsu.com> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-09-07irq: Add tracepoint to softirq_raiseLai Jiangshan1-2/+24
Add a tracepoint for tracing when softirq action is raised. This and the existing tracepoints complete softirq's tracepoints: softirq_raise, softirq_entry and softirq_exit. And when this tracepoint is used in combination with the softirq_entry tracepoint we can determine the softirq raise latency. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: David Miller <davem@davemloft.net> Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com> Cc: Izumo Taku <izumi.taku@jp.fujitsu.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Scott Mcmillan <scott.a.mcmillan@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <4C724298.4050509@jp.fujitsu.com> [ factorize softirq events with DECLARE_EVENT_CLASS ] Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-08-24Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds1-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: watchdog: Don't throttle the watchdog tracing: Fix timer tracing
2010-08-22workqueue: Add basic tracepoints to track workqueue executionArjan van de Ven1-0/+62
With the introduction of the new unified work queue thread pools, we lost one feature: It's no longer possible to know which worker is causing the CPU to wake out of idle. The result is that PowerTOP now reports a lot of "kworker/a:b" instead of more readable results. This patch adds a pair of tracepoints to the new workqueue code, similar in style to the timer/hrtimer tracepoints. With this pair of tracepoints, the next PowerTOP can correctly report which work item caused the wakeup (and how long it took): Interrupt (43) i915 time 3.51ms wakeups 141 Work ieee80211_iface_work time 0.81ms wakeups 29 Work do_dbs_timer time 0.55ms wakeups 24 Process Xorg time 21.36ms wakeups 4 Timer sched_rt_period_timer time 0.01ms wakeups 1 Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-19tracing: Fix timer tracingArjan van de Ven1-2/+6
PowerTOP would like to be able to trace timers. Unfortunately, the current timer tracing is not very useful: the actual timer function is not recorded in the trace at the start of timer execution. Although this is recorded for timer "start" time (when it gets armed), this is not useful; most timers get started early, and a tracer like PowerTOP will never see this event, but will only see the actual running of the timer. This patch just adds the function to the timer tracing; I've verified with PowerTOP that now it can get useful information about timers. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: xiaoguangrong@cn.fujitsu.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> # .35.x, .34.x, .33.x LKML-Reference: <4C6C5FA9.3000405@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-08-11Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2-5/+169
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits) block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n xen-blkfront: fix missing out label blkdev: fix blkdev_issue_zeroout return value block: update request stacking methods to support discards block: fix missing export of blk_types.h writeback: fix bad _bh spinlock nesting drbd: revert "delay probes", feature is being re-implemented differently drbd: Initialize all members of sync_conf to their defaults [Bugz 315] drbd: Disable delay probes for the upcomming release writeback: cleanup bdi_register writeback: add new tracepoints writeback: remove unnecessary init_timer call writeback: optimize periodic bdi thread wakeups writeback: prevent unnecessary bdi threads wakeups writeback: move bdi threads exiting logic to the forker thread writeback: restructure bdi forker loop a little writeback: move last_active to bdi writeback: do not remove bdi from bdi_list writeback: simplify bdi code a little writeback: do not lose wake-ups in bdi threads ... Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and drivers/scsi/scsi_error.c as per Jens.
2010-08-10memcg: add mm_vmscan_memcg_isolate tracepointKOSAKI Motohiro1-0/+15
Memcg also need to trace page isolation information as global reclaim. This patch does it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10vmscan: convert mm_vmscan_lru_isolate to DEFINE_EVENTKOSAKI Motohiro1-1/+16
Mel Gorman recently added some vmscan tracepoints. Unfortunately they are covered only global reclaim. But we want to trace memcg reclaim too. Thus, this patch convert them to DEFINE_TRACE macro. it help to reuse tracepoint definition for other similar usage (i.e. memcg). This patch have no functionally change. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10memcg, vmscan: add memcg reclaim tracepointKOSAKI Motohiro1-0/+28
Memcg also need to trace reclaim progress as direct reclaim. This patch add it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10vmscan: convert direct reclaim tracepoint to DEFINE_TRACEKOSAKI Motohiro1-2/+17
Mel Gorman recently added some vmscan tracepoints. Unfortunately they are covered only global reclaim. But we want to trace memcg reclaim too. Thus, this patch convert them to DEFINE_TRACE macro. it help to reuse tracepoint definition for other similar usage (i.e. memcg). This patch have no functionally change. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10vmscan: tracing: add trace event when a page is writtenMel Gorman1-0/+41
Add a trace event for when page reclaim queues a page for IO and records whether it is synchronous or asynchronous. Excessive synchronous IO for a process can result in noticeable stalls during direct reclaim. Excessive IO from page reclaim may indicate that the system is seriously under provisioned for the amount of dirty pages that exist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10vmscan: tracing: add trace events for LRU page isolationMel Gorman1-0/+46
Add an event for when pages are isolated en-masse from the LRU lists. This event augments the information available on LRU traffic and can be used to evaluate lumpy reclaim. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10vmscan: tracing: add trace events for kswapd wakeup, sleeping and direct reclaimMel Gorman3-37/+153
Add two trace events for kswapd waking up and going asleep for the purposes of tracking kswapd activity and two trace events for direct reclaim beginning and ending. The information can be used to work out how much time a process or the system is spending on the reclamation of pages and in the case of direct reclaim, how many pages were reclaimed for that process. High frequency triggering of these events could point to memory pressure problems. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-08Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds1-8/+12
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits) ext4: Adding error check after calling ext4_mb_regular_allocator() ext4: Fix dirtying of journalled buffers in data=journal mode ext4: re-inline ext4_rec_len_(to|from)_disk functions jbd2: Remove t_handle_lock from start_this_handle() jbd2: Change j_state_lock to be a rwlock_t jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop ext4: Add mount options in superblock ext4: force block allocation on quota_off ext4: fix freeze deadlock under IO ext4: drop inode from orphan list if ext4_delete_inode() fails ext4: check to make make sure bd_dev is set before dereferencing it jbd2: Make barrier messages less scary ext4: don't print scary messages for allocation failures post-abort ext4: fix EFBIG edge case when writing to large non-extent file ext4: fix ext4_get_blocks references ext4: Always journal quota file modifications ext4: Fix potential memory leak in ext4_fill_super ext4: Don't error out the fs if the user tries to make a file too big ext4: allocate stripe-multiple IOs on stripe boundaries ext4: move aio completion after unwritten extent conversion ... Fix up conflicts in fs/ext4/inode.c as per Ted. Fix up xfs conflicts as per earlier xfs merge.
2010-08-07Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-92/+0
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (55 commits) workqueue: mark init_workqueues() as early_initcall() workqueue: explain for_each_*cwq_cpu() iterators fscache: fix build on !CONFIG_SYSCTL slow-work: kill it gfs2: use workqueue instead of slow-work drm: use workqueue instead of slow-work cifs: use workqueue instead of slow-work fscache: drop references to slow-work fscache: convert operation to use workqueue instead of slow-work fscache: convert object to use workqueue instead of slow-work workqueue: fix how cpu number is stored in work->data workqueue: fix mayday_mask handling on UP workqueue: fix build problem on !CONFIG_SMP workqueue: fix locking in retry path of maybe_create_worker() async: use workqueue for worker pool workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead workqueue: implement unbound workqueue workqueue: prepare for WQ_UNBOUND implementation libata: take advantage of cmwq and remove concurrency limitations workqueue: fix worker management invocation without pending works ... Fixed up conflicts in fs/cifs/* as per Tejun. Other trivial conflicts in include/linux/workqueue.h, kernel/trace/Kconfig and kernel/workqueue.c
2010-08-07writeback: add new tracepointsArtem Bityutskiy1-0/+2
Add 2 new trace points to the periodic write-back wake up case, just like we do in the 'bdi_queue_work()' function. Namely, introduce: 1. trace_writeback_wake_thread(bdi) 2. trace_writeback_wake_forker_thread(bdi) The first event is triggered every time we wake up a bdi thread to start periodic background write-out. The second event is triggered only when the bdi thread does not exist and should be created by the forker thread. This patch was suggested by Dave Chinner and Christoph Hellwig. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback.h: needs linux/device.hRandy Dunlap1-0/+1
include/trace/events/writeback.h uses dev_name(), so it needs to include linux/device.h. include/trace/events/writeback.h:12: error: implicit declaration of function 'dev_name' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: Add tracing to write_cache_pagesDave Chinner1-0/+1
Add a trace event to the ->writepage loop in write_cache_pages to give visibility into how the ->writepage call is changing variables within the writeback control structure. Of most interest is how wbc->nr_to_write changes from call to call, especially with filesystems that write multiple pages in ->writepage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: Add tracing to balance_dirty_pagesDave Chinner1-0/+64
Tracing high level background writeback events is good, but it doesn't give the entire picture. Add visibility into write throttling to catch IO dispatched by foreground throttling of processing dirtying lots of pages. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: Initial tracing supportDave Chinner1-0/+91
Trace queue/sched/exec parts of the writeback loop. This provides insight into when and why flusher threads are scheduled to run. e.g a sync invocation leaves traces like: sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0 flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0 This also lays the foundation for adding more writeback tracing to provide deeper insight into the whole writeback path. The original tracing code is from Jens Axboe, though this version is a rewrite as a result of the code being traced changing significantly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07block: remove wrappers for request type/flagsChristoph Hellwig1-5/+10
Remove all the trivial wrappers for the cmd_type and cmd_flags fields in struct requests. This allows much easier grepping for different request types instead of unwinding through macros. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-02tracing: Drop cpparg() macroFrederic Weisbecker1-5/+2
Drop the cpparg() macro that wraps CPP parameters. We already have the PARAM() macro for that, no need to have several versions. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>
2010-07-27ext4: fix potential NULL dereference while tracingTheodore Ts'o1-8/+12
The allocation_context pointer can be NULL. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-23Merge branch 'tip/perf/core' of ↵Ingo Molnar1-9/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core
2010-07-22x86 cpufreq, perf: Make trace_power_frequency cpufreq driver independentThomas Renninger1-12/+15
and fix the broken case if a core's frequency depends on others. trace_power_frequency was only implemented in a rather ungeneric way in acpi-cpufreq driver's target() function only. -> Move the call to trace_power_frequency to cpufreq.c:cpufreq_notify_transition() where CPUFREQ_POSTCHANGE notifier is triggered. This will support power frequency tracing by all cpufreq drivers. trace_power_frequency did not trace frequency changes correctly when the userspace governor was used or when CPU cores' frequency depend on each other. -> Moving this into the CPUFREQ_POSTCHANGE notifier and pass the cpu which gets switched automatically fixes this. Robert Schoene provided some important fixes on top of my initial quick shot version which are integrated in this patch: - Forgot some changes in power_end trace (TP_printk/variable names) - Variable dummy in power_end must now be cpu_id - Use static 64 bit variable instead of unsigned int for cpu_id [akpm@linux-foundation.org: build fix] Signed-off-by: Thomas Renninger <trenn@suse.de> Cc: davej@codemonkey.org.uk Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Dave Jones <davej@codemonkey.org.uk> Acked-by: Arjan van de Ven <arjan@infradead.org> Cc: Robert Schoene <robert.schoene@tu-dresden.de> Tested-by: Robert Schoene <robert.schoene@tu-dresden.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2010-07-21tracing: Reduce latency and remove percpu trace_seqLai Jiangshan1-9/+3
__print_flags() and __print_symbolic() use percpu trace_seq: 1) Its memory is allocated at compile time, it wastes memory if we don't use tracing. 2) It is percpu data and it wastes more memory for multi-cpus system. 3) It disables preemption when it executes its core routine "trace_seq_printf(s, "%s: ", #call);" and introduces latency. So we move this trace_seq to struct trace_iterator. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4C078350.7090106@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-06-29workqueue: temporarily remove workqueue tracingTejun Heo1-92/+0
Strip tracing code from workqueue and remove workqueue tracing. This is temporary measure till concurrency managed workqueue is complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>
2010-06-29tracing: Use class->reg() for all registering of eventsSteven Rostedt1-0/+2
Because kprobes and syscalls need special processing to register events, the class->reg() method was created to handle the differences. But instead of creating a default ->reg for perf and ftrace events, the code was scattered with: if (class->reg) class->reg(); else default_reg(); This is messy and can also lead to bugs. This patch cleans up this code and creates a default reg() entry for the events allowing for the code to directly call the class->reg() without the condition. Reported-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-06-29tracing: Convert more sched events to DEFINE_EVENTLi Zefan1-25/+7
Convert sched_wait_task to DEFINE_EVENT, and save ~1K: text data bss dec hex filename 104595 9424 4992 119011 1d0e3 kernel/sched.o.orig 103619 9344 4992 117955 1ccc3 kernel/sched.o No change in functionality. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA3787.2040800@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-06-29tracing: Convert some timer events to DEFINE_EVENTLi Zefan1-48/+32
Use DECLARE_EVENT_CLASS, and save ~2.3K: text data bss dec hex filename 7018823 2031888 7251132 16301843 f8bf13 vmlinux.o.orig 7016727 2031696 7251132 16299555 f8b623 vmlinux.o 5 events are converted: timer_class: timer_init, timer_expire_exit, timer_cancel hrtimer_class: hrtimer_init, hrtimer_cancel No change in functionality. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA3773.3060200@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-06-29tracing: Use a global field list for all syscall exit eventsLi Zefan1-1/+0
All syscall exit events have the same fields. The kernel size drops 2.5K: text data bss dec hex filename 7018612 2034376 7251132 16304120 f8c7f8 vmlinux.o.orig 7018612 2031888 7251132 16301632 f8be40 vmlinux.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA3746.8070100@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-06-18Merge commit 'v2.6.35-rc3' into perf/coreIngo Molnar3-6/+21
Merge reason: Go from -rc1 base to -rc3 base, merge in fixes.
2010-06-10Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix null pointer deref with SEND_SIG_FORCED perf: Fix signed comparison in perf_adjust_period() powerpc/oprofile: fix potential buffer overrun in op_model_cell.c perf symbols: Set the DSO long name when using symbol_conf.vmlinux_name
2010-06-09Merge branch 'perf/core' of ↵Ingo Molnar2-61/+1
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core
2010-06-09writeback: pay attention to wbc->nr_to_write in write_cache_pagesDave Chinner1-4/+1
If a filesystem writes more than one page in ->writepage, write_cache_pages fails to notice this and continues to attempt writeback when wbc->nr_to_write has gone negative - this trace was captured from XFS: wbc_writeback_start: towrt=1024 wbc_writepage: towrt=1024 wbc_writepage: towrt=0 wbc_writepage: towrt=-1 wbc_writepage: towrt=-5 wbc_writepage: towrt=-21 wbc_writepage: towrt=-85 This has adverse effects on filesystem writeback behaviour. write_cache_pages() needs to terminate after a certain number of pages are written, not after a certain number of calls to ->writepage are made. This is a regression introduced by 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4 ("vfs: Add no_nrwrite_index_update writeback control flag"), but cannot be reverted directly due to subsequent bug fixes that have gone in on top of it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-09tracing: Fix null pointer deref with SEND_SIG_FORCEDOleg Nesterov1-1/+2
BUG: unable to handle kernel NULL pointer dereference at 0000000000000006 IP: [<ffffffff8107bd37>] ftrace_raw_event_signal_generate+0x87/0x140 TP_STORE_SIGINFO() forgets about SEND_SIG_FORCED, fix. We should probably export is_si_special() and change TP_STORE_SIGINFO() to use it in the longer term. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: 2.6.33.x-2.6.34.x <stable@kernel.org> LKML-Reference: <20100603213409.GA8307@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-06-09tracing: Remove boot tracerAmérico Wang1-60/+0
The boot tracer is useless. It simply logs the initcalls but in fact these initcalls are also logged through printk while using the initcall_debug kernel parameter. Nobody seem to be using it so far. Then just remove it. Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Chase Douglas <chase.douglas@canonical.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <20100526105753.GA5677@cr0.nay.redhat.com> [ remove the hooks in main.c, and the headers ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>