summaryrefslogtreecommitdiff
path: root/mm/swapfile.c
AgeCommit message (Collapse)AuthorFilesLines
2020-06-02mm/swapfile.c: omit a duplicate code by compare tmp and max firstWei Yang1-10/+8
There are two duplicate code to handle the case when there is no available swap entry. To avoid this, we can compare tmp and max first and let the second guard do its job. No functional change is expected. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200421213824.8099-3-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: tmp is always smaller than maxWei Yang1-1/+1
If tmp is bigger or equal to max, we would jump to new_cluster. Return true directly. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200421213824.8099-2-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: found_free could be represented by (tmp < max)Wei Yang1-8/+3
This is not necessary to use the variable found_free to record the status. Just check tmp and max is enough. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200421213824.8099-1-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: remove the extra check in scan_swap_map_slots()Wei Yang1-3/+0
scan_swap_map_slots() is only called by scan_swap_map() and get_swap_pages(). Both ensure nr would not exceed SWAP_BATCH. Just remove it. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200325220309.9803-2-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: simplify the calculation of n_goalWei Yang1-5/+1
Use min3() to simplify the comparison and make it more self-explaining. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200325220309.9803-1-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: remove the unnecessary goto for SSD caseWei Yang1-5/+1
Now we can see there is redundant goto for SSD case. In these two places, we can just let the code walk through to the correct tag instead of explicitly jump to it. Let's remove them for better readability. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20200328060520.31449-4-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: explicitly show ssd/non-ssd is handled mutually exclusiveWei Yang1-7/+3
The code shows if this is ssd, it will jump to specific tag and skip the following code for non-ssd. Let's use "else if" to explicitly show the mutually exclusion for ssd/non-ssd to reduce ambiguity. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20200328060520.31449-3-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: offset is only used when there is more slotsWei Yang1-3/+1
scan_swap_map_slots() is used to iterate swap_map[] array for an available swap entry. While after several optimizations, e.g. for ssd case, the logic of this function is a little not easy to catch. This patchset tries to clean up the logic a little: * shows the ssd/non-ssd case is handled mutually exclusively * remove some unnecessary goto for ssd case This patch (of 3): When si->cluster_nr is zero, function would reach done and return. The increased offset would not be used any more. This means we can move the offset increment into the if clause. This brings a further code cleanup possibility. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20200328060520.31449-1-richard.weiyang@gmail.com Link: http://lkml.kernel.org/r/20200328060520.31449-2-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: swap: properly update readahead statistics in unuse_pte_range()Andrea Righi1-4/+8
In unuse_pte_range() we blindly swap-in pages without checking if the swap entry is already present in the swap cache. By doing this, the hit/miss ratio used by the swap readahead heuristic is not properly updated and this leads to non-optimal performance during swapoff. Tracing the distribution of the readahead size returned by the swap readahead heuristic during swapoff shows that a small readahead size is used most of the time as if we had only misses (this happens both with cluster and vma readahead), for example: r::swapin_nr_pages(unsigned long offset):unsigned long:$retval COUNT EVENT 36948 $retval = 8 44151 $retval = 4 49290 $retval = 1 527771 $retval = 2 Checking if the swap entry is present in the swap cache, instead, allows to properly update the readahead statistics and the heuristic behaves in a better way during swapoff, selecting a bigger readahead size: r::swapin_nr_pages(unsigned long offset):unsigned long:$retval COUNT EVENT 1618 $retval = 1 4960 $retval = 2 41315 $retval = 4 103521 $retval = 8 In terms of swapoff performance the result is the following: Testing environment =================== - Host: CPU: 1.8GHz Intel Core i7-8565U (quad-core, 8MB cache) HDD: PC401 NVMe SK hynix 512GB MEM: 16GB - Guest (kvm): 8GB of RAM virtio block driver 16GB swap file on ext4 (/swapfile) Test case ========= - allocate 85% of memory - `systemctl hibernate` to force all the pages to be swapped-out to the swap file - resume the system - measure the time that swapoff takes to complete: # /usr/bin/time swapoff /swapfile Result (swapoff time) ====== 5.6 vanilla 5.6 w/ this patch ----------- ----------------- cluster-readahead 22.09s 12.19s vma-readahead 18.20s 15.33s Conclusion ========== The specific use case this patch is addressing is to improve swapoff performance in cloud environments when a VM has been hibernated, resumed and all the memory needs to be forced back to RAM by disabling swap. This change allows to better exploits the advantages of the readahead heuristic during swapoff and this improvement allows to to speed up the resume process of such VMs. [andrea.righi@canonical.com: update changelog] Link: http://lkml.kernel.org/r/20200418084705.GA147642@xps-13 Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Anchal Agarwal <anchalag@amazon.com> Cc: Hugh Dickins <hughd@google.com> Cc: Vineeth Remanan Pillai <vpillai@digitalocean.com> Cc: Kelley Nielsen <kelleynnn@gmail.com> Link: http://lkml.kernel.org/r/20200416180132.GB3352@xps-13 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile: use list_{prev,next}_entry() instead of open-codingchenqiwu1-10/+6
Use list_{prev,next}_entry() instead of list_entry() for better code readability. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Qian Cai <cai@lca.pw> Cc: Baoquan He <bhe@redhat.com> Link: http://lkml.kernel.org/r/1586599916-15456-2-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07proc: faster open/read/close with "permanent" filesAlexey Dobriyan1-0/+1
Now that "struct proc_ops" exist we can start putting there stuff which could not fly with VFS "struct file_operations"... Most of fs/proc/inode.c file is dedicated to make open/read/.../close reliable in the event of disappearing /proc entries which usually happens if module is getting removed. Files like /proc/cpuinfo which never disappear simply do not need such protection. Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such "permanent" files. Enable "permanent" flag for /proc/cpuinfo /proc/kmsg /proc/modules /proc/slabinfo /proc/stat /proc/sysvipc/* /proc/swaps More will come once I figure out foolproof way to prevent out module authors from marking their stuff "permanent" for performance reasons when it is not. This should help with scalability: benchmark is "read /proc/cpuinfo R times by N threads scattered over the system". N R t, s (before) t, s (after) ----------------------------------------------------- 64 4096 1.582458 1.530502 -3.2% 256 4096 6.371926 6.125168 -3.9% 1024 4096 25.64888 24.47528 -4.6% Benchmark source: #include <chrono> #include <iostream> #include <thread> #include <vector> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN); int N; const char *filename; int R; int xxx = 0; int glue(int n) { cpu_set_t m; CPU_ZERO(&m); CPU_SET(n, &m); return sched_setaffinity(0, sizeof(cpu_set_t), &m); } void f(int n) { glue(n % NR_CPUS); while (*(volatile int *)&xxx == 0) { } for (int i = 0; i < R; i++) { int fd = open(filename, O_RDONLY); char buf[4096]; ssize_t rv = read(fd, buf, sizeof(buf)); asm volatile ("" :: "g" (rv)); close(fd); } } int main(int argc, char *argv[]) { if (argc < 4) { std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R "; return 1; } N = atoi(argv[1]); filename = argv[2]; R = atoi(argv[3]); for (int i = 0; i < NR_CPUS; i++) { if (glue(i) == 0) break; } std::vector<std::thread> T; T.reserve(N); for (int i = 0; i < N; i++) { T.emplace_back(f, i); } auto t0 = std::chrono::system_clock::now(); { *(volatile int *)&xxx = 1; for (auto& t: T) { t.join(); } } auto t1 = std::chrono::system_clock::now(); std::chrono::duration<double> dt = t1 - t0; std::cout << dt.count() << ' '; return 0; } P.S.: Explicit randomization marker is added because adding non-function pointer will silently disable structure layout randomization. [akpm@linux-foundation.org: coding style fixes] Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/swapfile: fix data races in try_to_unuse()Qian Cai1-4/+4
si->inuse_pages could be accessed concurrently as noticed by KCSAN, write to 0xffff98b00ebd04dc of 4 bytes by task 82262 on cpu 92: swap_range_free+0xbe/0x230 swap_range_free at mm/swapfile.c:719 swapcache_free_entries+0x1be/0x250 free_swap_slot+0x1c8/0x220 __swap_entry_free.constprop.19+0xa3/0xb0 free_swap_and_cache+0x53/0xa0 unmap_page_range+0x7e0/0x1ce0 unmap_single_vma+0xcd/0x170 unmap_vmas+0x18b/0x220 exit_mmap+0xee/0x220 mmput+0xe7/0x240 do_exit+0x598/0xfd0 do_group_exit+0x8b/0x180 get_signal+0x293/0x13d0 do_signal+0x37/0x5d0 prepare_exit_to_usermode+0x1b7/0x2c0 ret_from_intr+0x32/0x42 read to 0xffff98b00ebd04dc of 4 bytes by task 82499 on cpu 46: try_to_unuse+0x86b/0xc80 try_to_unuse at mm/swapfile.c:2185 __x64_sys_swapoff+0x372/0xd40 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe The plain reads in try_to_unuse() are outside si->lock critical section which result in data races that could be dangerous to be used in a loop. Fix them by adding READ_ONCE(). Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/1582578903-29294-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/swapfile.c: fix comments for swapcache_prepareChen Wandun1-1/+1
The -EEXIST returned by __swap_duplicate means there is a swap cache instead -EBUSY Signed-off-by: Chen Wandun <chenwandun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200212145754.27123-1-chenwandun@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29mm/swapfile.c: move inode_lock out of claim_swapfileNaohiro Aota1-21/+20
claim_swapfile() currently keeps the inode locked when it is successful, or the file is already swapfile (with -EBUSY). And, on the other error cases, it does not lock the inode. This inconsistency of the lock state and return value is quite confusing and actually causing a bad unlock balance as below in the "bad_swap" section of __do_sys_swapon(). This commit fixes this issue by moving the inode_lock() and IS_SWAPFILE check out of claim_swapfile(). The inode is unlocked in "bad_swap_unlock_inode" section, so that the inode is ensured to be unlocked at "bad_swap". Thus, error handling codes after the locking now jumps to "bad_swap_unlock_inode" instead of "bad_swap". ===================================== WARNING: bad unlock balance detected! 5.5.0-rc7+ #176 Not tainted ------------------------------------- swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550 but there are no more locks to release! other info that might help us debug this: no locks held by swapon/4294. stack backtrace: CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176 Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014 Call Trace: dump_stack+0xa1/0xea print_unlock_imbalance_bug.cold+0x114/0x123 lock_release+0x562/0xed0 up_write+0x2d/0x490 __do_sys_swapon+0x94b/0x3550 __x64_sys_swapon+0x54/0x80 do_syscall_64+0xa4/0x4b0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f15da0a0dc7 Fixes: 1638045c3677 ("mm: set S_SWAPFILE on blockdev swap devices") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Qais Youef <qais.yousef@arm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-21mm/swapfile.c: fix a comment in sys_swapon()Christoph Hellwig1-1/+1
claim_swapfile now always takes i_rwsem. Link: http://lkml.kernel.org/r/20200114161225.309792-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04proc: convert everything to "struct proc_ops"Alexey Dobriyan1-7/+7
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in seq_file.h. Conversion rule is: llseek => proc_lseek unlocked_ioctl => proc_ioctl xxx => proc_xxx delete ".owner = THIS_MODULE" line [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c] [sfr@canb.auug.org.au: fix kernel/sched/psi.c] Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31mm/swapfile.c: swap_next should increase position indexVasily Averin1-1/+1
If seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. In Aug 2018 NeilBrown noticed commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") "Some ->next functions do not increment *pos when they return NULL... Note that such ->next functions are buggy and should be fixed. A simple demonstration is dd if=/proc/swaps bs=1000 skip=1 Choose any block size larger than the size of /proc/swaps. This will always show the whole last line of /proc/swaps" Described problem is still actual. If you make lseek into middle of last output line following read will output end of last line and whole last line once again. $ dd if=/proc/swaps bs=1 # usual output Filename Type Size Used Priority /dev/dm-0 partition 4194812 97536 -2 104+0 records in 104+0 records out 104 bytes copied $ dd if=/proc/swaps bs=40 skip=1 # last line was generated twice dd: /proc/swaps: cannot skip to specified offset v/dm-0 partition 4194812 97536 -2 /dev/dm-0 partition 4194812 97536 -2 3+1 records in 3+1 records out 131 bytes copied https://bugzilla.kernel.org/show_bug.cgi?id=206283 Link: http://lkml.kernel.org/r/bd8cfd7b-ac95-9b91-f9e7-e8438bd5047d@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jann Horn <jannh@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01mm, swap: disallow swapon() on zoned block devicesNaohiro Aota1-0/+7
A zoned block device consists of a number of zones. Zones are either conventional and accepting random writes or sequential and requiring that writes be issued in LBA order from each zone write pointer position. For the write restriction, zoned block devices are not suitable for a swap device. Disallow swapon on them. [akpm@linux-foundation.org: reflow and reword comment, per Christoph] Link: http://lkml.kernel.org/r/20191015085814.637837-1-naohiro.aota@wdc.com Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Theodore Y. Ts'o" <tytso@mit.edu> Cc: Hannes Reinecke <hare@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-20vfs: don't allow writes to swap filesDarrick J. Wong1-1/+11
Don't let userspace write to an active swap file because the kernel effectively has a long term lease on the storage and things could get seriously corrupted if we let this happen. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-08-20mm: set S_SWAPFILE on blockdev swap devicesDarrick J. Wong1-16/+15
Set S_SWAPFILE on block device inodes so that they have the same protections as a swap flie. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-07-12mm, swap: use rbtree for swap_extentAaron Lu1-62/+75
swap_extent is used to map swap page offset to backing device's block offset. For a continuous block range, one swap_extent is used and all these swap_extents are managed in a linked list. These swap_extents are used by map_swap_entry() during swap's read and write path. To find out the backing device's block offset for a page offset, the swap_extent list will be traversed linearly, with curr_swap_extent being used as a cache to speed up the search. This works well as long as swap_extents are not huge or when the number of processes that access swap device are few, but when the swap device has many extents and there are a number of processes accessing the swap device concurrently, it can be a problem. On one of our servers, the disk's remaining size is tight: $df -h Filesystem Size Used Avail Use% Mounted on ... ... /dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4 When creating a 80G swapfile there, there are as many as 84656 swap extents. The end result is, kernel spends abou 30% time in map_swap_entry() and swap throughput is only 70MB/s. As a comparison, when I used smaller sized swapfile, like 4G whose swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and map_swap_entry() is about 3%. One downside of using rbtree for swap_extent is, 'struct rbtree' takes 24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more for each swap_extent. For a swapfile that has 80k swap_extents, that means 625KiB more memory consumed. Test: Since it's not possible to reboot that server, I can not test this patch diretly there. Instead, I tested it on another server with NVMe disk. I created a 20G swapfile on an NVMe backed XFS fs. By default, the filesystem is quite clean and the created swapfile has only 2 extents. Testing vanilla and this patch shows no obvious performance difference when swapfile is not fragmented. To see the patch's effects, I used some tweaks to manually fragment the swapfile by breaking the extent at 1M boundary. This made the swapfile have 20K extents. nr_task=4 kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf) vanilla 165191 90.77% 171798 90.21% patched 858993 +420% 2.16% 715827 +317% 0.77% nr_task=8 kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf) vanilla 306783 92.19% 318145 87.76% patched 954437 +211% 2.35% 1073741 +237% 1.57% swapout: the throughput of swap out, in KB/s, higher is better 1st map_swap_entry: cpu cycles percent sampled by perf swapin: the throughput of swap in, in KB/s, higher is better. 2nd map_swap_entry: cpu cycles percent sampled by perf nr_task=1 doesn't show any difference, this is due to the curr_swap_extent can be effectively used to cache the correct swap extent for single task workload. [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/] Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, swap: fix race between swapoff and some swap operationsHuang Ying1-32/+122
When swapin is performed, after getting the swap entry information from the page table, system will swap in the swap entry, without any lock held to prevent the swap device from being swapoff. This may cause the race like below, CPU 1 CPU 2 ----- ----- do_swap_page swapin_readahead __read_swap_cache_async swapoff swapcache_prepare p->swap_map = NULL __swap_duplicate p->swap_map[?] /* !!! NULL pointer access */ Because swapoff is usually done when system shutdown only, the race may not hit many people in practice. But it is still a race need to be fixed. To fix the race, get_swap_device() is added to check whether the specified swap entry is valid in its swap device. If so, it will keep the swap entry valid via preventing the swap device from being swapoff, until put_swap_device() is called. Because swapoff() is very rare code path, to make the normal path runs as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of reference count is used to implement get/put_swap_device(). >From get_swap_device() to put_swap_device(), RCU reader side is locked, so synchronize_rcu() in swapoff() will wait until put_swap_device() is called. In addition to swap_map, cluster_info, etc. data structure in the struct swap_info_struct, the swap cache radix tree will be freed after swapoff, so this patch fixes the race between swap cache looking up and swapoff too. Races between some other swap cache usages and swapoff are fixed too via calling synchronize_rcu() between clearing PageSwapCache() and freeing swap cache data structure. Another possible method to fix this is to use preempt_off() + stop_machine() to prevent the swap device from being swapoff when its data structure is being accessed. The overhead in hot-path of both methods is similar. The advantages of RCU based method are, 1. stop_machine() may disturb the normal execution code path on other CPUs. 2. File cache uses RCU to protect its radix tree. If the similar mechanism is used for swap cache too, it is easier to share code between them. 3. RCU is used to protect swap cache in total_swapcache_pages() and exit_swap_address_space() already. The two mechanisms can be merged to simplify the logic. Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com Fixes: 235b62176712 ("mm/swap: add cluster lock") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Not-nacked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-21treewide: Add SPDX license identifier for missed filesThomas Gleixner1-0/+1
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-19mm: swapoff: shmem_unuse() stop eviction without igrab()Hugh Dickins1-6/+5
The igrab() in shmem_unuse() looks good, but we forgot that it gives no protection against concurrent unmounting: a point made by Konstantin Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78 ("tmpfs: fix race between umount and swapoff"). The current 5.1-rc swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds. Have a nice day..." followed by GPF. Once again, give up on using igrab(); but don't go back to making such heavy-handed use of shmem_swaplist_mutex as last time: that would spoil the new design, and I expect could deadlock inside shmem_swapin_page(). Instead, shmem_unuse() just raise a "stop_eviction" count in the shmem- specific inode, and shmem_evict_inode() wait for that to go down to 0. Call it "stop_eviction" rather than "swapoff_busy" because it can be put to use for others later (huge tmpfs patches expect to use it). That simplifies shmem_unuse(), protecting it from both unlink and unmount; and in practice lets it locate all the swap in its first try. But do not rely on that: there's still a theoretical case, when shmem_writepage() might have been preempted after its get_swap_page(), before making the swap entry visible to swapoff. [hughd@google.com: remove incorrect list_del()] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904091133570.1898@eggly.anvils Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081259400.1523@eggly.anvils Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca> Cc: Huang Ying <ying.huang@intel.com> Cc: Kelley Nielsen <kelleynnn@gmail.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Rik van Riel <riel@surriel.com> Cc: Vineeth Pillai <vpillai@digitalocean.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19mm: swapoff: take notice of completion soonerHugh Dickins1-8/+11
The old try_to_unuse() implementation was driven by find_next_to_unuse(), which terminated as soon as all the swap had been freed. Add inuse_pages checks now (alongside signal_pending()) to stop scanning mms and swap_map once finished. The same ought to be done in shmem_unuse() too, but never was before, and needs a different interface: so leave it as is for now. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081258200.1523@eggly.anvils Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca> Cc: Huang Ying <ying.huang@intel.com> Cc: Kelley Nielsen <kelleynnn@gmail.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Rik van Riel <riel@surriel.com> Cc: Vineeth Pillai <vpillai@digitalocean.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIESHugh Dickins1-8/+8
SWAP_UNUSE_MAX_TRIES 3 appeared to work well in earlier testing, but further testing has proved it to be a source of unnecessary swapoff EBUSY failures (which can then be followed by unmount EBUSY failures). When mmget_not_zero() or shmem's igrab() fails, there is an mm exiting or inode being evicted, freeing up swap independent of try_to_unuse(). Those typically completed much sooner than the old quadratic swapoff, but now it's more common that swapoff may need to wait for them. It's possible to move those cases from init_mm.mmlist and shmem_swaplist to separate "exiting" swaplists, and try_to_unuse() then wait for those lists to be emptied; but we've not bothered with that in the past, and don't want to risk missing some other forgotten case. So just revert to cycling around until the swap is gone, without any retries limit. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081256170.1523@eggly.anvils Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca> Cc: Huang Ying <ying.huang@intel.com> Cc: Kelley Nielsen <kelleynnn@gmail.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Rik van Riel <riel@surriel.com> Cc: Vineeth Pillai <vpillai@digitalocean.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm/swapfile.c: use struct_size() in kvzalloc()Gustavo A. R. Silva1-2/+1
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; size = sizeof(struct foo) + count * sizeof(struct boo); instance = kvzalloc(size, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL); Notice that, in this case, variable size is not necessary, hence it is removed. This code was detected with the help of Coccinelle. Link: http://lkml.kernel.org/r/20190221154622.GA19599@embeddedor Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06numa: make "nr_node_ids" unsigned intAlexey Dobriyan1-1/+1
Number of NUMA nodes can't be negative. This saves a few bytes on x86_64: add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238) Function old new delta hv_synic_alloc.cold 88 110 +22 prealloc_shrinker 260 262 +2 bootstrap 249 251 +2 sched_init_numa 1566 1567 +1 show_slab_objects 778 777 -1 s_show 1201 1200 -1 kmem_cache_init 346 345 -1 __alloc_workqueue_key 1146 1145 -1 mem_cgroup_css_alloc 1614 1612 -2 __do_sys_swapon 4702 4699 -3 __list_lru_init 655 651 -4 nic_probe 2379 2374 -5 store_user_store 118 111 -7 red_zone_store 106 99 -7 poison_store 106 99 -7 wq_numa_init 348 338 -10 __kmem_cache_empty 75 65 -10 task_numa_free 186 173 -13 merge_across_nodes_store 351 336 -15 irq_create_affinity_masks 1261 1246 -15 do_numa_crng_init 343 321 -22 task_numa_fault 4760 4737 -23 swapfile_init 179 156 -23 hv_synic_alloc 536 492 -44 apply_wqattrs_prepare 746 695 -51 Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm, swap: bounds check swap_info array accesses to avoid NULL derefsDaniel Jordan1-22/+29
Dan Carpenter reports a potential NULL dereference in get_swap_page_of_type: Smatch complains that the NULL checks on "si" aren't consistent. This seems like a real bug because we have not ensured that the type is valid and so "si" can be NULL. Add the missing check for NULL, taking care to use a read barrier to ensure CPU1 observes CPU0's updates in the correct order: CPU0 CPU1 alloc_swap_info() if (type >= nr_swapfiles) swap_info[type] = p /* handle invalid entry */ smp_wmb() smp_rmb() ++nr_swapfiles p = swap_info[type] Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before CPU0's write to swap_info[type] and read NULL from swap_info[type]. Ying Huang noticed other places in swapfile.c don't order these reads properly. Introduce swap_type_to_swap_info to encourage correct usage. Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model (see tools/memory-model/Documentation/explanation.txt). This ordering need not be enforced in places where swap_lock is held (e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles and the swap_info array. Link: http://lkml.kernel.org/r/20190131024410.29859-1-daniel.m.jordan@oracle.com Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Shaohua Li <shli@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tejun Heo <tj@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: rid swapoff of quadratic complexityVineeth Remanan Pillai1-270/+163
This patch was initially posted by Kelley Nielsen. Reposting the patch with all review comments addressed and with minor modifications and optimizations. Also, folding in the fixes offered by Hugh Dickins and Huang Ying. Tests were rerun and commit message updated with new results. try_to_unuse() is of quadratic complexity, with a lot of wasted effort. It unuses swap entries one by one, potentially iterating over all the page tables for all the processes in the system for each one. This new proposed implementation of try_to_unuse simplifies its complexity to linear. It iterates over the system's mms once, unusing all the affected entries as it walks each set of page tables. It also makes similar changes to shmem_unuse. Improvement swapoff was called on a swap partition containing about 6G of data, in a VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted. Present implementation....about 1200M calls(8min, avg 80% cpu util). Prototype.................about 9.0K calls(3min, avg 5% cpu util). Details In shmem_unuse(), iterate over the shmem_swaplist and, for each shmem_inode_info that contains a swap entry, pass it to shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(), iterate over its associated xarray, and store the index and value of each swap entry in an array for passing to shmem_swapin_page() outside of the RCU critical section. In try_to_unuse(), instead of iterating over the entries in the type and unusing them one by one, perhaps walking all the page tables for all the processes for each one, iterate over the mmlist, making one pass. Pass each mm to unuse_mm() to begin its page table walk, and during the walk, unuse all the ptes that have backing store in the swap type received by try_to_unuse(). After the walk, check the type for orphaned swap entries with find_next_to_unuse(), and remove them from the swap cache. If find_next_to_unuse() starts over at the beginning of the type, repeat the check of the shmem_swaplist and the walk a maximum of three times. Change unuse_mm() and the intervening walk functions down to unuse_pte_range() to take the type as a parameter, and to iterate over their entire range, calling the next function down on every iteration. In unuse_pte_range(), make a swap entry from each pte in the range using the passed in type. If it has backing store in the type, call swapin_readahead() to retrieve the page and pass it to unuse_pte(). Pass the count of pages_to_unuse down the page table walks in try_to_unuse(), and return from the walk when the desired number of pages has been swapped back in. Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, swap: fix swapoff with KSM pagesHuang Ying1-1/+2
KSM pages may be mapped to the multiple VMAs that cannot be reached from one anon_vma. So during swapin, a new copy of the page need to be generated if a different anon_vma is needed, please refer to comments of ksm_might_need_to_copy() for details. During swapoff, unuse_vma() uses anon_vma (if available) to locate VMA and virtual address mapped to the page, so not all mappings to a swapped out KSM page could be found. So in try_to_unuse(), even if the swap count of a swap entry isn't zero, the page needs to be deleted from swap cache, so that, in the next round a new page could be allocated and swapin for the other mappings of the swapped out KSM page. But this contradicts with the THP swap support. Where the THP could be deleted from swap cache only after the swap count of every swap entry in the huge swap cluster backing the THP has reach 0. So try_to_unuse() is changed in commit e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out") to check that before delete a page from swap cache, but this has broken KSM swapoff too. Fortunately, KSM is for the normal pages only, so the original behavior for KSM pages could be restored easily via checking PageTransCompound(). That is how this patch works. The bug is introduced by e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out"), which is merged by v4.14-rc1. So I think we should backport the fix to from 4.14 on. But Hugh thinks it may be rare for the KSM pages being in the swap device when swapoff, so nobody reports the bug so far. Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@kernel.org> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/swap: use nr_node_ids for avail_lists in swap_info_structAaron Lu1-1/+2
Since a2468cc9bfdf ("swap: choose swap device according to numa node"), avail_lists field of swap_info_struct is changed to an array with MAX_NUMNODES elements. This made swap_info_struct size increased to 40KiB and needs an order-4 page to hold it. This is not optimal in that: 1 Most systems have way less than MAX_NUMNODES(1024) nodes so it is a waste of memory; 2 It could cause swapon failure if the swap device is swapped on after system has been running for a while, due to no order-4 page is available as pointed out by Vasily Averin. Solve the above two issues by using nr_node_ids(which is the actual possible node number the running system has) for avail_lists instead of MAX_NUMNODES. nr_node_ids is unknown at compile time so can't be directly used when declaring this array. What I did here is to declare avail_lists as zero element array and allocate space for it when allocating space for swap_info_struct. The reason why keep using array but not pointer is plist_for_each_entry needs the field to be part of the struct, so pointer will not work. This patch is on top of Vasily Averin's fix commit. I think the use of kvzalloc for swap_info_struct is still needed in case nr_node_ids is really big on some systems. Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18mm/swapfile.c: use kvzalloc for swap_info_struct allocationVasily Averin1-3/+3
Commit a2468cc9bfdf ("swap: choose swap device according to numa node") changed 'avail_lists' field of 'struct swap_info_struct' to an array. In popular linux distros it increased size of swap_info_struct up to 40 Kbytes and now swap_info_struct allocation requires order-4 page. Switch to kvzmalloc allows to avoid unexpected allocation failures. Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com Fixes: a2468cc9bfdf ("swap: choose swap device according to numa node") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27mm: export add_swap_extent()Omar Sandoval1-0/+1
Btrfs currently does not support swap files because swap's use of bmap does not work with copy-on-write and multiple devices. See 35054394c4b3 ("Btrfs: stop providing a bmap operation to avoid swapfile corruptions"). However, the swap code has a mechanism for the filesystem to manually add swap extents using add_swap_extent() from the ->swap_activate() aop. iomap has done this since 67482129cdab ("iomap: add a swapfile activation function"). Btrfs will do the same in a later patch, so export add_swap_extent(). Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Sterba <dsterba@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27mm: split SWP_FILE into SWP_ACTIVATED and SWP_FSOmar Sandoval1-5/+8
The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go through the filesystem, and to make swapoff() call ->swap_deactivate(). For Btrfs, we want the latter but not the former, so split this flag into two. This makes us always call ->swap_deactivate() if ->swap_activate() succeeded, not just if it didn't add any swap extents itself. This also resolves the issue of the very misleading name of SWP_FILE, which is only used for swap files over NFS. Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27mm/swapfile.c: clear si->swap_map[] in swap_free_cluster()Huang Ying1-3/+1
si->swap_map[] of the swap entries in cluster needs to be cleared during freeing. Previously, this is done in the caller of swap_free_cluster(). This may cause code duplication (one user now, will add more users later) and lock/unlock cluster unnecessarily. In this patch, the clearing code is moved to swap_free_cluster() to avoid the downside. Link: http://lkml.kernel.org/r/20180827075535.17406-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27mm/swapfile.c: call free_swap_slot() in __swap_entry_free()Huang Ying1-6/+4
This is a code cleanup patch without functionality change. Originally, when __swap_entry_free() is called, and its return value is 0, free_swap_slot() will always be called to free the swap entry to the per-CPU pool. So move the call to free_swap_slot() to __swap_entry_free() to simplify the code. Link: http://lkml.kernel.org/r/20180827075535.17406-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27mm/swapfile.c: use __try_to_reclaim_swap() in free_swap_and_cache()Huang Ying1-32/+25
The code path to reclaim the swap entry in free_swap_and_cache() is almost same as that of __try_to_reclaim_swap(). The largest difference is just coding style. So the support to the additional requirement of free_swap_and_cache() is added into __try_to_reclaim_swap(). free_swap_and_cache() is changed to call __try_to_reclaim_swap(), and delete the duplicated code. This will improve code readability and reduce the potential bugs. There are 2 functionality differences between __try_to_reclaim_swap() and swap entry reclaim code of free_swap_and_cache(). - free_swap_and_cache() only reclaims the swap entry if the page is unmapped or swap is getting full. The support has been added into __try_to_reclaim_swap(). - try_to_free_swap() (called by __try_to_reclaim_swap()) checks pm_suspended_storage(), while free_swap_and_cache() not. I think this is OK. Because the page and the swap entry can be reclaimed later eventually. Link: http://lkml.kernel.org/r/20180827075535.17406-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: put_swap_page: share more between huge/normal code pathHuang Ying1-10/+10
In this patch, locking related code is shared between huge/normal code path in put_swap_page() to reduce code duplication. The `free_entries == 0` case is merged into the more general `free_entries != SWAPFILE_CLUSTER` case, because the new locking method makes it easy. The added lines is same as the removed lines. But the code size is increased when CONFIG_TRANSPARENT_HUGEPAGE=n. text data bss dec hex filename base: 24123 2004 340 26467 6763 mm/swapfile.o unified: 24485 2004 340 26829 68cd mm/swapfile.o Dig on step deeper with `size -A mm/swapfile.o` for base and unified kernel and compare the result, yields, -.text 17723 0 +.text 17835 0 -.orc_unwind_ip 1380 0 +.orc_unwind_ip 1480 0 -.orc_unwind 2070 0 +.orc_unwind 2220 0 -Total 26686 +Total 27048 The total difference is the same. The text segment difference is much smaller: 112. More difference comes from the ORC unwinder segments: (1480 + 2220) - (1380 + 2070) = 250. If the frame pointer unwinder is used, this costs nothing. Link: http://lkml.kernel.org/r/20180720071845.17920-9-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: add __swap_entry_free_locked()Huang Ying1-6/+14
The part of __swap_entry_free() with lock held is separated into a new function __swap_entry_free_locked(). Because we want to reuse that piece of code in some other places. Just mechanical code refactoring, there is no any functional change in this function. Link: http://lkml.kernel.org/r/20180720071845.17920-8-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm, swap, get_swap_pages: use entry_size instead of cluster in parameterHuang Ying1-8/+8
As suggested by Matthew Wilcox, it is better to use "int entry_size" instead of "bool cluster" as parameter to specify whether to operate for huge or normal swap entries. Because this improve the flexibility to support other swap entry size. And Dave Hansen thinks that this improves code readability too. So in this patch, the "bool cluster" parameter of get_swap_pages() is replaced by "int entry_size". And nr_swap_entries() trick is used to reduce the binary size when !CONFIG_TRANSPARENT_HUGE_PAGE. text data bss dec hex filename base 24215 2028 340 26583 67d7 mm/swapfile.o head 24123 2004 340 26467 6763 mm/swapfile.o Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: unify normal/huge code path in put_swap_page()Huang Ying1-46/+37
In this patch, the normal/huge code path in put_swap_page() and several helper functions are unified to avoid duplicated code, bugs, etc. and make it easier to review the code. The removed lines are more than added lines. And the binary size is kept exactly same when CONFIG_TRANSPARENT_HUGEPAGE=n. Link: http://lkml.kernel.org/r/20180720071845.17920-6-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: unify normal/huge code path in swap_page_trans_huge_swapped()Huang Ying1-4/+3
As suggested by Dave, we should unify the code path for normal and huge swap support if possible to avoid duplicated code, bugs, etc. and make it easier to review code. In this patch, the normal/huge code path in swap_page_trans_huge_swapped() is unified, the added and removed lines are same. And the binary size is kept almost same when CONFIG_TRANSPARENT_HUGEPAGE=n. text data bss dec hex filename base: 24179 2028 340 26547 67b3 mm/swapfile.o unified: 24215 2028 340 26583 67d7 mm/swapfile.o Link: http://lkml.kernel.org/r/20180720071845.17920-5-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: use swap_count() in swap_page_trans_huge_swapped()Huang Ying1-2/+2
In swap_page_trans_huge_swapped(), to identify whether there's any page table mapping for a 4k sized swap entry, "si->swap_map[i] != SWAP_HAS_CACHE" is used. This works correctly now, because all users of the function will only call it after checking SWAP_HAS_CACHE. But as pointed out by Daniel, it is better to use "swap_count(map[i])" here, because it works for "map[i] == 0" case too. And this makes the implementation more consistent between normal and huge swap entry. Link: http://lkml.kernel.org/r/20180720071845.17920-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-and-reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: replace some #ifdef with IS_ENABLED()Huang Ying1-40/+20
In mm/swapfile.c, THP (Transparent Huge Page) swap specific code is enclosed by #ifdef CONFIG_THP_SWAP/#endif to avoid code dilating when THP isn't enabled. But #ifdef/#endif in .c file hurt the code readability, so Dave suggested to use IS_ENABLED(CONFIG_THP_SWAP) instead and let compiler to do the dirty job for us. This has potential to remove some duplicated code too. From output of `size`, text data bss dec hex filename THP=y: 26269 2076 340 28685 700d mm/swapfile.o ifdef/endif: 24115 2028 340 26483 6773 mm/swapfile.o IS_ENABLED: 24179 2028 340 26547 67b3 mm/swapfile.o IS_ENABLED() based solution works quite well, almost as good as that of #ifdef/#endif. And from the diffstat, the removed lines are more than added lines. One #ifdef for split_swap_cluster() is kept. Because it is a public function with a stub implementation for CONFIG_THP_SWAP=n in swap.h. Link: http://lkml.kernel.org/r/20180720071845.17920-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm: swap: add comments to lock_cluster_or_swap_info()Huang Ying1-2/+7
Patch series "swap: THP optimizing refactoring", v4. Now the THP (Transparent Huge Page) swap optimizing is implemented in the way like below, #ifdef CONFIG_THP_SWAP huge_function(...) { } #else normal_function(...) { } #endif general_function(...) { if (huge) return thp_function(...); else return normal_function(...); } As pointed out by Dave Hansen, this will, 1. Create a new, wholly untested code path for huge page 2. Create two places to patch bugs 3. Are not reusing code when possible This patchset is to address these problems via merging huge/normal code path/functions if possible. One concern is that this may cause code size to dilate when !CONFIG_TRANSPARENT_HUGEPAGE. The data shows that most refactoring will only cause quite slight code size increase. This patch (of 8): To improve code readability. Link: http://lkml.kernel.org/r/20180720071845.17920-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-14Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+31
Pull block updates from Jens Axboe: "First pull request for this merge window, there will also be a followup request with some stragglers. This pull request contains: - Fix for a thundering heard issue in the wbt block code (Anchal Agarwal) - A few NVMe pull requests: * Improved tracepoints (Keith) * Larger inline data support for RDMA (Steve Wise) * RDMA setup/teardown fixes (Sagi) * Effects log suppor for NVMe target (Chaitanya Kulkarni) * Buffered IO suppor for NVMe target (Chaitanya Kulkarni) * TP4004 (ANA) support (Christoph) * Various NVMe fixes - Block io-latency controller support. Much needed support for properly containing block devices. (Josef) - Series improving how we handle sense information on the stack (Kees) - Lightnvm fixes and updates/improvements (Mathias/Javier et al) - Zoned device support for null_blk (Matias) - AIX partition fixes (Mauricio Faria de Oliveira) - DIF checksum code made generic (Max Gurtovoy) - Add support for discard in iostats (Michael Callahan / Tejun) - Set of updates for BFQ (Paolo) - Removal of async write support for bsg (Christoph) - Bio page dirtying and clone fixups (Christoph) - Set of bcache fix/changes (via Coly) - Series improving blk-mq queue setup/teardown speed (Ming) - Series improving merging performance on blk-mq (Ming) - Lots of other fixes and cleanups from a slew of folks" * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits) blkcg: Make blkg_root_lookup() work for queues in bypass mode bcache: fix error setting writeback_rate through sysfs interface null_blk: add lock drop/acquire annotation Blk-throttle: reduce tail io latency when iops limit is enforced block: paride: pd: mark expected switch fall-throughs block: Ensure that a request queue is dissociated from the cgroup controller block: Introduce blk_exit_queue() blkcg: Introduce blkg_root_lookup() block: Remove two superfluous #include directives blk-mq: count the hctx as active before allocating tag block: bvec_nr_vecs() returns value for wrong slab bcache: trivial - remove tailing backslash in macro BTREE_FLAG bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section bcache: set max writeback rate when I/O request is idle bcache: add code comments for bset.c bcache: fix mistaken comments in request.c bcache: fix mistaken code comments in bcache.h bcache: add a comment in super.c bcache: avoid unncessary cache prefetch bch_btree_node_get() bcache: display rate debug parameters to 0 when writeback is not running ...
2018-07-09memcontrol: schedule throttling if we are congestedTejun Heo1-0/+31
Memory allocations can induce swapping via kswapd or direct reclaim. If we are having IO done for us by kswapd and don't actually go into direct reclaim we may never get scheduled for throttling. So instead check to see if our cgroup is congested, and if so schedule the throttling. Before we return to user space the throttling stuff will only throttle if we actually required it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-20x86/speculation/l1tf: Limit swap file size to MAX_PA/2Andi Kleen1-16/+30
For the L1TF workaround its necessary to limit the swap file size to below MAX_PA/2, so that the higher bits of the swap offset inverted never point to valid memory. Add a mechanism for the architecture to override the swap file size check in swapfile.c and add a x86 specific max swapfile check function that enforces that limit. The check is only enabled if the CPU is vulnerable to L1TF. In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system with 46bit PA it is 32TB. The limit is only per individual swap file, so it's always possible to exceed these limits with multiple swap files or partitions. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@intel.com>
2018-06-15mm/swapfile.c: fix swap_count comment about nonexistent SWAP_HAS_CONTDaniel Jordan1-1/+1
Commit 570a335b8e22 ("swap_info: swap count continuations") introduces COUNT_CONTINUED but refers to it incorrectly as SWAP_HAS_CONT in a comment in swap_count. Fix it. Link: http://lkml.kernel.org/r/20180612175919.30413-1-daniel.m.jordan@oracle.com Fixes: 570a335b8e22 ("swap_info: swap count continuations") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>