summaryrefslogtreecommitdiff
path: root/fs/f2fs/gc.c
AgeCommit message (Collapse)AuthorFilesLines
2022-03-17f2fs: introduce gc_urgent_mid modeDaeho Jeong1-0/+3
We need a mid level of gc urgent mode to do GC forcibly in a period of given gc_urgent_sleep_time, but not like using greedy GC approach and switching to SSR mode such as gc urgent high mode. This can be used for more aggressive periodic storage clean up. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-04f2fs: fix to unlock page correctly in error path of is_alive()Chao Yu1-1/+3
As Pavel Machek reported in below link [1]: After commit 77900c45ee5c ("f2fs: fix to do sanity check in is_alive()"), node page should be unlock via calling f2fs_put_page() in the error path of is_alive(), otherwise, f2fs may hang when it tries to lock the node page, fix it. [1] https://lore.kernel.org/stable/20220124203637.GA19321@duo.ucw.cz/ Fixes: 77900c45ee5c ("f2fs: fix to do sanity check in is_alive()") Cc: <stable@vger.kernel.org> Reported-by: Pavel Machek <pavel@denx.de> Signed-off-by: Pavel Machek <pavel@denx.de> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-25f2fs: move f2fs to use reader-unfair rwsemsTim Murray1-23/+23
f2fs rw_semaphores work better if writers can starve readers, especially for the checkpoint thread, because writers are strictly more important than reader threads. This prevents significant priority inversion between low-priority readers that blocked while trying to acquire the read lock and a second acquisition of the write lock that might be blocking high priority work. Signed-off-by: Tim Murray <timmurray@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-19Merge tag 'f2fs-for-5.17-rc1' of ↵Linus Torvalds1-5/+21
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've tried to address some performance issues in f2fs_checkpoint and direct IO flows. Also, there was a work to enhance the page cache management used for compression. Other than them, we've done typical work including sysfs, code clean-ups, tracepoint, sanity check, in addition to bug fixes on corner cases. Enhancements: - use iomap for direct IO - try to avoid lock contention to improve f2fs_ckpt speed - avoid unnecessary memory allocation in compression flow - POSIX_FADV_DONTNEED drops the page cache containing compression pages - add some sysfs entries (gc_urgent_high_remaining, pending_discard) Bug fixes: - try not to expose unwritten blocks to user by DIO (this was added to avoid merge conflict; another patch is coming to address other missing case) - relax minor error condition for file pinning feature used in Android OTA - fix potential deadlock case in compression flow - should not truncate any block on pinned file In addition, we've done some code clean-ups and tracepoint/sanity check improvement" * tag 'f2fs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits) f2fs: do not allow partial truncation on pinned file f2fs: remove redunant invalidate compress pages f2fs: Simplify bool conversion f2fs: don't drop compressed page cache in .{invalidate,release}page f2fs: fix to reserve space for IO align feature f2fs: fix to check available space of CP area correctly in update_ckpt_flags() f2fs: support fault injection to f2fs_trylock_op() f2fs: clean up __find_inline_xattr() with __find_xattr() f2fs: fix to do sanity check on last xattr entry in __f2fs_setxattr() f2fs: do not bother checkpoint by f2fs_get_node_info f2fs: avoid down_write on nat_tree_lock during checkpoint f2fs: compress: fix potential deadlock of compress file f2fs: avoid EINVAL by SBI_NEED_FSCK when pinning a file f2fs: add gc_urgent_high_remaining sysfs node f2fs: fix to do sanity check in is_alive() f2fs: fix to avoid panic in is_alive() if metadata is inconsistent f2fs: fix to do sanity check on inode type during garbage collection f2fs: avoid duplicate call of mark_inode_dirty f2fs: show number of pending discard commands f2fs: support POSIX_FADV_DONTNEED drop compressed page cache ...
2022-01-15mm: introduce memalloc_retry_wait()NeilBrown1-3/+2
Various places in the kernel - largely in filesystems - respond to a memory allocation failure by looping around and re-trying. Some of these cannot conveniently use __GFP_NOFAIL, for reasons such as: - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on - a need to check for the process being signalled between failures - the possibility that other recovery actions could be performed - the allocation is quite deep in support code, and passing down an extra flag to say if __GFP_NOFAIL is wanted would be clumsy. Many of these currently use congestion_wait() which (in almost all cases) simply waits the given timeout - congestion isn't tracked for most devices. It isn't clear what the best delay is for loops, but it is clear that the various filesystems shouldn't be responsible for choosing a timeout. This patch introduces memalloc_retry_wait() with takes on that responsibility. Code that wants to retry a memory allocation can call this function passing the GFP flags that were used. It will wait however is appropriate. For now, it only considers __GFP_NORETRY and whatever gfpflags_allow_blocking() tests. If blocking is allowed without __GFP_NORETRY, then alloc_page either made some reclaim progress, or waited for a while, before failing. So there is no need for much further waiting. memalloc_retry_wait() will wait until the current jiffie ends. If this condition is not met, then alloc_page() won't have waited much if at all. In that case memalloc_retry_wait() waits about 200ms. This is the delay that most current loops uses. linux/sched/mm.h needs to be included in some files now, but linux/backing-dev.h does not. Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-05f2fs: do not bother checkpoint by f2fs_get_node_infoJaegeuk Kim1-3/+3
This patch tries to mitigate lock contention between f2fs_write_checkpoint and f2fs_get_node_info along with nat_tree_lock. The idea is, if checkpoint is currently running, other threads that try to grab nat_tree_lock would be better to wait for checkpoint. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-11f2fs: add gc_urgent_high_remaining sysfs nodeDaeho Jeong1-0/+12
Added a new sysfs node called gc_urgent_high_remaining. The user can set the trial count limit for GC urgent high mode with this value. If GC thread gets to the limit, the mode will turn back to GC normal mode. By default, the value is zero, which means there is no limit like before. Signed-off-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-11f2fs: fix to do sanity check in is_alive()Chao Yu1-0/+3
In fuzzed image, SSA table may indicate that a data block belongs to invalid node, which node ID is out-of-range (0, 1, 2 or max_nid), in order to avoid migrating inconsistent data in such corrupted image, let's do sanity check anyway before data block migration. Cc: stable@vger.kernel.org Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-11f2fs: fix to avoid panic in is_alive() if metadata is inconsistentChao Yu1-1/+1
As report by Wenqing Liu in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215231 If we enable CONFIG_F2FS_CHECK_FS config, and with fuzzed image attached in above link, we will encounter panic when executing below script: 1. mkdir mnt 2. mount -t f2fs tmp1.img mnt 3. touch tmp F2FS-fs (loop11): mismatched blkaddr 5765 (source_blkaddr 1) in seg 3 kernel BUG at fs/f2fs/gc.c:1042! do_garbage_collect+0x90f/0xa80 [f2fs] f2fs_gc+0x294/0x12a0 [f2fs] f2fs_balance_fs+0x2c5/0x7d0 [f2fs] f2fs_create+0x239/0xd90 [f2fs] lookup_open+0x45e/0xa90 open_last_lookups+0x203/0x670 path_openat+0xae/0x490 do_filp_open+0xbc/0x160 do_sys_openat2+0x2f1/0x500 do_sys_open+0x5e/0xa0 __x64_sys_openat+0x28/0x40 Previously, f2fs tries to catch data inconcistency exception in between SSA and SIT table during GC, however once the exception is caught, it will call f2fs_bug_on to hang kernel, it's not needed, instead, let's set SBI_NEED_FSCK flag and skip migrating current block. Fixes: bbf9f7d90f21 ("f2fs: Fix indefinite loop in f2fs_gc()") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-11f2fs: fix to do sanity check on inode type during garbage collectionChao Yu1-1/+2
As report by Wenqing Liu in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215231 - Overview kernel NULL pointer dereference triggered in folio_mark_dirty() when mount and operate on a crafted f2fs image - Reproduce tested on kernel 5.16-rc3, 5.15.X under root 1. mkdir mnt 2. mount -t f2fs tmp1.img mnt 3. touch tmp 4. cp tmp mnt F2FS-fs (loop0): sanity_check_inode: inode (ino=49) extent info [5942, 4294180864, 4] is incorrect, run fsck to fix F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=31340049, run fsck to fix. BUG: kernel NULL pointer dereference, address: 0000000000000000 folio_mark_dirty+0x33/0x50 move_data_page+0x2dd/0x460 [f2fs] do_garbage_collect+0xc18/0x16a0 [f2fs] f2fs_gc+0x1d3/0xd90 [f2fs] f2fs_balance_fs+0x13a/0x570 [f2fs] f2fs_create+0x285/0x840 [f2fs] path_openat+0xe6d/0x1040 do_filp_open+0xc5/0x140 do_sys_openat2+0x23a/0x310 do_sys_open+0x57/0x80 The root cause is for special file: e.g. character, block, fifo or socket file, f2fs doesn't assign address space operations pointer array for mapping->a_ops field, so, in a fuzzed image, SSA table indicates a data block belong to special file, when f2fs tries to migrate that block, it causes NULL pointer access once move_data_page() calls a_ops->set_dirty_page(). Cc: stable@vger.kernel.org Reported-by: Wenqing Liu <wenqingliu0120@gmail.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-27f2fs: introduce fragment allocation mode mount optionDaeho Jeong1-1/+4
Added two options into "mode=" mount option to make it possible for developers to simulate filesystem fragmentation/after-GC situation itself. The developers use these modes to understand filesystem fragmentation/after-GC condition well, and eventually get some insights to handle them better. "fragment:segment": f2fs allocates a new segment in ramdom position. With this, we can simulate the after-GC condition. "fragment:block" : We can scatter block allocation with "max_fragment_chunk" and "max_fragment_hole" sysfs nodes. f2fs will allocate 1..<max_fragment_chunk> blocks in a chunk and make a hole in the length of 1..<max_fragment_hole> by turns in a newly allocated free segment. Plus, this mode implicitly enables "fragment:segment" option for more randomness. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-30f2fs: fix to account missing .skipped_gc_rwsemChao Yu1-1/+3
There is a missing place we forgot to account .skipped_gc_rwsem, fix it. Fixes: 6f8d4455060d ("f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-23f2fs: separate out iostat featureDaeho Jeong1-0/+1
Added F2FS_IOSTAT config option to support getting IO statistics through sysfs and printing out periodic IO statistics tracepoint events and moved I/O statistics related codes into separate files for better maintenance. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> [Jaegeuk Kim: set default=y] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-17f2fs: support fault injection for f2fs_kmem_cache_alloc()Chao Yu1-2/+4
This patch supports to inject fault into f2fs_kmem_cache_alloc(). Usage: a) echo 32768 > /sys/fs/f2fs/<dev>/inject_type or b) mount -o fault_type=32768 <dev> <mountpoint> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-19f2fs: Revert "f2fs: Fix indefinite loop in f2fs_gc() v1"Jia Yang1-1/+1
This reverts commit 957fa47823dfe449c5a15a944e4e7a299a6601db. The patch "f2fs: Fix indefinite loop in f2fs_gc()" v1 and v4 are all merged. Patch v4 is test info for patch v1. Patch v1 doesn't work and may cause that sbi->cur_victim_sec can't be resetted to NULL_SEGNO, which makes SSR unable to get segment of sbi->cur_victim_sec. So it should be reverted. The mails record: [1] https://lore.kernel.org/linux-f2fs-devel/7288dcd4-b168-7656-d1af-7e2cafa4f720@huawei.com/T/ [2] https://lore.kernel.org/linux-f2fs-devel/20190809153653.GD93481@jaegeuk-macbookpro.roam.corp.google.com/T/ Signed-off-by: Jia Yang <jiayang5@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-14f2fs: add sysfs nodes to get GC info for each GC modeDaeho Jeong1-0/+1
Added gc_reclaimed_segments and gc_segment_mode sysfs nodes. 1) "gc_reclaimed_segments" shows how many segments have been reclaimed by GC during a specific GC mode. 2) "gc_segment_mode" is used to control for which gc mode the "gc_reclaimed_segments" node shows. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-28f2fs: remove false alarm on iget failure during GCJaegeuk Kim1-3/+1
This patch removes setting SBI_NEED_FSCK when GC gets an error on f2fs_iget, since f2fs_iget can give ENOMEM and others by race condition. If we set this critical fsck flag, we'll get EIO during fsync via the below code path. In f2fs_inplace_write_data(), if (is_sbi_flag_set(sbi, SBI_NEED_FSCK) || f2fs_cp_error(sbi)) { err = -EIO; goto drop_bio; } Fixes: 9557727876674 ("f2fs: drop inplace IO if fs status is abnormal") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23f2fs: compress: add compress_inode to cache compressed blocksChao Yu1-0/+1
Support to use address space of inner inode to cache compressed block, in order to improve cache hit ratio of random read. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23f2fs: logging neateningJoe Perches1-2/+2
Update the logging uses that have unnecessary newlines as the f2fs_printk function and so its f2fs_<level> macro callers already adds one. This allows searching single line logging entries with an easier grep and also avoids unnecessary blank lines in the logging. Miscellanea: o Coalesce formats o Align to open parenthesis Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-14f2fs: atgc: fix to set default age thresholdChao Yu1-0/+1
Default age threshold value is missed to set, fix it. Fixes: 093749e296e2 ("f2fs: support age threshold based garbage collection") Reported-by: Sahitya Tummala <stummala@codeaurora.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-14f2fs: restructure f2fs page.private layoutChao Yu1-3/+3
Restruct f2fs page private layout for below reasons: There are some cases that f2fs wants to set a flag in a page to indicate a specified status of page: a) page is in transaction list for atomic write b) page contains dummy data for aligned write c) page is migrating for GC d) page contains inline data for inline inode flush e) page belongs to merkle tree, and is verified for fsverity f) page is dirty and has filesystem/inode reference count for writeback g) page is temporary and has decompress io context reference for compression There are existed places in page structure we can use to store f2fs private status/data: - page.flags: PG_checked, PG_private - page.private However it was a mess when we using them, which may cause potential confliction: page.private PG_private PG_checked page._refcount (+1 at most) a) -1 set +1 b) -2 set c), d), e) set f) 0 set +1 g) pointer set The other problem is page.flags has no free slot, if we can avoid set zero to page.private and set PG_private flag, then we use non-zero value to indicate PG_private status, so that we may have chance to reclaim PG_private slot for other usage. [1] The other concern is f2fs has bad scalability in aspect of indicating more page status. So in this patch, let's restructure f2fs' page.private as below to solve above issues: Layout A: lowest bit should be 1 | bit0 = 1 | bit1 | bit2 | ... | bit MAX | private data .... | bit 0 PAGE_PRIVATE_NOT_POINTER bit 1 PAGE_PRIVATE_ATOMIC_WRITE bit 2 PAGE_PRIVATE_DUMMY_WRITE bit 3 PAGE_PRIVATE_ONGOING_MIGRATION bit 4 PAGE_PRIVATE_INLINE_INODE bit 5 PAGE_PRIVATE_REF_RESOURCE bit 6- f2fs private data Layout B: lowest bit should be 0 page.private is a wrapped pointer. After the change: page.private PG_private PG_checked page._refcount (+1 at most) a) 11 set +1 b) 101 set +1 c) 1001 set +1 d) 10001 set +1 e) set f) 100001 set +1 g) pointer set +1 [1] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org/T/#u Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-10f2fs: clean up build warningsYi Zhuang1-1/+5
This patch combined the below three clean-up patches. - modify open brace '{' following function definitions - ERROR: spaces required around that ':' - ERROR: spaces required before the open parenthesis '(' - ERROR: spaces prohibited before that ',' - Made suggested modifications from checkpatch in reference to WARNING: Missing a blank line after declarations Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com> Signed-off-by: Jia Yang <jiayang5@huawei.com> Signed-off-by: Jack Qiu <jack.qiu@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-31f2fs: introduce gc_merge mount optionChao Yu1-4/+22
In this patch, we will add two new mount options: "gc_merge" and "nogc_merge", when background_gc is on, "gc_merge" option can be set to let background GC thread to handle foreground GC requests, it can eliminate the sluggish issue caused by slow foreground GC operation when GC is triggered from a process with limited I/O and CPU resources. Original idea is from Xiang. Signed-off-by: Gao Xiang <xiang@kernel.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-26f2fs: fix to avoid touching checkpointed data in get_victim()Chao Yu1-8/+20
In CP disabling mode, there are two issues when using LFS or SSR | AT_SSR mode to select victim: 1. LFS is set to find source section during GC, the victim should have no checkpointed data, since after GC, section could not be set free for reuse. Previously, we only check valid chpt blocks in current segment rather than section, fix it. 2. SSR | AT_SSR are set to find target segment for writes which can be fully filled by checkpointed and newly written blocks, we should never select such segment, otherwise it can cause panic or data corruption during allocation, potential case is described as below: a) target segment has 'n' (n < 512) ckpt valid blocks b) GC migrates 'n' valid blocks to other segment (segment is still in dirty list) c) GC migrates '512 - n' blocks to target segment (segment has 'n' cp_vblocks and '512 - n' vblocks) d) If GC selects target segment via {AT,}SSR allocator, however there is no free space in targe segment. Fixes: 4354994f097d ("f2fs: checkpoint disabling") Fixes: 093749e296e2 ("f2fs: support age threshold based garbage collection") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-26f2fs: do not use AT_SSR mode in FG_GC & high urgent BG_GCWeichao Guo1-1/+2
AT_SSR mode is introduced by age threshold based GC for better hot/cold data seperation and avoiding free segment cost. However, LFS write mode is preferred in the scenario of foreground or high urgent GC, which should be finished ASAP. Let's only use AT_SSR in background GC and not high urgent GC modes. Signed-off-by: Weichao Guo <guoweichao@oppo.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-13f2fs: fix panic during f2fs_resize_fs()Chao Yu1-0/+13
f2fs_resize_fs() hangs in below callstack with testcase: - mkfs 16GB image & mount image - dd 8GB fileA - dd 8GB fileB - sync - rm fileA - sync - resize filesystem to 8GB kernel BUG at segment.c:2484! Call Trace: allocate_segment_by_default+0x92/0xf0 [f2fs] f2fs_allocate_data_block+0x44b/0x7e0 [f2fs] do_write_page+0x5a/0x110 [f2fs] f2fs_outplace_write_data+0x55/0x100 [f2fs] f2fs_do_write_data_page+0x392/0x850 [f2fs] move_data_page+0x233/0x320 [f2fs] do_garbage_collect+0x14d9/0x1660 [f2fs] free_segment_range+0x1f7/0x310 [f2fs] f2fs_resize_fs+0x118/0x330 [f2fs] __f2fs_ioctl+0x487/0x3680 [f2fs] __x64_sys_ioctl+0x8e/0xd0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The root cause is we forgot to check that whether we have enough space in resized filesystem to store all valid blocks in before-resizing filesystem, then allocator will run out-of-space during block migration in free_segment_range(). Fixes: b4b10061ef98 ("f2fs: refactor resize_fs to avoid meta updates in progress") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-13f2fs: fix to allow migrating fully valid segmentChao Yu1-9/+12
F2FS_IOC_FLUSH_DEVICE/F2FS_IOC_RESIZE_FS needs to migrate all blocks of target segment to other place, no matter the segment has partially or fully valid blocks. However, after commit 803e74be04b3 ("f2fs: stop GC when the victim becomes fully valid"), we may skip migration due to target segment is fully valid, result in failing the ioctl interface, fix this. Fixes: 803e74be04b3 ("f2fs: stop GC when the victim becomes fully valid") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-01-28f2fs: trival cleanup in move_data_block()Chao Yu1-5/+3
Trival cleanups: - relocate set_summary() before its use - relocate "allocate block address" to correct place - remove unneeded f2fs_wait_on_page_writeback() Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-03f2fs: change to use rwsem for cp_mutexSahitya Tummala1-2/+2
Use rwsem to ensure serialization of the callers and to avoid starvation of high priority tasks, when the system is under heavy IO workload. Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-14f2fs: clean up kvfreeChao Yu1-2/+2
After commit 0b6d4ca04a86 ("f2fs: don't return vmalloc() memory from f2fs_kmalloc()"), f2fs_k{m,z}alloc() will not return vmalloc()'ed memory, so clean up to use kfree() instead of kvfree() to free vmalloc()'ed memory. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11f2fs: support age threshold based garbage collectionChao Yu1-8/+372
There are several issues in current background GC algorithm: - valid blocks is one of key factors during cost overhead calculation, so if segment has less valid block, however even its age is young or it locates hot segment, CB algorithm will still choose the segment as victim, it's not appropriate. - GCed data/node will go to existing logs, no matter in-there datas' update frequency is the same or not, it may mix hot and cold data again. - GC alloctor mainly use LFS type segment, it will cost free segment more quickly. This patch introduces a new algorithm named age threshold based garbage collection to solve above issues, there are three steps mainly: 1. select a source victim: - set an age threshold, and select candidates beased threshold: e.g. 0 means youngest, 100 means oldest, if we set age threshold to 80 then select dirty segments which has age in range of [80, 100] as candiddates; - set candidate_ratio threshold, and select candidates based the ratio, so that we can shrink candidates to those oldest segments; - select target segment with fewest valid blocks in order to migrate blocks with minimum cost; 2. select a target victim: - select candidates beased age threshold; - set candidate_radius threshold, search candidates whose age is around source victims, searching radius should less than the radius threshold. - select target segment with most valid blocks in order to avoid migrating current target segment. 3. merge valid blocks from source victim into target victim with SSR alloctor. Test steps: - create 160 dirty segments: * half of them have 128 valid blocks per segment * left of them have 384 valid blocks per segment - run background GC Benefit: GC count and block movement count both decrease obviously: - Before: - Valid: 86 - Dirty: 1 - Prefree: 11 - Free: 6001 (6001) GC calls: 162 (BG: 220) - data segments : 160 (160) - node segments : 2 (2) Try to move 41454 blocks (BG: 41454) - data blocks : 40960 (40960) - node blocks : 494 (494) IPU: 0 blocks SSR: 0 blocks in 0 segments LFS: 41364 blocks in 81 segments - After: - Valid: 87 - Dirty: 0 - Prefree: 4 - Free: 6008 (6008) GC calls: 75 (BG: 76) - data segments : 74 (74) - node segments : 1 (1) Try to move 12813 blocks (BG: 12813) - data blocks : 12544 (12544) - node blocks : 269 (269) IPU: 0 blocks SSR: 12032 blocks in 77 segments LFS: 855 blocks in 2 segments Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11f2fs: inherit mtime of original block during GCChao Yu1-2/+2
Don't let f2fs inner GC ruins original aging degree of segment. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11f2fs: introduce inmem cursegChao Yu1-1/+1
Previous implementation of aligned pinfile allocation will: - allocate new segment on cold data log no matter whether last used segment is partially used or not, it makes IOs more random; - force concurrent cold data/GCed IO going into warm data area, it can make a bad effect on hot/cold data separation; In this patch, we introduce a new type of log named 'inmem curseg', the differents from normal curseg is: - it reuses existed segment type (CURSEG_XXX_NODE/DATA); - it only exists in memory, its segno, blkofs, summary will not b persisted into checkpoint area; With this new feature, we can enhance scalability of log, special allocators can be created for purposes: - pure lfs allocator for aligned pinfile allocation or file defragmentation - pure ssr allocator for later feature So that, let's update aligned pinfile allocation to use this new inmem curseg fwk. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11f2fs: support zone capacity less than zone sizeAravind Ramesh1-6/+19
NVMe Zoned Namespace devices can have zone-capacity less than zone-size. Zone-capacity indicates the maximum number of sectors that are usable in a zone beginning from the first sector of the zone. This makes the sectors sectors after the zone-capacity till zone-size to be unusable. This patch set tracks zone-size and zone-capacity in zoned devices and calculate the usable blocks per segment and usable segments per section. If zone-capacity is less than zone-size mark only those segments which start before zone-capacity as free segments. All segments at and beyond zone-capacity are treated as permanently used segments. In cases where zone-capacity does not align with segment size the last segment will start before zone-capacity and end beyond the zone-capacity of the zone. For such spanning segments only sectors within the zone-capacity are used. During writes and GC manage the usable segments in a section and usable blocks per segment. Segments which are beyond zone-capacity are never allocated, and do not need to be garbage collected, only the segments which are before zone-capacity needs to garbage collected. For spanning segments based on the number of usable blocks in that segment, write to blocks only up to zone-capacity. Zone-capacity is device specific and cannot be configured by the user. Since NVMe ZNS device zones are sequentially write only, a block device with conventional zones or any normal block device is needed along with the ZNS device for the metadata operations of F2fs. A typical nvme-cli output of a zoned device shows zone start and capacity and write pointer as below: SLBA: 0x0 WP: 0x0 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ SLBA: 0x20000 WP: 0x20000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ SLBA: 0x40000 WP: 0x40000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ Here zone size is 64MB, capacity is 49MB, WP is at zone start as the zones are in EMPTY state. For each zone, only zone start + 49MB is usable area, any lba/sector after 49MB cannot be read or written to, the drive will fail any attempts to read/write. So, the second zone starts at 64MB and is usable till 113MB (64 + 49) and the range between 113 and 128MB is again unusable. The next zone starts at 128MB, and so on. Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-08f2fs: add GC_URGENT_LOW mode in gc_urgentDaeho Jeong1-3/+3
Added a new gc_urgent mode, GC_URGENT_LOW, in which mode F2FS will lower the bar of checking idle in order to process outstanding discard commands and GC a little bit aggressively. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-08f2fs: fix return value of move_data_block()Chao Yu1-1/+3
If f2fs_grab_cache_page() fails, it needs to return -ENOMEM. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-08f2fs: add f2fs_gc exception handle in f2fs_ioc_gc_rangeQilong Zhang1-6/+14
When f2fs_ioc_gc_range performs multiple segments gc ops, the return value of f2fs_ioc_gc_range is determined by the last segment gc ops. If its ops failed, the f2fs_ioc_gc_range will be considered to be failed despite some of previous segments gc ops succeeded. Therefore, so we fix: Redefine the return value of getting victim ops and add exception handle for f2fs_gc. In particular, 1).if target has no valid block, it will go on. 2).if target sectoion has valid block(s), but it is current section, we will reminder the caller. Signed-off-by: Qilong Zhang <zhangqilong3@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-08f2fs: clean up parameter of f2fs_allocate_data_block()Chao Yu1-1/+1
Use validation of @fio to inidcate whether caller want to serialize IOs in io.io_list or not, then @add_list will be redundant, remove it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-08f2fs: add prefix for exported symbolsChao Yu1-1/+1
to avoid polluting global symbol namespace. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-18f2fs: get the right gc victim section when section has several segmentsJack Qiu1-17/+24
Assume each section has 4 segment: .___________________________. |_Segment0_|_..._|_Segment3_| . . . . .__________. |_section0_| Segment 0~2 has 0 valid block, segment 3 has 512 valid blocks. It will fail if we want to gc section0 in this scenes, because all 4 segments in section0 is not dirty. So we should use dirty section bitmap instead of dirty segment bitmap to get right victim section. Signed-off-by: Jack Qiu <jack.qiu@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-12f2fs: add compressed/gc data read IO statChao Yu1-0/+2
in order to account data read IOs more accurately. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-12f2fs: refactor resize_fs to avoid meta updates in progressJaegeuk Kim1-49/+68
Sahitya raised an issue: - prevent meta updates while checkpoint is in progress allocate_segment_for_resize() can cause metapage updates if it requires to change the current node/data segments for resizing. Stop these meta updates when there is a checkpoint already in progress to prevent inconsistent CP data. Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-04-17f2fs: support read iostatChao Yu1-0/+6
Adds to support accounting read IOs from userspace/kernel. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-31f2fs: don't trigger data flush in foreground operationChao Yu1-1/+1
Data flush can generate heavy IO and cause long latency during flush, so it's not appropriate to trigger it in foreground operation. And also, we may face below potential deadlock during data flush: - f2fs_write_multi_pages - f2fs_write_raw_pages - f2fs_write_single_data_page - f2fs_balance_fs - f2fs_balance_fs_bg - f2fs_sync_dirty_inodes - filemap_fdatawrite -- stuck on flush same cluster Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-23f2fs: fix to update f2fs_super_block fields under sb_lockChao Yu1-4/+13
Fields in struct f2fs_super_block should be updated under coverage of sb_lock, fix to adjust update_sb_metadata() for that rule. Fixes: 04f0b2eaa3b3 ("f2fs: ioctl for removing a range from F2FS") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-23f2fs: Fix mount failure due to SPO after a successful online resize FSSahitya Tummala1-0/+6
Even though online resize is successfully done, a SPO immediately after resize, still causes below error in the next mount. [ 11.294650] F2FS-fs (sda8): Wrong user_block_count: 2233856 [ 11.300272] F2FS-fs (sda8): Failed to get valid F2FS checkpoint This is because after FS metadata is updated in update_fs_metadata() if the SBI_IS_DIRTY is not dirty, then CP will not be done to reflect the new user_block_count. Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-19f2fs: skip migration only when BG_GC is calledJaegeuk Kim1-1/+1
FG_GC needs to move entire section more quickly. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-19f2fs: introduce DEFAULT_IO_TIMEOUTChao Yu1-1/+2
As Geert Uytterhoeven reported: for parameter HZ/50 in congestion_wait(BLK_RW_ASYNC, HZ/50); On some platforms, HZ can be less than 50, then unexpected 0 timeout jiffies will be set in congestion_wait(). This patch introduces a macro DEFAULT_IO_TIMEOUT to wrap a determinate value with msecs_to_jiffies(20) to instead HZ/50 to avoid such issue. Quoted from Geert Uytterhoeven: "A timeout of HZ means 1 second. HZ/50 means 20 ms, but has the risk of being zero, if HZ < 50. If you want to use a timeout of 20 ms, you best use msecs_to_jiffies(20), as that takes care of the special cases, and never returns 0." Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-19f2fs: skip GC when section is fullJaegeuk Kim1-2/+2
This fixes skipping GC when segment is full in large section. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-19f2fs: add migration count iff migration happensJaegeuk Kim1-1/+1
If first segment is empty and migration_granularity is 1, we can't move this at all. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>