Age | Commit message (Collapse) | Author | Files | Lines |
|
Document that inode index feature solves breaking hard links on
copy up.
Simplify Kconfig backward compatibility disclaimer.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Remove the "origin" language from the functions that handle set, get
and verify of "origin" xattr and pass the xattr name as an argument.
The same helpers are going to be used for NFS export to get, get and
verify the "upper" xattr for directory index entries.
ovl_verify_origin() is now a helper used only to verify non upper
file handle stored in "origin" xattr of upper inode.
The upper root dir file handle is still stored in "origin" xattr on
the index dir for backward compatibility. This is going to be changed
by the patch that adds directory index entries support.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Pass the fs instance with lower_layers array instead of the dentry
lowerstack array to ovl_check_origin_fh(), because the dentry members
of lowerstack play no role in this helper.
This change simplifies the argument list of ovl_check_origin(),
ovl_cleanup_index() and ovl_verify_index().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Re-factor ovl_check_origin() and ovl_get_origin(), so origin fh xattr is
read from upper inode only once during lookup with multiple lower layers
and only once when verifying index entry origin.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Store the fs root layer index inside ovl_layer struct, so we can
get the root fs layer index from merge dir lower layer instead of
find it with ovl_find_layer() helper.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When work dir creation fails, a warning is emitted and overlay is
mounted r/o. Trying to remount r/w will fail with no work dir.
When index dir creation fails, the same warning is emitted and overlay
is mounted r/o, but trying to remount r/w will succeed. This may cause
unintentional corruption of filesystem consistency.
Adjust the behavior of index dir creation failure to that of work dir
creation failure and do not allow to remount r/w. User needs to state
an explicitly intention to work without an index by mounting with
option 'index=off' to allow r/w mount with no index dir.
When mounting with option 'index=on' and no 'upperdir', index is
implicitly disabled, so do not warn about no file handle support.
The issue was introduced with inodes index feature in v4.13, but this
patch will not apply cleanly before ovl_fill_super() re-factoring in
v4.15.
Fixes: 02bcd1577400 ("ovl: introduce the inodes index dir feature")
Cc: <stable@vger.kernel.org> #v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Overlayfs falls back to index=off if lower/upper fs does not support
file handles. Do the same if upper fs does not support xattr.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
For a merge dir that was copied up before v4.12 or that was hand crafted
offline (e.g. mkdir {upper/lower}/dir), upper dir does not contain the
'trusted.overlay.origin' xattr. In that case, stat(2) on the merge dir
returns the lower dir st_ino, but getdents(2) returns the upper dir d_ino.
After this change, on merge dir lookup, missing origin xattr on upper
dir will be fixed and 'impure' xattr will be fixed on parent of the legacy
merge dir.
Suggested-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
en_rx_am.c was deleted in 'net-next' but had a bug fixed in it in
'net'.
The esp{4,6}_offload.c conflicts were overlapping changes.
The 'out' label is removed so we just return ERR_PTR(-EINVAL)
directly.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch just adds the capability for GFS2 to track which function
called gfs2_log_flush. This should make it easier to diagnose
problems based on the sequence of events found in the journals.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
This patch adds a new structure called gfs2_log_header_v2 which is used
to store expanded fields into previously unused areas of the log headers
(i.e., this change is backwards compatible). Some of these are used for
debug purposes so we can backtrack when problems occur. Others are
reserved for future expansion.
This patch is based on a prototype from Steve Whitehouse.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
It isn't needed at all in these files, dynamic debug is the best way to
enable this type of thing, if you really want it. As it is, these
defines were not doing anything at all.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Move the license "mark" of the sysfs files to be in SPDX form, instead
of the custom text that it currently is in. This is in a quest to get
rid of the 700+ different ways we say "GPLv2" in the kernel tree.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Commit bdcf0a423ea1 ("kernel: make groups_sort calling a responsibility
group_info allocators") appears to break nfsd rootsquash in a pretty
major way.
It adds a call to groups_sort() inside the loop that copies/squashes
gids, which means the valid gids are sorted along with the following
garbage. The net result is that the highest numbered valid gids are
replaced with any lower-valued garbage gids, possibly including 0.
We should sort only once, after filling in all the gids.
Fixes: bdcf0a423ea1 ("kernel: make groups_sort calling a responsibility ...")
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If fsck.f2fs changes crc, we have no way to recover some inode blocks by roll-
forward recovery. Let's relax the condition to recover them.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This fixes lost i_inline flags during roll-forward.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
__vfs_removexattr() transfers "NULL" value to the setxattr handler of
the f2fs filesystem in order to remove the extended attribute. But,
__f2fs_setxattr() just ignores the removal request when the value of
the extended attribute is already NULL. We have to remove the extended
attribute itself even if the value of that is already NULL.
We can reporduce this bug with the below:
1. touch file
2. setfattr -n "user.foo" file
3. setfattr -x "user.foo" file
4. getfattr -d file
> user.foo
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Tested-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Don't remain dirtied page cache in f2fs after shutdown, it can mitigate
memory pressure of whole system, in order to keep other modules working
properly.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Once filesystem shuts down, daemons like gc/discard thread should be
aware of it, and do exit, in addtion, drop all cached pending discard
commands and turn off real-time discard mode.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch makes f2fs_ioc_shutdown handling error case correctly.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch splits need_inplace_update to two functions:
a. should_update_inplace() includes all conditions that we must use IPU.
b. should_update_outplace() includes all conditions that we must use OPU.
So that, in f2fs_ioc_set_pin_file() and f2fs_defragment_range(), we can
use corresponding function to check whether we can trigger OPU/IPU or not.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch fixes to update last_disk_size only when writing out page
successfully.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Use get_inline_xattr_addrs directly instead of F2FS_INLINE_XATTR_ADDRS.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch cleans up error path of fille_super to avoid unneeded
release step.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When io_bits is set, GCing encrypted block may hit the following hungtask.
Since io_bits requires aligned block address, f2fs_submit_page_write may
return -EAGAIN if new_blkaddr does not satisify io_bits alignment. As a
result, the encrypted page will never be writtenback.
This patch makes move_data_block aware the EAGAIN error and cancel the
writeback.
[ 246.751371] INFO: task kworker/u4:4:797 blocked for more than 90 seconds.
[ 246.752423] Not tainted 4.15.0-rc4+ #11
[ 246.754176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.755336] kworker/u4:4 D25448 797 2 0x80000000
[ 246.755597] Workqueue: writeback wb_workfn (flush-7:0)
[ 246.755616] Call Trace:
[ 246.755695] ? __schedule+0x322/0xa90
[ 246.755761] ? blk_init_request_from_bio+0x120/0x120
[ 246.755773] ? pci_mmcfg_check_reserved+0xb0/0xb0
[ 246.755801] ? __radix_tree_create+0x19e/0x200
[ 246.755813] ? delete_node+0x136/0x370
[ 246.755838] schedule+0x43/0xc0
[ 246.755904] io_schedule+0x17/0x40
[ 246.755939] wait_on_page_bit_common+0x17b/0x240
[ 246.755950] ? wake_page_function+0xa0/0xa0
[ 246.755961] ? add_to_page_cache_lru+0x160/0x160
[ 246.755972] ? page_cache_tree_insert+0x170/0x170
[ 246.755983] ? __lru_cache_add+0x96/0xb0
[ 246.756086] __filemap_fdatawait_range+0x14f/0x1c0
[ 246.756097] ? wait_on_page_bit_common+0x240/0x240
[ 246.756120] ? __wake_up_locked_key_bookmark+0x20/0x20
[ 246.756167] ? wait_on_all_pages_writeback+0xc9/0x100
[ 246.756179] ? __remove_ino_entry+0x120/0x120
[ 246.756192] ? wait_woken+0x100/0x100
[ 246.756204] filemap_fdatawait_range+0x9/0x20
[ 246.756216] write_checkpoint+0x18a1/0x1f00
[ 246.756254] ? blk_get_request+0x10/0x10
[ 246.756265] ? cpumask_next_and+0x43/0x60
[ 246.756279] ? f2fs_sync_inode_meta+0x160/0x160
[ 246.756289] ? remove_element.isra.4+0xa0/0xa0
[ 246.756300] ? __put_compound_page+0x40/0x40
[ 246.756310] ? f2fs_sync_fs+0xec/0x1c0
[ 246.756320] ? f2fs_sync_fs+0x120/0x1c0
[ 246.756329] f2fs_sync_fs+0x120/0x1c0
[ 246.756357] ? trace_event_raw_event_f2fs__page+0x260/0x260
[ 246.756393] ? ata_build_rw_tf+0x173/0x410
[ 246.756397] f2fs_balance_fs_bg+0x198/0x390
[ 246.756405] ? drop_inmem_page+0x230/0x230
[ 246.756415] ? ahci_qc_prep+0x1bb/0x2e0
[ 246.756418] ? ahci_qc_issue+0x1df/0x290
[ 246.756422] ? __accumulate_pelt_segments+0x42/0xd0
[ 246.756426] ? f2fs_write_node_pages+0xd1/0x380
[ 246.756429] f2fs_write_node_pages+0xd1/0x380
[ 246.756437] ? sync_node_pages+0x8f0/0x8f0
[ 246.756440] ? update_curr+0x53/0x220
[ 246.756444] ? __accumulate_pelt_segments+0xa2/0xd0
[ 246.756448] ? __update_load_avg_se.isra.39+0x349/0x360
[ 246.756452] ? do_writepages+0x2a/0xa0
[ 246.756456] do_writepages+0x2a/0xa0
[ 246.756460] __writeback_single_inode+0x70/0x490
[ 246.756463] ? check_preempt_wakeup+0x199/0x310
[ 246.756467] writeback_sb_inodes+0x2a2/0x660
[ 246.756471] ? is_empty_dir_inode+0x40/0x40
[ 246.756474] ? __writeback_single_inode+0x490/0x490
[ 246.756477] ? string+0xbf/0xf0
[ 246.756480] ? down_read_trylock+0x35/0x60
[ 246.756484] __writeback_inodes_wb+0x9f/0xf0
[ 246.756488] wb_writeback+0x41d/0x4b0
[ 246.756492] ? writeback_inodes_wb.constprop.55+0x150/0x150
[ 246.756498] ? set_worker_desc+0xf7/0x130
[ 246.756502] ? current_is_workqueue_rescuer+0x60/0x60
[ 246.756511] ? _find_next_bit+0x2c/0xa0
[ 246.756514] ? wb_workfn+0x400/0x5d0
[ 246.756518] wb_workfn+0x400/0x5d0
[ 246.756521] ? finish_task_switch+0xdf/0x2a0
[ 246.756525] ? inode_wait_for_writeback+0x30/0x30
[ 246.756529] process_one_work+0x3a7/0x6f0
[ 246.756533] worker_thread+0x82/0x750
[ 246.756537] kthread+0x16f/0x1c0
[ 246.756541] ? trace_event_raw_event_workqueue_work+0x110/0x110
[ 246.756544] ? kthread_create_worker_on_cpu+0xb0/0xb0
[ 246.756548] ret_from_fork+0x1f/0x30
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch allows quota to use reserved blocks all the time.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In commit 57864ae5ce3a ("f2fs: limit # of inmemory pages"), we have
limited memory footprint of all inmem pages with 20% of total memory,
otherwise, if we exceed the threshold, we will try to drop all inmem
pages to avoid excessive memory pressure resulting in performance
regression.
But in some unrelated error paths, we will also drop all inmem pages,
which should be wrong, fix it in this patch.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We have supported to get next page offset with valid mapping crossing
hole in f2fs_map_blocks, utilizing it to speed up defragment on sparse
file.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch introduces a new ioctl F2FS_IOC_PRECACHE_EXTENTS to precache
extent info like ext4, in order to gain better performance during
triggering AIO by eliminating synchronous waiting of mapping info.
Referred commit: 7869a4a6c5ca ("ext4: add support for extent pre-caching")
In addition, with newly added extent precache abilitiy, this patch add
to support FIEMAP_FLAG_CACHE in ->fiemap.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch gives a flag to disable GC on given file, which would be useful, when
user wants to keep its block map. It also conducts in-place-update for dontmove
file.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In orangefs_devreq_read, there is a loop which picks an op off the list
of pending ops. If the loop fails to find an op, there is nothing to
read, and it returns EAGAIN. If the op has been given up on, the loop
is restarted via a goto. The bug is that the variable which the found
op is written to is not reinitialized, so if there are no more eligible
ops on the list, the code runs again on the already handled op.
This is triggered by interrupting a process while the op is being copied
to the client-core. It's a fairly small window, but it's there.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
set_op_state_purged can delete the op.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is no other parent for device_list_add() except for
btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices
if device_list_add is successful and this can be done with in
device_list_add() itself.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit 60999ca4b403 ("btrfs: make device scan less noisy")
adds return value 1 to device_list_add(), so that parent function can
call pr_info only when new device is added. Move the pr_info() part
into device_list_add() so that this function can be kept simple.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The btrfs_free_stale_devices() is updated to match for the given device
path and delete it. (It searches for only unmounted list of devices.)
Also drop the comment about different path being used for the same
device, since now we will have cli to clean any device that's not a
concern any more.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
No functional changes.
Rename btrfs_free_stale_devices() arg to skip_dev, so that it
reflects what that arg for.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This updates btrfs_free_stale_devices() helper function to delete all
unmouted devices, when arg is NULL.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Let the list iterator iterate further and find other stale
devices and delete it. This is in preparation to add support
for user land request-able stale devices cleanup. Also rename
btrfs_free_stale_device() to btrfs_free_stale_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no need to check for btrfs_fs_devices::seeding when we
have checked for btrfs_fs_devices::opened, because we can't sprout
without its seed FS being opened.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's not good to crash the machine if panic_on_warn() is set just
because someone made a stupid mistake of trying to create a sysfs file
with the same name of an existing one. This makes the automated testing
tools a lot harder to find the real bugs in the kernel.
So just print a warning out and dump the stack to get the attention of
the developer that they did something foolish. Then keep on trucking,
as this should not be a fatal error at all.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
No functional changes, just makes the code more readable
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In order to debug subtle bugs around merge_extent_mapping(), perf probe
can be used to check the arguments, but sometimes merge_extent_mapping()
got inlined by compiler and couldn't be probed.
This is adding noinline attribute to merge_extent_mapping().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is a subtle case, so in order to understand the problem, it'd be good
to know the content of existing and em when any error occurs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This test case simulates the racy situation of dio write vs dio read,
and see if btrfs_get_extent() would return -EEXIST.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This test case simulates the racy situation of buffered write vs dio
read, and see if btrfs_get_extent() would return -EEXIST.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We've observed that btrfs_get_extent() and merge_extent_mapping() could
return -EEXIST in several cases, and they are caused by some racy
condition, e.g dio read vs dio write, which makes the problem very tricky
to reproduce.
This adds extent map selftests in order to simulate those racy situations.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
[ minor string adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These helpers are extent map specific, move them to extent_map.c.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is a prepare work for the following extent map selftest, which
runs tests against em merge logic.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This fixes a corner case that is caused by a race of dio write vs dio
read/write.
Here is how the race could happen.
Suppose that no extent map has been loaded into memory yet.
There is a file extent [0, 32K), two jobs are running concurrently
against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
read from [0, 4K) or [4K, 8K).
t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).
------------------------------------------------------
t1 t2
btrfs_get_blocks_direct() btrfs_get_blocks_direct()
-> btrfs_get_extent() -> btrfs_get_extent()
-> lookup_extent_mapping()
-> add_extent_mapping() -> lookup_extent_mapping()
# load [0, 32K)
-> btrfs_new_extent_direct()
-> btrfs_drop_extent_cache()
# split [0, 32K) and
# drop [8K, 32K)
-> add_extent_mapping()
# add [8K, 32K)
-> add_extent_mapping()
# handle -EEXIST when adding
# [0, 32K)
------------------------------------------------------
About how t2(dio read/write) runs into -EEXIST:
a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),
b) search_extent_mapping() then returns [0, 8k) as the existing em,
even though start == existing->start, em is [0, 32k) so that
extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,
c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
(with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,
d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
which is confusing applications.
Here I conclude all the possible situations,
1) start < existing->start
+-----------+em+-----------+
+--prev---+ | +-------------+ |
| | | | | |
+---------+ + +---+existing++ ++
+
|
+
start
2) start == existing->start
+------------em------------+
| +-------------+ |
| | | |
+ +----existing-+ +
|
|
+
start
3) start > existing->start && start < (existing->start + existing->len)
+------------em------------+
| +-------------+ |
| | | |
+ +----existing-+ +
|
|
+
start
4) start >= (existing->start + existing->len)
+-----------+em+-----------+
| +-------------+ | +--next---+
| | | | | |
+ +---+existing++ + +---------+
+
|
+
start
As we can see, it turns out that if start is within existing em (front
inclusive), then the existing em should be returned as is, otherwise,
we try our best to merge candidate em with sibling ems to form a
larger em (in order to reduce the total number of em).
Reported-by: David Vallender <david.vallender@landmark.co.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
%block_len could be checked on deciding if two em are mergeable.
merge_extent_mapping() has only added the front pad if the front part
of em gets truncated, but it's possible that the end part gets
truncated.
For both compressed extent and inline extent, em->block_len is not
adjusted accordingly, and for regular extent, em->block_len always
equals to em->len, hence this sets em->block_len with em->len.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|