Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 3cc64e7ebfb0d7faaba2438334c43466955a96e8 ]
Return value in __load_free_space_cache is not properly set after
(unlikely) memory allocation failures and 0 is returned instead.
This is not a problem for the caller load_free_space_cache because only
value 1 is considered as 'cache loaded' but for clarity it's better
to set the errors accordingly.
Fixes: a67509c30079 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit bbc37d6e475eee8ffa2156ec813efc6bbb43c06d upstream.
If a transaction aborts it can cause a memory leak of the pages array of
a block group's io_ctl structure. The following steps explain how that can
happen:
1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
and it's about to start writing out dirty extent buffers;
2) Transaction N + 1 already started and another task, task A, just called
btrfs_commit_transaction() on it;
3) Block group B was dirtied (extents allocated from it) by transaction
N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
very beginning of the transaction commit, it starts writeback for the
block group's space cache by calling btrfs_write_out_cache(), which
allocates the pages array for the block group's io_ctl with a call to
io_ctl_init(). Block group A is added to the io_list of transaction
N + 1 by btrfs_start_dirty_block_groups();
4) While transaction N's commit is writing out the extent buffers, it gets
an IO error and aborts transaction N, also setting the file system to
RO mode;
5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
btrfs_commit_transaction() and has set transaction N + 1 state to
TRANS_STATE_COMMIT_START. Immediately after that it checks that the
filesystem was turned to RO mode, due to transaction N's abort, and
jumps to the "cleanup_transaction" label. After that we end up at
btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
That helper finds block group B in the transaction's io_list but it
never releases the pages array of the block group's io_ctl, resulting in
a memory leak.
In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
array points to pages that were already released by us at
__btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
up freeing the pages array only after waiting for the ordered extent to
complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
that. But in the transaction abort case we don't wait for the space cache's
ordered extent to complete through a call to btrfs_wait_cache_io(), so
that's why we end up with a memory leak - we wait for the ordered extent
to complete indirectly by shutting down the work queues and waiting for
any jobs in them to complete before returning from close_ctree().
We can solve the leak simply by freeing the pages array right after
releasing the pages (with the call to io_ctl_drop_pages()) at
__btrfs_write_out_cache(), since we will never use it anymore after that
and the pages array points to already released pages at that point, which
is currently not a problem since no one will use it after that, but not a
good practice anyway since it can easily lead to use-after-free issues.
So fix this by freeing the pages array right after releasing the pages at
__btrfs_write_out_cache().
This issue can often be reproduced with test case generic/475 from fstests
and kmemleak can detect it and reports it with the following trace:
unreferenced object 0xffff9bbf009fa600 (size 512):
comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
hex dump (first 32 bytes):
00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff ..|M=...@.|M=...
80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff ..|M=.....|M=...
backtrace:
[<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
[<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
[<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
[<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
[<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
[<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
[<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
[<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
[<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
[<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
[<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
[<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
[<0000000043ace2c9>] do_syscall_64+0x33/0x80
[<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
try_merge_free_space
commit bf53d4687b8f3f6b752f091eb85f62369a515dfd upstream.
In try_to_merge_free_space we attempt to find entries to the left and
right of the entry we are adding to see if they can be merged. We
search for an entry past our current info (saved into right_info), and
then if right_info exists and it has a rb_prev() we save the rb_prev()
into left_info.
However there's a slight problem in the case that we have a right_info,
but no entry previous to that entry. At that point we will search for
an entry just before the info we're attempting to insert. This will
simply find right_info again, and assign it to left_info, making them
both the same pointer.
Now if right_info _can_ be merged with the range we're inserting, we'll
add it to the info and free right_info. However further down we'll
access left_info, which was right_info, and thus get a use-after-free.
Fix this by only searching for the left entry if we don't find a right
entry at all.
The CVE referenced had a specially crafted file system that could
trigger this use-after-free. However with the tree checker improvements
we no longer trigger the conditions for the UAF. But the original
conditions still apply, hence this fix.
Reference: CVE-2019-19448
Fixes: 963030817060 ("Btrfs: use hybrid extents+bitmap rb tree for free space")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3797136b626ad4b6582223660c041efdea8f26b2 upstream.
While testing 5.2 we ran into the following panic
[52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001
[52238.105608] RIP: 0010:drop_buffers+0x3d/0x150
[52238.304051] Call Trace:
[52238.308958] try_to_free_buffers+0x15b/0x1b0
[52238.317503] shrink_page_list+0x1164/0x1780
[52238.325877] shrink_inactive_list+0x18f/0x3b0
[52238.334596] shrink_node_memcg+0x23e/0x7d0
[52238.342790] ? do_shrink_slab+0x4f/0x290
[52238.350648] shrink_node+0xce/0x4a0
[52238.357628] balance_pgdat+0x2c7/0x510
[52238.365135] kswapd+0x216/0x3e0
[52238.371425] ? wait_woken+0x80/0x80
[52238.378412] ? balance_pgdat+0x510/0x510
[52238.386265] kthread+0x111/0x130
[52238.392727] ? kthread_create_on_node+0x60/0x60
[52238.401782] ret_from_fork+0x1f/0x30
The page we were trying to drop had a page->private, but had no
page->mapping and so called drop_buffers, assuming that we had a
buffer_head on the page, and then panic'ed trying to deref 1, which is
our page->private for data pages.
This is happening because we're truncating the free space cache while
we're trying to load the free space cache. This isn't supposed to
happen, and I'll fix that in a followup patch. However we still
shouldn't allow those sort of mistakes to result in messing with pages
that do not belong to us. So add the page->mapping check to verify that
we still own this page after dropping and re-acquiring the page lock.
This page being unlocked as:
btrfs_readpage
extent_read_full_page
__extent_read_full_page
__do_readpage
if (!nr)
unlock_page <-- nr can be 0 only if submit_extent_page
returns an error
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add callchain ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9084cb6a24bf5838a665af92ded1af8363f9e563 upstream.
We were iterating a block group's free space cache rbtree without locking
first the lock that protects it (the free_space_ctl->free_space_offset
rbtree is protected by the free_space_ctl->tree_lock spinlock).
KASAN reported an use-after-free problem when iterating such a rbtree due
to a concurrent rbtree delete:
[ 9520.359168] ==================================================================
[ 9520.359656] BUG: KASAN: use-after-free in rb_next+0x13/0x90
[ 9520.359949] Read of size 8 at addr ffff8800b7ada500 by task btrfs-transacti/1721
[ 9520.360357]
[ 9520.360530] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G L 4.19.0-rc8-nbor #555
[ 9520.360990] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 9520.362682] Call Trace:
[ 9520.362887] dump_stack+0xa4/0xf5
[ 9520.363146] print_address_description+0x78/0x280
[ 9520.363412] kasan_report+0x263/0x390
[ 9520.363650] ? rb_next+0x13/0x90
[ 9520.363873] __asan_load8+0x54/0x90
[ 9520.364102] rb_next+0x13/0x90
[ 9520.364380] btrfs_dump_free_space+0x146/0x160 [btrfs]
[ 9520.364697] dump_space_info+0x2cd/0x310 [btrfs]
[ 9520.364997] btrfs_reserve_extent+0x1ee/0x1f0 [btrfs]
[ 9520.365310] __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs]
[ 9520.365646] ? btrfs_update_time+0x180/0x180 [btrfs]
[ 9520.365923] ? _raw_spin_unlock+0x27/0x40
[ 9520.366204] ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs]
[ 9520.366549] btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs]
[ 9520.366880] cache_save_setup+0x42e/0x580 [btrfs]
[ 9520.367220] ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs]
[ 9520.367518] ? lock_downgrade+0x2f0/0x2f0
[ 9520.367799] ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs]
[ 9520.368104] ? kasan_check_read+0x11/0x20
[ 9520.368349] ? do_raw_spin_unlock+0xa8/0x140
[ 9520.368638] btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs]
[ 9520.368978] ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs]
[ 9520.369282] ? do_raw_spin_unlock+0xa8/0x140
[ 9520.369534] ? _raw_spin_unlock+0x27/0x40
[ 9520.369811] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.370137] commit_cowonly_roots+0x4b9/0x610 [btrfs]
[ 9520.370560] ? commit_fs_roots+0x350/0x350 [btrfs]
[ 9520.370926] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.371285] btrfs_commit_transaction+0x5e5/0x10e0 [btrfs]
[ 9520.371612] ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
[ 9520.371943] ? start_transaction+0x168/0x6c0 [btrfs]
[ 9520.372257] transaction_kthread+0x21c/0x240 [btrfs]
[ 9520.372537] kthread+0x1d2/0x1f0
[ 9520.372793] ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
[ 9520.373090] ? kthread_park+0xb0/0xb0
[ 9520.373329] ret_from_fork+0x3a/0x50
[ 9520.373567]
[ 9520.373738] Allocated by task 1804:
[ 9520.373974] kasan_kmalloc+0xff/0x180
[ 9520.374208] kasan_slab_alloc+0x11/0x20
[ 9520.374447] kmem_cache_alloc+0xfc/0x2d0
[ 9520.374731] __btrfs_add_free_space+0x40/0x580 [btrfs]
[ 9520.375044] unpin_extent_range+0x4f7/0x7a0 [btrfs]
[ 9520.375383] btrfs_finish_extent_commit+0x15f/0x4d0 [btrfs]
[ 9520.375707] btrfs_commit_transaction+0xb06/0x10e0 [btrfs]
[ 9520.376027] btrfs_alloc_data_chunk_ondemand+0x237/0x5c0 [btrfs]
[ 9520.376365] btrfs_check_data_free_space+0x81/0xd0 [btrfs]
[ 9520.376689] btrfs_delalloc_reserve_space+0x25/0x80 [btrfs]
[ 9520.377018] btrfs_direct_IO+0x42e/0x6d0 [btrfs]
[ 9520.377284] generic_file_direct_write+0x11e/0x220
[ 9520.377587] btrfs_file_write_iter+0x472/0xac0 [btrfs]
[ 9520.377875] aio_write+0x25c/0x360
[ 9520.378106] io_submit_one+0xaa0/0xdc0
[ 9520.378343] __se_sys_io_submit+0xfa/0x2f0
[ 9520.378589] __x64_sys_io_submit+0x43/0x50
[ 9520.378840] do_syscall_64+0x7d/0x240
[ 9520.379081] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 9520.379387]
[ 9520.379557] Freed by task 1802:
[ 9520.379782] __kasan_slab_free+0x173/0x260
[ 9520.380028] kasan_slab_free+0xe/0x10
[ 9520.380262] kmem_cache_free+0xc1/0x2c0
[ 9520.380544] btrfs_find_space_for_alloc+0x4cd/0x4e0 [btrfs]
[ 9520.380866] find_free_extent+0xa99/0x17e0 [btrfs]
[ 9520.381166] btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
[ 9520.381474] btrfs_get_blocks_direct+0x60b/0xbd0 [btrfs]
[ 9520.381761] __blockdev_direct_IO+0x10ee/0x58a1
[ 9520.382059] btrfs_direct_IO+0x25a/0x6d0 [btrfs]
[ 9520.382321] generic_file_direct_write+0x11e/0x220
[ 9520.382623] btrfs_file_write_iter+0x472/0xac0 [btrfs]
[ 9520.382904] aio_write+0x25c/0x360
[ 9520.383172] io_submit_one+0xaa0/0xdc0
[ 9520.383416] __se_sys_io_submit+0xfa/0x2f0
[ 9520.383678] __x64_sys_io_submit+0x43/0x50
[ 9520.383927] do_syscall_64+0x7d/0x240
[ 9520.384165] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 9520.384439]
[ 9520.384610] The buggy address belongs to the object at ffff8800b7ada500
which belongs to the cache btrfs_free_space of size 72
[ 9520.385175] The buggy address is located 0 bytes inside of
72-byte region [ffff8800b7ada500, ffff8800b7ada548)
[ 9520.385691] The buggy address belongs to the page:
[ 9520.385957] page:ffffea0002deb680 count:1 mapcount:0 mapping:ffff880108a1d700 index:0x0 compound_mapcount: 0
[ 9520.388030] flags: 0x8100(slab|head)
[ 9520.388281] raw: 0000000000008100 ffffea0002deb608 ffffea0002728808 ffff880108a1d700
[ 9520.388722] raw: 0000000000000000 0000000000130013 00000001ffffffff 0000000000000000
[ 9520.389169] page dumped because: kasan: bad access detected
[ 9520.389473]
[ 9520.389658] Memory state around the buggy address:
[ 9520.389943] ffff8800b7ada400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.390368] ffff8800b7ada480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.390796] >ffff8800b7ada500: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
[ 9520.391223] ^
[ 9520.391461] ffff8800b7ada580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.391885] ffff8800b7ada600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.392313] ==================================================================
[ 9520.392772] BTRFS critical (device vdc): entry offset 2258497536, bytes 131072, bitmap no
[ 9520.393247] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011
[ 9520.393705] PGD 800000010dbab067 P4D 800000010dbab067 PUD 107551067 PMD 0
[ 9520.394059] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 9520.394378] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G B L 4.19.0-rc8-nbor #555
[ 9520.394858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 9520.395350] RIP: 0010:rb_next+0x3c/0x90
[ 9520.396461] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292
[ 9520.396762] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c
[ 9520.397115] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011
[ 9520.397468] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc
[ 9520.397821] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000
[ 9520.398188] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000
[ 9520.398555] FS: 0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000
[ 9520.399007] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9520.399335] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0
[ 9520.399679] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9520.400023] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9520.400400] Call Trace:
[ 9520.400648] btrfs_dump_free_space+0x146/0x160 [btrfs]
[ 9520.400974] dump_space_info+0x2cd/0x310 [btrfs]
[ 9520.401287] btrfs_reserve_extent+0x1ee/0x1f0 [btrfs]
[ 9520.401609] __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs]
[ 9520.401952] ? btrfs_update_time+0x180/0x180 [btrfs]
[ 9520.402232] ? _raw_spin_unlock+0x27/0x40
[ 9520.402522] ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs]
[ 9520.402882] btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs]
[ 9520.403261] cache_save_setup+0x42e/0x580 [btrfs]
[ 9520.403570] ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs]
[ 9520.403871] ? lock_downgrade+0x2f0/0x2f0
[ 9520.404161] ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs]
[ 9520.404481] ? kasan_check_read+0x11/0x20
[ 9520.404732] ? do_raw_spin_unlock+0xa8/0x140
[ 9520.405026] btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs]
[ 9520.405375] ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs]
[ 9520.405694] ? do_raw_spin_unlock+0xa8/0x140
[ 9520.405958] ? _raw_spin_unlock+0x27/0x40
[ 9520.406243] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.406574] commit_cowonly_roots+0x4b9/0x610 [btrfs]
[ 9520.406899] ? commit_fs_roots+0x350/0x350 [btrfs]
[ 9520.407253] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.407589] btrfs_commit_transaction+0x5e5/0x10e0 [btrfs]
[ 9520.407925] ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
[ 9520.408262] ? start_transaction+0x168/0x6c0 [btrfs]
[ 9520.408582] transaction_kthread+0x21c/0x240 [btrfs]
[ 9520.408870] kthread+0x1d2/0x1f0
[ 9520.409138] ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
[ 9520.409440] ? kthread_park+0xb0/0xb0
[ 9520.409682] ret_from_fork+0x3a/0x50
[ 9520.410508] Dumping ftrace buffer:
[ 9520.410764] (ftrace buffer empty)
[ 9520.411007] CR2: 0000000000000011
[ 9520.411297] ---[ end trace 01a0863445cf360a ]---
[ 9520.411568] RIP: 0010:rb_next+0x3c/0x90
[ 9520.412644] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292
[ 9520.412932] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c
[ 9520.413274] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011
[ 9520.413616] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc
[ 9520.414007] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000
[ 9520.414349] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000
[ 9520.416074] FS: 0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000
[ 9520.416536] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9520.416848] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0
[ 9520.418477] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9520.418846] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9520.419204] Kernel panic - not syncing: Fatal exception
[ 9520.419666] Dumping ftrace buffer:
[ 9520.419930] (ftrace buffer empty)
[ 9520.420168] Kernel Offset: disabled
[ 9520.420406] ---[ end Kernel panic - not syncing: Fatal exception ]---
Fix this by acquiring the respective lock before iterating the rbtree.
Reported-by: Nikolay Borisov <nborisov@suse.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ad22cf6ea47fa20fbe11ac324a0a15c0a9a4a2a9 upstream.
We can't use entry->bytes if our entry is a bitmap entry, we need to use
entry->max_extent_size in that case. Fix up all the logic to make this
consistent.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 553cceb49681d60975d00892877d4c871bf220f9 upstream.
We need to clear the max_extent_size when we clear bits from a bitmap
since it could have been from the range that contains the
max_extent_size.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b77000ed558daa3bef0899d29bf171b8c9b5e6a8 ]
If we fail to prepare our pages for whatever reason (out of memory in
our case) we need to make sure to drop the block_group->data_rwsem,
otherwise hilarity ensues.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add label and use existing unlocking code ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
For many printks, we want to know which file system issued the message.
This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.
fs/btrfs/check-integrity.c is left alone for another day.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This patch converts printk(KERN_* style messages to use the pr_* versions.
One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."
This patch unsplits user-visible strings.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
__btrfs_abort_transaction doesn't use its root parameter except to
obtain an fs_info pointer. We can obtain that from trans->root->fs_info
for now and from trans->fs_info in a later patch.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_test_opt and friends only use the root pointer to access
the fs_info. Let's pass the fs_info directly in preparation to
eliminate similar patterns all over btrfs.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
self-tests code assumes 4k as the sectorsize and nodesize. This commit
fix hardcoded 4K. Enables the self-tests code to be executed on non-4k
page sized systems (e.g. ppc64).
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
On ppc64, bytes_per_bitmap will be (65536*8*65536). Hence append UL to
fix integer overflow.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
entries
On a ppc64 machine using 64K as the block size, assume that the RB
tree at btrfs_free_space_ctl->free_space_offset contains following
two entries:
1. A bitmap entry having an offset value of 0 and having the bits
corresponding to the address range [128M+512K, 128M+768K] set.
2. An extent entry corresponding to the address range
[128M-256K, 128M-128K]
In such a scenario, test_check_exists() invoked for checking the
existence of address range [128M+768K, 256M] can lead to an
infinite loop as explained below:
- Checking for the extent entry fails.
- Checking for a bitmap entry results in the free space info in
range [128M+512K, 128M+768K] beng returned.
- rb_prev(info) returns NULL because the bitmap entry starting from
offset 0 comes first in the RB tree.
- current_node = bitmap node.
- while (current_node)
tmp = rb_next(bitmap_node);/*tmp is extent based free space entry*/
Since extent based free space entry's last address is smaller
than the address being searched for (i.e. 128M+768K) we
incorrectly again obtain the extent node as the "next right node"
of the RB tree and thus end up looping infinitely.
This patch fixes the issue by checking the "tmp" variable which point
to the most recently searched free space node.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
|
|
* struct extent_io_ops
* struct btrfs_free_space_op
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.
So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"A couple of small fixes"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: check prepare_uptodate_page() error code earlier
Btrfs: check for empty bitmap list in setup_cluster_bitmaps
btrfs: fix misleading warning when space cache failed to load
Btrfs: fix transaction handle leak in balance
Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.4
|
|
Dave Jones found a warning from kasan in setup_cluster_bitmaps()
==================================================================
BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
addr ffff88039bef6828
Read of size 8 by task nfsd/1009
page:ffffea000e6fbd80 count:0 mapcount:0 mapping: (null)
index:0x0
flags: 0x8000000000000000()
page dumped because: kasan: bad access detected
CPU: 1 PID: 1009 Comm: nfsd Tainted: G W
4.4.0-rc3-backup-debug+ #1
ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e
0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0
ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280
Call Trace:
[<ffffffffa680a43e>] dump_stack+0x4b/0x6d
[<ffffffffa62638d1>] kasan_report_error+0x501/0x520
[<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0
[<ffffffffa6263948>] kasan_report+0x58/0x60
[<ffffffffa6814b00>] ? rb_last+0x10/0x40
[<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0
[<ffffffffa6262ead>] __asan_load8+0x5d/0x70
[<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0
[<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400
[<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640
[<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
[<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
[<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40
[<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520
Andrey noticed this was because we were doing list_first_entry on a list
that might be empty. Rework the tests a bit so we don't do that.
Signed-off-by: Chris Mason <clm@fb.com>
Reprorted-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Reported-by: Dave Jones <dsj@fb.com>
|
|
When an inconsistent space cache is detected during loading we log a
warning that users frequently mistake as instruction to invalidate the
cache manually, even though this is not required. Fix the message to
indicate that the cache will be rebuilt automatically.
Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Acked-by: Filipe Manana <fdmanana@suse.com>
|
|
We've always passed 0. Stack usage will slightly decrease.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
|
|
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.
Let's introduce a helper function which makes the restriction explicit and
easier to track. This patch doesn't introduce any functional changes.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When we make ctl->unit allocations from a bitmap there is no point in searching
for the next 0 in the bitmap. If we've found a bit we're done and can just exit
the loop. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We can waste a lot of time searching through bitmaps when we are heavily
fragmented trying to find large contiguous areas that don't exist in the bitmap.
So keep track of the max extent size when we do a full search of a bitmap so
that next time around we can just skip the expensive searching if our max size
is less than what we are looking for. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If we are extremely fragmented then we won't be able to create a free_cluster.
So if this happens set last_ptr->fragmented so that all future allcations will
give up trying to create a cluster. When we unpin extents we will unset
->fragmented if we free up a sufficient amount of space in a block group.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
In tracking down these weird bitmap problems it was helpful to artificially
create an extremely fragmented file system. These mount options let us either
fragment data or metadata or both. With these options I could reproduce all
sorts of weird latencies and hangs that occur under extreme fragmentation and
get them fixed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
|
|
Just fix a typo in the code comment.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When we clear the dirty bits in btrfs_delete_unused_bgs for extents
in the empty block group, it results in btrfs_finish_extent_commit being
unable to discard the freed extents.
The block group removal patch added an alternate path to forget extents
other than btrfs_finish_extent_commit. As a result, any extents that
would be freed when the block group is removed aren't discarded. In my
test run, with a large copy of mixed sized files followed by removal, it
left nearly 2/3 of extents undiscarded.
To clean up the block groups, we add the removed block group onto a list
that will be discarded after transaction commit.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If the call to btrfs_truncate_inode_items() failed and we don't have a block
group, we were unlocking the cache_write_mutex without having locked it (we
do it only if we have a block group).
Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout
outside critical section in commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If the writeback of an inode cache failed we were unnecessarilly
attempting to release again the delalloc metadata that we previously
reserved. However attempting to do this a second time triggers an
assertion at drop_outstanding_extent() because we have no more
outstanding extents for our inode cache's inode. If we were able
to start writeback of the cache the reserved metadata space is
released at btrfs_finished_ordered_io(), even if an error happens
during writeback.
So make sure we don't repeat the metadata space release if writeback
started for our inode cache.
This issue was trivial to reproduce by running the fstest btrfs/088
with "-o inode_cache", which triggered the assertion leading to a
BUG() call and requiring a reboot in order to run the remaining
fstests. Trace produced by btrfs/088:
[255289.385904] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5276
[255289.388094] ------------[ cut here ]------------
[255289.389184] kernel BUG at fs/btrfs/ctree.h:4057!
[255289.390125] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[255289.392068] Call Trace:
[255289.392068] [<ffffffffa035e774>] drop_outstanding_extent+0x3d/0x6d [btrfs]
[255289.392068] [<ffffffffa0364988>] btrfs_delalloc_release_metadata+0x54/0xe3 [btrfs]
[255289.392068] [<ffffffffa03b4174>] btrfs_write_out_ino_cache+0x95/0xad [btrfs]
[255289.392068] [<ffffffffa036f5c4>] btrfs_save_ino_cache+0x275/0x2dc [btrfs]
[255289.392068] [<ffffffffa03e2d83>] commit_fs_roots.isra.12+0xaa/0x137 [btrfs]
[255289.392068] [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[255289.392068] [<ffffffffa037841f>] ? btrfs_commit_transaction+0x4b1/0x9c9 [btrfs]
[255289.392068] [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
[255289.392068] [<ffffffffa037842e>] btrfs_commit_transaction+0x4c0/0x9c9 [btrfs]
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We were passing a flags value that differed from the intention in commit
2b108268006e ("Btrfs: don't use highmem for free space cache pages").
This caused problems in a ARM machine, leaving btrfs unusable there.
Reported-by: Merlijn Wajer <merlijn@wizzup.org>
Tested-by: Merlijn Wajer <merlijn@wizzup.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid.
When we try to access them later, things will blow up in various ways.
Also fix the comment about the return value, which is an errno on error,
not -1, and update the cases where it was not.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
__btrfs_write_out_cache is holding the ctl->tree_lock while it prepares
a list of bitmaps to record in the free space cache. It was dropping
the lock while it worked on other components, which made a window for
free_bitmap() to free the bitmap struct without removing it from the
list.
This changes things to hold the lock the whole time, and also makes sure
we hold the lock during enospc cleanup.
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
The code to fix stalls during free spache cache IO wasn't using
the correct root when waiting on the IO for inode caches. This
is only a problem when the inode cache is enabled with
mount -o inode_cache
This fixes the inode cache writeout to preserve any error values and
makes sure not to override the root when inode cache writeout is done.
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We loop through all of the dirty block groups during commit and write
the free space cache. In order to make sure the cache is currect, we do
this while no other writers are allowed in the commit.
If a large number of block groups are dirty, this can introduce long
stalls during the final stages of the commit, which can block new procs
trying to change the filesystem.
This commit changes the block group cache writeout to take appropriate
locks and allow it to run earlier in the commit. We'll still have to
redo some of the block groups, but it means we can get most of the work
out of the way without blocking the entire FS.
Signed-off-by: Chris Mason <clm@fb.com>
|
|
In order to create the free space cache concurrently with FS modifications,
we need to take a few block group locks.
The cache code also does kmap, which would schedule with the locks held.
Instead of going through kmap_atomic, lets just use lowmem for the cache
pages.
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Block group cache writeout is currently waiting on the pages for each
block group cache before moving on to writing the next one. This commit
switches things around to send down all the caches and then wait on them
in batches.
The end result is much faster, since we're keeping the disk pipeline
full.
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We'll need to put the io_ctl into the block_group cache struct, so
name it struct btrfs_io_ctl and move it into ctree.h
Signed-off-by: Chris Mason <clm@fb.com>
|
|
When we are deleting large files with large extents, we are building up
a huge set of delayed refs for processing. Truncate isn't checking
often enough to see if we need to back off and process those, or let
a commit proceed.
The end result is long stalls after the rm, and very long commit times.
During the commits, other processes back up waiting to start new
transactions and we get into trouble.
Signed-off-by: Chris Mason <clm@fb.com>
|