Age | Commit message (Collapse) | Author | Files | Lines |
|
commit 5c9ee4cbf2a945271f25b89b137f2c03bbc3be33 upstream.
When resizing, it firstly extends the last gd. Once it should backup
super in the gd, it calculates new backup super and update the
corresponding value.
But it currently doesn't consider the situation that the backup super is
already done. And in this case, it still sets the bit in gd bitmap and
then decrease from bg_free_bits_count, which leads to a corrupted gd and
trigger the BUG in ocfs2_block_group_set_bits:
BUG_ON(le16_to_cpu(bg->bg_free_bits_count) < num_bits);
So check whether the backup super is done and then do the updates.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 4ad78628445d26e5e9487b2e8f23274ad7b0f5d3 upstream.
For block devices the pagecache is associated with the inode
on bdevfs, not with the aliasing ones on the mountable filesystems.
The latter have its own ->i_data empty and ->i_mapping pointing
to the (unique per major/minor) bdevfs inode. That guarantees
cache coherence between all block device inodes with the same
device number.
Eviction of an alias inode has no business trying to evict the
pages belonging to bdevfs one; moreover, ->i_mapping is only
safe to access when the thing is opened. At the time of
->evict_inode() the victim is definitely *not* opened. We are
about to kill the address space embedded into struct inode
(inode->i_data) and that's what we need to empty of any pages.
9p instance tries to empty inode->i_mapping instead, which is
both unsafe and bogus - if we have several device nodes with
the same device number in different places, closing one of them
should not try to empty the (shared) page cache.
Fortunately, other instances in the tree are OK; they are
evicting from &inode->i_data instead, as 9p one should.
Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
Tested-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit bc23f0c8d7ccd8d924c4e70ce311288cb3e61ea8 upstream.
Ted and Namjae have reported that truncated pages don't get timely
reclaimed after being truncated in data=journal mode. The following test
triggers the issue easily:
for (i = 0; i < 1000; i++) {
pwrite(fd, buf, 1024*1024, 0);
fsync(fd);
fsync(fd);
ftruncate(fd, 0);
}
The reason is that journal_unmap_buffer() finds that truncated buffers
are not journalled (jh->b_transaction == NULL), they are part of
checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have
been already written out (!buffer_dirty(bh)). We clean such buffers but
we leave them in the checkpoint list. Since checkpoint transaction holds
a reference to the journal head, these buffers cannot be released until
the checkpoint transaction is cleaned up. And at that point we don't
call release_buffer_page() anymore so pages detached from mapping are
lingering in the system waiting for reclaim to find them and free them.
Fix the problem by removing buffers from transaction checkpoint lists
when journal_unmap_buffer() finds out they don't have to be there
anymore.
Reported-and-tested-by: Namjae Jeon <namjae.jeon@samsung.com>
Fixes: de1b794130b130e77ffa975bb58cb843744f9ae5
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit a4dad1ae24f850410c4e60f22823cba1289b8d52 upstream.
In ext4, the bottom two bits of {a,c,m}time_extra are used to extend
the {a,c,m}time fields, deferring the year 2038 problem to the year
2446.
When decoding these extended fields, for times whose bottom 32 bits
would represent a negative number, sign extension causes the 64-bit
extended timestamp to be negative as well, which is not what's
intended. This patch corrects that issue, so that the only negative
{a,c,m}times are those between 1901 and 1970 (as per 32-bit signed
timestamps).
Some older kernels might have written pre-1970 dates with 1,1 in the
extra bits. This patch treats those incorrectly-encoded dates as
pre-1970, instead of post-2311, until kernel 4.20 is released.
Hopefully by then e2fsck will have fixed up the bad data.
Also add a comment explaining the encoding of ext4's extra {a,c,m}time
bits.
Signed-off-by: David Turner <novalis@novalis.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Mark Harris <mh8928@yahoo.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=23732
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 3ca8138f014a913f98e6ef40e939868e1e9ea876 upstream.
I got a report about unkillable task eating CPU. Further
investigation shows, that the problem is in the fuse_fill_write_pages()
function. If iov's first segment has zero length, we get an infinite
loop, because we never reach iov_iter_advance() call.
Fix this by calling iov_iter_advance() before repeating an attempt to
copy data from userspace.
A similar problem is described in 124d3b7041f ("fix writev regression:
pan hanging unkillable and un-straceable"). If zero-length segmend
is followed by segment with invalid address,
iov_iter_fault_in_readable() checks only first segment (zero-length),
iov_iter_copy_from_user_atomic() skips it, fails at second and
returns zero -> goto again without skipping zero-length segment.
Patch calls iov_iter_advance() before goto again: we'll skip zero-length
segment at second iteraction and iov_iter_fault_in_readable() will detect
invalid address.
Special thanks to Konstantin Khlebnikov, who helped a lot with the commit
description.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: ea9b9907b82a ("fuse: implement perform_write")
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 0ebf7f10d67a70e120f365018f1c5fce9ddc567d upstream.
The thing got broken back in 2002 - sysvfs does *not* have inline
symlinks; even short ones have bodies stored in the first block
of file. sysv_symlink() handles that correctly; unfortunately,
attempting to look an existing symlink up will end up confusing
them for inline symlinks, and interpret the block number containing
the body as the body itself.
Nobody has noticed until now, which says something about the level
of testing sysvfs gets ;-/
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit c812012f9ca7cf89c9e1a1cd512e6c3b5be04b85 upstream.
If we pass in an empty nfs_fattr struct to nfs_update_inode, it will
(correctly) not update any of the attributes, but it then clears the
NFS_INO_INVALID_ATTR flag, which indicates that the attributes are
up to date. Don't clear the flag if the fattr struct has no valid
attrs to apply.
Reviewed-by: Steve French <steve.french@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit c2489e07c0a71a56fb2c84bc0ee66cddfca7d068 upstream.
The following test program from Dmitry can cause softlockups or RCU
stalls as it copies 1GB from tmpfs into eventfd and we don't have any
scheduling point at that path in sendfile(2) implementation:
int r1 = eventfd(0, 0);
int r2 = memfd_create("", 0);
unsigned long n = 1<<30;
fallocate(r2, 0, 0, n);
sendfile(r1, r2, 0, n);
Add cond_resched() into __splice_from_pipe() to fix the problem.
CC: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit c725bfce7968009756ed2836a8cd7ba4dc163011 upstream.
Commit 296291cdd162 (mm: make sendfile(2) killable) fixed an issue where
sendfile(2) was doing a lot of tiny writes into a filesystem and thus
was unkillable for a long time. However sendfile(2) can be (mis)used to
issue lots of writes into arbitrary file descriptor such as evenfd or
similar special file descriptors which never hit the standard filesystem
write path and thus are still unkillable. E.g. the following example
from Dmitry burns CPU for ~16s on my test system without possibility to
be killed:
int r1 = eventfd(0, 0);
int r2 = memfd_create("", 0);
unsigned long n = 1<<30;
fallocate(r2, 0, 0, n);
sendfile(r1, r2, 0, n);
There are actually quite a few tests for pending signals in sendfile
code however we data to write is always available none of them seems to
trigger. So fix the problem by adding a test for pending signal into
splice_from_pipe_next() also before the loop waiting for pipe buffers to
be available. This should fix all the lockup issues with sendfile of the
do-ton-of-tiny-writes nature.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit b582ef5c53040c5feef4c96a8f9585b6831e2441 upstream.
Do not clobber the buffer space passed from `search_binary_handler' and
originally preloaded by `prepare_binprm' with the executable's file
header by overwriting it with its interpreter's file header. Instead
keep the buffer space intact and directly use the data structure locally
allocated for the interpreter's file header, fixing a bug introduced in
2.1.14 with loadable module support (linux-mips.org commit beb11695
[Import of Linux/MIPS 2.1.14], predating kernel.org repo's history).
Adjust the amount of data read from the interpreter's file accordingly.
This was not an issue before loadable module support, because back then
`load_elf_binary' was executed only once for a given ELF executable,
whether the function succeeded or failed.
With loadable module support supported and enabled, upon a failure of
`load_elf_binary' -- which may for example be caused by architecture
code rejecting an executable due to a missing hardware feature requested
in the file header -- a module load is attempted and then the function
reexecuted by `search_binary_handler'. With the executable's file
header replaced with its interpreter's file header the executable can
then be erroneously accepted in this subsequent attempt.
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 1d512cb77bdbda80f0dd0620a3b260d697fd581d upstream.
If we are using the NO_HOLES feature, we have a tiny time window when
running delalloc for a nodatacow inode where we can race with a concurrent
link or xattr add operation leading to a BUG_ON.
This happens because at run_delalloc_nocow() we end up casting a leaf item
of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
file extent item (struct btrfs_file_extent_item) and then analyse its
extent type field, which won't match any of the expected extent types
(values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
explicit BUG_ON(1).
The following sequence diagram shows how the race happens when running a
no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
neighbour leafs:
Leaf X (has N items) Leaf Y
[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ]
slot N - 2 slot N - 1 slot 0
(Note the implicit hole for inode 257 regarding the [0, 8K[ range)
CPU 1 CPU 2
run_dealloc_nocow()
btrfs_lookup_file_extent()
--> searches for a key with value
(257 EXTENT_DATA 4096) in the
fs/subvol tree
--> returns us a path with
path->nodes[0] == leaf X and
path->slots[0] == N
because path->slots[0] is >=
btrfs_header_nritems(leaf X), it
calls btrfs_next_leaf()
btrfs_next_leaf()
--> releases the path
hard link added to our inode,
with key (257 INODE_REF 500)
added to the end of leaf X,
so leaf X now has N + 1 keys
--> searches for the key
(257 INODE_REF 256), because
it was the last key in leaf X
before it released the path,
with path->keep_locks set to 1
--> ends up at leaf X again and
it verifies that the key
(257 INODE_REF 256) is no longer
the last key in the leaf, so it
returns with path->nodes[0] ==
leaf X and path->slots[0] == N,
pointing to the new item with
key (257 INODE_REF 500)
the loop iteration of run_dealloc_nocow()
does not break out the loop and continues
because the key referenced in the path
at path->nodes[0] and path->slots[0] is
for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
and its offset (500) is less then our delalloc
range's end (8192)
the item pointed by the path, an inode reference item,
is (incorrectly) interpreted as a file extent item and
we get an invalid extent type, leading to the BUG_ON(1):
if (extent_type == BTRFS_FILE_EXTENT_REG ||
extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
(...)
} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
(...)
} else {
BUG_ON(1)
}
The same can happen if a xattr is added concurrently and ends up having
a key with an offset smaller then the delalloc's range end.
So fix this by skipping keys with a type smaller than
BTRFS_EXTENT_DATA_KEY.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 4327ba52afd03fc4b5afa0ee1d774c9c5b0e85c5 upstream.
If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
journaling will be aborted first and the error number will be recorded
into JBD2 superblock and, finally, the system will enter into the
panic state in "errors=panic" option. But, in the rare case, this
sequence is little twisted like the below figure and it will happen
that the system enters into panic state, which means the system reset
in mobile environment, before completion of recording an error in the
journal superblock. In this case, e2fsck cannot recognize that the
filesystem failure occurred in the previous run and the corruption
wouldn't be fixed.
Task A Task B
ext4_handle_error()
-> jbd2_journal_abort()
-> __journal_abort_soft()
-> __jbd2_journal_abort_hard()
| -> journal->j_flags |= JBD2_ABORT;
|
| __ext4_abort()
| -> jbd2_journal_abort()
| | -> __journal_abort_soft()
| | -> if (journal->j_flags & JBD2_ABORT)
| | return;
| -> panic()
|
-> jbd2_journal_update_sb_errno()
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 102f4d900c9c8f5ed89ae4746d493fe3ebd7ba64 upstream.
Handle a write being requested to the page immediately beyond the EOF
marker on a cache object. Currently this gets an assertion failure in
CacheFiles because the EOF marker is used there to encode information about
a partial page at the EOF - which could lead to an unknown blank spot in
the file if we extend the file over it.
The problem is actually in fscache where we check the index of the page
being written against store_limit. store_limit is set to the number of
pages that we're allowed to store by fscache_set_store_limit() - which
means it's one more than the index of the last page we're allowed to store.
The problem is that we permit writing to a page with an index _equal_ to
the store limit - when we should reject that case.
Whilst we're at it, change the triggered assertion in CacheFiles to just
return -ENOBUFS instead.
The assertion failure looks something like this:
CacheFiles: Assertion failed
1000 < 7b1 is false
------------[ cut here ]------------
kernel BUG at fs/cachefiles/rdwr.c:962!
...
RIP: 0010:[<ffffffffa02c9e83>] [<ffffffffa02c9e83>] cachefiles_write_page+0x273/0x2d0 [cachefiles]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit b130ed5998e62879a66bad08931a2b5e832da95c upstream.
Only override netfs->primary_index when registering success.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[lizf: Backported to 3.4: there are no n_active and flags in primary_index]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 86108c2e34a26e4bec3c6ddb23390bf8cedcf391 upstream.
If netfs exist, fscache should not increase the reference of parent's
usage and n_children, otherwise, never be decreased.
v2: thanks David's suggest,
move increasing reference of parent if success
use kmem_cache_free() freeing primary_index directly
v3: don't move "netfs->primary_index->parent = &fscache_fsdef_index;"
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 0ff28d9f4674d781e492bcff6f32f0fe48cf0fed upstream.
Using sendfile with below small program to get MD5 sums of some files,
it appear that big files (over 64kbytes with 4k pages system) get a
wrong MD5 sum while small files get the correct sum.
This program uses sendfile() to send a file to an AF_ALG socket
for hashing.
/* md5sum2.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <linux/if_alg.h>
int main(int argc, char **argv)
{
int sk = socket(AF_ALG, SOCK_SEQPACKET, 0);
struct stat st;
struct sockaddr_alg sa = {
.salg_family = AF_ALG,
.salg_type = "hash",
.salg_name = "md5",
};
int n;
bind(sk, (struct sockaddr*)&sa, sizeof(sa));
for (n = 1; n < argc; n++) {
int size;
int offset = 0;
char buf[4096];
int fd;
int sko;
int i;
fd = open(argv[n], O_RDONLY);
sko = accept(sk, NULL, 0);
fstat(fd, &st);
size = st.st_size;
sendfile(sko, fd, &offset, size);
size = read(sko, buf, sizeof(buf));
for (i = 0; i < size; i++)
printf("%2.2x", buf[i]);
printf(" %s\n", argv[n]);
close(fd);
close(sko);
}
exit(0);
}
Test below is done using official linux patch files. First result is
with a software based md5sum. Second result is with the program above.
root@vgoip:~# ls -l patch-3.6.*
-rw-r--r-- 1 root root 64011 Aug 24 12:01 patch-3.6.2.gz
-rw-r--r-- 1 root root 94131 Aug 24 12:01 patch-3.6.3.gz
root@vgoip:~# md5sum patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz
root@vgoip:~# ./md5sum2 patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
5fd77b24e68bb24dcc72d6e57c64790e patch-3.6.3.gz
After investivation, it appears that sendfile() sends the files by blocks
of 64kbytes (16 times PAGE_SIZE). The problem is that at the end of each
block, the SPLICE_F_MORE flag is missing, therefore the hashing operation
is reset as if it was the end of the file.
This patch adds SPLICE_F_MORE to the flags when more data is pending.
With the patch applied, we get the correct sums:
root@vgoip:~# md5sum patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz
root@vgoip:~# ./md5sum2 patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@fb.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
Quoting the RHEL advisory:
> It was found that the fix for CVE-2015-1805 incorrectly kept buffer
> offset and buffer length in sync on a failed atomic read, potentially
> resulting in a pipe buffer state corruption. A local, unprivileged user
> could use this flaw to crash the system or leak kernel memory to user
> space. (CVE-2016-0774, Moderate)
The same flawed fix was applied to stable branches from 2.6.32.y to
3.14.y inclusive, and I was able to reproduce the issue on 3.2.y.
We need to give pipe_iov_copy_to_user() a separate offset variable
and only update the buffer offset if it succeeds.
References: https://rhn.redhat.com/errata/RHSA-2016-0103.html
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Jeffrey Vander Stoep <jeffv@google.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 012572d4fc2e4ddd5c8ec8614d51414ec6cae02a upstream.
The order of the following three spinlocks should be:
dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock
But dlm_dispatch_assert_master() is called while holding
dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls
dlm_grab() which will take dlm_domain_lock.
Once another thread (for example, dlm_query_join_handler) has already
taken dlm_domain_lock, and tries to take dlm_ctxt->spinlock deadlock
happens.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: "Junxiao Bi" <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 98ce94c8df762d413b3ecb849e2b966b21606d04 upstream.
Linux cifs mount with ntlmssp against an Mac OS X (Yosemite
10.10.5) share fails in case the clocks differ more than +/-2h:
digest-service: digest-request: od failed with 2 proto=ntlmv2
digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2
Fix this by (re-)using the given server timestamp for the
ntlmv2 authentication (as Windows 7 does).
A related problem was also reported earlier by Namjae Jaen (see below):
Windows machine has extended security feature which refuse to allow
authentication when there is time difference between server time and
client time when ntlmv2 negotiation is used. This problem is prevalent
in embedded enviornment where system time is set to default 1970.
Modern servers send the server timestamp in the TargetInfo Av_Pair
structure in the challenge message [see MS-NLMP 2.2.2.1]
In [MS-NLMP 3.1.5.1.2] it is explicitly mentioned that the client must
use the server provided timestamp if present OR current time if it is
not
Reported-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Signed-off-by: Steve French <smfrench@gmail.com>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit a30e577c96f59b1e1678ea5462432b09bf7d5cbc upstream.
In btrfs_evict_inode, we properly truncate the page cache for evicted
inodes but then we call btrfs_wait_ordered_range for every inode as well.
It's the right thing to do for regular files but results in incorrect
behavior for device inodes for block devices.
filemap_fdatawrite_range gets called with inode->i_mapping which gets
resolved to the block device inode before getting passed to
wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens
next depends on whether there's an open file handle associated with the
inode. If there is, we write to the block device, which is unexpected
behavior. If there isn't, we through normally and inode->i_data is used.
We can also end up racing against open/close which can result in crashes
when i_mapping points to a block device inode that has been closed.
Since there can't be any page cache associated with special file inodes,
it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
the problem.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit b4cc0efea4f0bfa2477c56af406cfcf3d3e58680 upstream.
Fix B-tree corruption when a new record is inserted at position 0 in the
node in hfs_brec_insert().
This is an identical change to the corresponding hfs b-tree code to Sergei
Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
to keep similar code paths in the hfs and hfsplus drivers in sync, where
appropriate.
Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Cc: Sergei Antonov <saproj@gmail.com>
Cc: Joe Perches <joe@perches.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 7cb74be6fd827e314f81df3c5889b87e4c87c569 upstream.
Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
hfs_bnode_find() for finding or creating pages corresponding to an inode)
are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
and should not be page_cache_release()'ed until hfs_bnode_free().
This patch fixes a problem I first saw in July 2012: merely running "du"
on a large hfsplus-mounted directory a few times on a reasonably loaded
system would get the hfsplus driver all confused and complaining about
B-tree inconsistencies, and generates a "BUG: Bad page state". Most
recently, I can generate this problem on up-to-date Fedora 22 with shipped
kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
lightly-used QEMU VM image of the full Mac OS X 10.9:
$ df -i / /home /mnt
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/fedora-root 3276800 551665 2725135 17% /
/dev/mapper/fedora-home 52879360 716221 52163139 2% /home
/dev/nbd0p2 4294967295 1387818 4293579477 1% /mnt
After applying the patch, I was able to run "du /" (60+ times) and "du
/mnt" (150+ times) continuously and simultaneously for 6+ hours.
There are many reports of the hfsplus driver getting confused under load
and generating "BUG: Bad page state" or other similar issues over the
years. [1]
The unpatched code [2] has always been wrong since it entered the kernel
tree. The only reason why it gets away with it is that the
kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
the kernel has not had a chance to reuse the memory for something else,
most of the time.
The current RW driver appears to have followed the design and development
of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec
2001) had a B-tree node-centric approach to
read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
migrating towards version 0.2 (June 2002) of caching and releasing pages
per inode extents. When the current RW code first entered the kernel [2]
in 2005, there was an REF_PAGES conditional (and "//" commented out code)
to switch between B-node centric paging to inode-centric paging. There
was a mistake with the direction of one of the REF_PAGES conditionals in
__hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
removed, but a page_cache_release() was mistakenly left in (propagating
the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out
page_cache_release() in bnode_release() (which should be spanned by
!REF_PAGES) was never enabled.
References:
[1]:
Michael Fox, Apr 2013
http://www.spinics.net/lists/linux-fsdevel/msg63807.html
("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")
Sasha Levin, Feb 2015
http://lkml.org/lkml/2015/2/20/85 ("use after free")
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
https://bugzilla.kernel.org/show_bug.cgi?id=42342
https://bugzilla.kernel.org/show_bug.cgi?id=63841
https://bugzilla.kernel.org/show_bug.cgi?id=78761
[2]:
http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
Author: Andrew Morton <akpm@osdl.org>
Date: Wed Feb 25 16:17:36 2004 -0800
[PATCH] HFS rewrite
http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd
commit 91556682e0bf004d98a529bf829d339abb98bbbd
Author: Andrew Morton <akpm@osdl.org>
Date: Wed Feb 25 16:17:48 2004 -0800
[PATCH] HFS+ support
[3]:
http://sourceforge.net/projects/linux-hfsplus/
http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/
http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
fs/hfsplus/bnode.c?r1=1.4&r2=1.5
Date: Thu Jun 6 09:45:14 2002 +0000
Use buffer cache instead of page cache in bnode.c. Cache inode extents.
[4]:
http://git.kernel.org/cgit/linux/kernel/git/\
stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d
commit a5e3985fa014029eb6795664c704953720cc7f7d
Author: Roman Zippel <zippel@linux-m68k.org>
Date: Tue Sep 6 15:18:47 2005 -0700
[PATCH] hfs: remove debug code
Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Sougata Santra <sougata@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream.
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g. new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else. This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.
Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
of "sudo" is something more sneaky:
$ BASE="ovl"
$ MNT="$BASE/mnt"
$ LOW="$BASE/lower"
$ UP="$BASE/upper"
$ WORK="$BASE/work/ 0 0
none /proc fuse.pwn user_id=1000"
$ mkdir -p "$LOW" "$UP" "$WORK"
$ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
$ cat /proc/mounts
none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
none /proc fuse.pwn user_id=1000 0 0
$ fusermount -u /proc
$ cat /proc/mounts
cat: /proc/mounts: No such file or directory
This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed. Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.
[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lizf: Backported to 3.4:
- adjust context
- one more place in ceph needs to be changed
- drop changes to overlayfs
- drop showing vers in cifs]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit f49a26e7718dd30b49e3541e3e25aecf5e7294e2 upstream.
Update ctime and mtime when a directory is modified. (though OS/2 doesn't
update them anyway)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 5556e7e6d30e8e9b5ee51b0e5edd526ee80e5e36 upstream.
Consider eCryptfs dcache entries to be stale when the corresponding
lower inode's i_nlink count is zero. This solves a problem caused by the
lower inode being directly modified, without going through the eCryptfs
mount, leaving stale eCryptfs dentries cached and the eCryptfs inode's
i_nlink count not being cleared.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Richard Weinberger <richard@nod.at>
[bwh: Backported to 3.2:
- Test d_revalidate pointer directly rather than a DCACHE_OP flag
- Open-code d_inode()
- Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit efcbc04e16dfa95fef76309f89710dd1d99a5453 upstream.
It is unusual to combine the open flags O_RDONLY and O_EXCL, but
it appears that libre-office does just that.
[pid 3250] stat("/home/USER/.config", {st_mode=S_IFDIR|0700, st_size=8192, ...}) = 0
[pid 3250] open("/home/USER/.config/libreoffice/4-suse/user/extensions/buildid", O_RDONLY|O_EXCL <unfinished ...>
NFSv4 takes O_EXCL as a sign that a setattr command should be sent,
probably to reset the timestamps.
When it was an O_RDONLY open, the SETATTR command does not
identify any actual attributes to change.
If no delegation was provided to the open, the SETATTR uses the
all-zeros stateid and the request is accepted (at least by the
Linux NFS server - no harm, no foul).
If a read-delegation was provided, this is used in the SETATTR
request, and a Netapp filer will justifiably claim
NFS4ERR_BAD_STATEID, which the Linux client takes as a sign
to retry - indefinitely.
So only treat O_EXCL specially if O_CREAT was also given.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit eee5cc2702929fd41cce28058dc6d6717f723f87 upstream.
The only thing we need it for is alt-sysrq-r (emergency remount r/o)
and these days we can do just as well without going through the
list of files.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 209f7512d007980fd111a74a064d70a3656079cf upstream.
The "BUG_ON(list_empty(&osb->blocked_lock_list))" in
ocfs2_downconvert_thread_do_work can be triggered in the following case:
ocfs2dc has firstly saved osb->blocked_lock_count to local varibale
processed, and then processes the dentry lockres. During the dentry
put, it calls iput and then deletes rw, inode and open lockres from
blocked list in ocfs2_mark_lockres_freeing. And this causes the
variable `processed' to not reflect the number of blocked lockres to be
processed, which triggers the BUG.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 0a73d0a204a4a04a1e110539c5a524ae51f91d6d upstream.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit a28e4b2b18ccb90df402da3f21e1a83c9d4f8ec1 upstream.
Removing unnecessary static buffers is good.
Use the vsprintf %pV extension instead.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Mikulas:
- The bug corrected by the patch is - if hpfs_error is called concurrently
on multiple filesystems, it could corrupt the string because the text
buffer is shared. That's why I marked the patch for stable.]
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit ce657611baf902f14ae559ce4e0787ead6712067 upstream.
There is a possibility of nothing being allocated to the new_opts in
case of memory pressure, therefore return ENOMEM for such case.
Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 7444a072c387a93ebee7066e8aee776954ab0e41 upstream.
ext4_free_blocks is looping around the allocation request and mimics
__GFP_NOFAIL behavior without any allocation fallback strategy. Let's
remove the open coded loop and replace it with __GFP_NOFAIL. Without the
flag the allocator has no way to find out never-fail requirement and
cannot help in any way.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 upstream.
Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible
deadlocks in the page writeback path.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 3b5e6454aaf6b4439b19400d8365e2ec2d24e411 upstream.
A buffer cache is allocated from movable area because it is referred
for a while and released soon. But some filesystems are taking buffer
cache for a long time and it can disturb page migration.
New APIs are introduced to allocate buffer cache with user specific
flag. *_gfp APIs are for user want to set page allocation flag for
page cache allocation. And *_unmovable APIs are for the user wants to
allocate page cache from non-movable area.
Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit c3f4a1685bb87e59c886ee68f7967eae07d4dffa upstream.
The free space entries are allocated using kmem_cache_zalloc(),
through __btrfs_add_free_space(), therefore we should use
kmem_cache_free() and not kfree() to avoid any confusion and
any potential problem. Looking at the kfree() definition at
mm/slab.c it has the following comment:
/*
* (...)
*
* Don't free memory not originally allocated by kmalloc()
* or you will run into trouble.
*/
So better be safe and use kmem_cache_free().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
mount
commit a41cbe86df3afbc82311a1640e20858c0cd7e065 upstream.
A test case is as the description says:
open(foobar, O_WRONLY);
sleep() --> reboot the server
close(foobar)
The bug is because in nfs4state.c in nfs4_reclaim_open_state() a few
line before going to restart, there is
clear_bit(NFS4CLNT_RECLAIM_NOGRACE, &state->flags).
NFS4CLNT_RECLAIM_NOGRACE is a flag for the client states not open
owner states. Value of NFS4CLNT_RECLAIM_NOGRACE is 4 which is the
value of NFS_O_WRONLY_STATE in nfs4_state->flags. So clearing it wipes
out state and when we go to close it, “call_close” doesn’t get set as
state flag is not set and CLOSE doesn’t go on the wire.
Signed-off-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 397d425dc26da728396e66d392d5dcb8dac30c37 upstream.
In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.
Like division by zero .. from an unconnected path can not be given
a useful semantic as there is no predicting at which path component
the code will realize it is unconnected. We certainly can not match
the current behavior as the current behavior is a security hole.
Therefore when encounting .. when following an unconnected path
return -ENOENT.
- Add a function path_connected to verify path->dentry is reachable
from path->mnt.mnt_root. AKA to validate that rename did not do
something nasty to the bind mount.
To avoid races path_connected must be called after following a path
component to it's next path component.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
commit cde93be45a8a90d8c264c776fab63487b5038a65 upstream.
A rename can result in a dentry that by walking up d_parent
will never reach it's mnt_root. For lack of a better term
I call this an escaped path.
prepend_path is called by four different functions __d_path,
d_absolute_path, d_path, and getcwd.
__d_path only wants to see paths are connected to the root it passes
in. So __d_path needs prepend_path to return an error.
d_absolute_path similarly wants to see paths that are connected to
some root. Escaped paths are not connected to any mnt_root so
d_absolute_path needs prepend_path to return an error greater
than 1. So escaped paths will be treated like paths on lazily
unmounted mounts.
getcwd needs to prepend "(unreachable)" so getcwd also needs
prepend_path to return an error.
d_path is the interesting hold out. d_path just wants to print
something, and does not care about the weird cases. Which raises
the question what should be printed?
Given that <escaped_path>/<anything> should result in -ENOENT I
believe it is desirable for escaped paths to be printed as empty
paths. As there are not really any meaninful path components when
considered from the perspective of a mount tree.
So tweak prepend_path to return an empty path with an new error
code of 3 when it encounters an escaped path.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 841df7df196237ea63233f0f9eaa41db53afd70f upstream.
Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
when the journal is aborted. That makes logic in
jbd2_log_do_checkpoint() bail out which is fine, except that
jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
a progress in cleaning the journal. Without it jbd2_journal_destroy()
just loops in an infinite loop.
Fix jbd2_journal_destroy() to cleanup journal checkpoint lists of
jbd2_log_do_checkpoint() fails with error.
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 0ad0b3255a08020eaf50e34ef0d6df5bdf5e09ed upstream.
fc->release is called from fuse_conn_put() which was used in the error
cleanup before fc->release was initialized.
[Jeremiah Mahler <jmmahler@gmail.com>: assign fc->release after calling
fuse_conn_init(fc) instead of before.]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fixes: a325f9b92273 ("fuse: update fuse_conn_init() and separate out fuse_conn_kill()")
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 292db1bc6c105d86111e858859456bcb11f90f91 upstream.
ext4 isn't willing to map clusters to a non-extent file. Don't signal
this with an out of space error, since the FS will retry the
allocation (which didn't fail) forever. Instead, return EUCLEAN so
that the operation will fail immediately all the way back to userspace.
(The fix is either to run e2fsck -E bmap2extent, or to chattr +e the file.)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 89d96a6f8e6491f24fc8f99fd6ae66820e85c6c1 upstream.
Normally all of the buffers will have been forced out to disk before
we call invalidate_bdev(), but there will be some cases, where a file
system operation was aborted due to an ext4_error(), where there may
still be some dirty buffers in the buffer cache for the device. So
try to force them out to memory before calling invalidate_bdev().
This fixes a warning triggered by generic/081:
WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f()
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a upstream.
If updating journal superblock fails after journal data has been
flushed, the error is omitted and this will mislead the caller as a
normal case. In ocfs2, the checkpoint will be treated successfully
and the other node can get the lock to update. Since the sb_start is
still pointing to the old log block, it will rewrite the journal data
during journal recovery by the other node. Thus the new updates will
be overwritten and ocfs2 corrupts. So in above case we have to return
the error, and ocfs2_commit_cache will take care of the error and
prevent the other node to do update first. And only after recovering
journal it can do the new updates.
The issue discussion mail can be found at:
https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
http://comments.gmane.org/gmane.comp.file-systems.ext4/48841
[ Fixed bug in patch which allowed a non-negative error return from
jbd2_cleanup_journal_tail() to leak out of jbd2_fjournal_flush(); this
was causing xfstests ext4/306 to fail. -- Ted ]
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit b4f1afcd068f6e533230dfed00782cd8a907f96b upstream.
jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start()
So allocations should be done with GFP_NOFS
[Full stack trace snipped from 3.10-rh7]
[<ffffffff815c4bd4>] dump_stack+0x19/0x1b
[<ffffffff8105dba1>] warn_slowpath_common+0x61/0x80
[<ffffffff8105dcca>] warn_slowpath_null+0x1a/0x20
[<ffffffff815c2142>] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
[<ffffffff8119c045>] kmem_cache_alloc+0x55/0x210
[<ffffffff811477f5>] ? mempool_alloc_slab+0x15/0x20
[<ffffffff811477f5>] mempool_alloc_slab+0x15/0x20
[<ffffffff81147939>] mempool_alloc+0x69/0x170
[<ffffffff815cb69e>] ? _raw_spin_unlock_irq+0xe/0x20
[<ffffffff8109160d>] ? finish_task_switch+0x5d/0x150
[<ffffffff811f1a8e>] bio_alloc_bioset+0x1be/0x2e0
[<ffffffff8127ee49>] blkdev_issue_flush+0x99/0x120
[<ffffffffa019a733>] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL
[<ffffffffa019aca1>] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2]
[<ffffffffa019afc7>] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2]
[<ffffffffa01952d8>] start_this_handle+0x2d8/0x550 [jbd2]
[<ffffffff811b02a9>] ? __memcg_kmem_put_cache+0x29/0x30
[<ffffffff8119c120>] ? kmem_cache_alloc+0x130/0x210
[<ffffffffa019573a>] jbd2__journal_start+0xba/0x190 [jbd2]
[<ffffffff811532ce>] ? lru_cache_add+0xe/0x10
[<ffffffffa01c9549>] ? ext4_da_write_begin+0xf9/0x330 [ext4]
[<ffffffffa01f2c77>] __ext4_journal_start_sb+0x77/0x160 [ext4]
[<ffffffffa01c9549>] ext4_da_write_begin+0xf9/0x330 [ext4]
[<ffffffff811446ec>] generic_file_buffered_write_iter+0x10c/0x270
[<ffffffff81146918>] __generic_file_write_iter+0x178/0x390
[<ffffffff81146c6b>] __generic_file_aio_write+0x8b/0xb0
[<ffffffff81146ced>] generic_file_aio_write+0x5d/0xc0
[<ffffffffa01bf289>] ext4_file_write+0xa9/0x450 [ext4]
[<ffffffff811c31d9>] ? pipe_read+0x379/0x4f0
[<ffffffff811b93f0>] do_sync_write+0x90/0xe0
[<ffffffff811b9b6d>] vfs_write+0xbd/0x1e0
[<ffffffff811ba5b8>] SyS_write+0x58/0xb0
[<ffffffff815d4799>] system_call_fastpath+0x16/0x1b
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit bdf96838aea6a265f2ae6cbcfb12a778c84a0b8e upstream.
The commit cf108bca465d: "ext4: Invert the locking order of page_lock
and transaction start" caused __ext4_journalled_writepage() to drop
the page lock before the page was written back, as part of changing
the locking order to jbd2_journal_start -> page_lock. However, this
introduced a potential race if there was a truncate racing with the
data=journalled writeback mode.
Fix this by grabbing the page lock after starting the journal handle,
and then checking to see if page had gotten truncated out from under
us.
This fixes a number of different warnings or BUG_ON's when running
xfstests generic/086 in data=journalled mode, including:
jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7
c0, 164), jh->b_transaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0
- and -
kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
...
Call Trace:
[<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
[<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
[<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
[<c027d883>] ? lock_buffer+0x36/0x36
[<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
[<c0229139>] do_invalidatepage+0x22/0x26
[<c0229198>] truncate_inode_page+0x5b/0x85
[<c022934b>] truncate_inode_pages_range+0x156/0x38c
[<c0229592>] truncate_inode_pages+0x11/0x15
[<c022962d>] truncate_pagecache+0x55/0x71
[<c02b913b>] ext4_setattr+0x4a9/0x560
[<c01ca542>] ? current_kernel_time+0x10/0x44
[<c026c4d8>] notify_change+0x1c7/0x2be
[<c0256a00>] do_truncate+0x65/0x85
[<c0226f31>] ? file_ra_state_init+0x12/0x29
- and -
WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
irty_metadata+0x14a/0x1ae()
...
Call Trace:
[<c01b879f>] ? console_unlock+0x3a1/0x3ce
[<c082cbb4>] dump_stack+0x48/0x60
[<c0178b65>] warn_slowpath_common+0x89/0xa0
[<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
[<c0178bef>] warn_slowpath_null+0x14/0x18
[<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
[<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
[<c02b2f44>] write_end_fn+0x40/0x53
[<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
[<c02b59e7>] ext4_writepage+0x354/0x3b8
[<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
[<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
[<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
[<c02b5a5b>] __writepage+0x10/0x2e
[<c0225956>] write_cache_pages+0x22d/0x32c
[<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
[<c02b6ee8>] ext4_writepages+0x102/0x607
[<c019adfe>] ? sched_clock_local+0x10/0x10e
[<c01a8a7c>] ? __lock_is_held+0x2e/0x44
[<c01a8ad5>] ? lock_is_held+0x43/0x51
[<c0226dff>] do_writepages+0x1c/0x29
[<c0276bed>] __writeback_single_inode+0xc3/0x545
[<c0277c07>] writeback_sb_inodes+0x21f/0x36d
...
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit d683cc49daf7c5afca8cd9654aaa1bf63cdf2ad9 upstream.
When encoding the NFSACL SETACL operation, reserve just the estimated
size of the ACL rather than a fixed maximum. This eliminates needless
zero padding on the wire that the server ignores.
Fixes: ee5dc7732bd5 ('NFS: Fix "kernel BUG at fs/nfs/nfs3xdr.c:1338!"')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit e8d975e73e5fa05f983fbf2723120edcf68e0b38 upstream.
Problem: When an operation like WRITE receives a BAD_STATEID, even though
recovery code clears the RECLAIM_NOGRACE recovery flag before recovering
the open state, because of clearing delegation state for the associated
inode, nfs_inode_find_state_and_recover() gets called and it makes the
same state with RECLAIM_NOGRACE flag again. As a results, when we restart
looking over the open states, we end up in the infinite loop instead of
breaking out in the next test of state flags.
Solution: unset the RECLAIM_NOGRACE set because of
calling of nfs_inode_find_state_and_recover() after returning from calling
recover_open() function.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
pipe_iov_copy_{from,to}_user() may be tried twice with the same iovec,
the first time atomically and the second time not. The second attempt
needs to continue from the iovec position, pipe buffer offset and
remaining length where the first attempt failed, but currently the
pipe buffer offset and remaining length are reset. This will corrupt
the piped data (possibly also leading to an information leak between
processes) and may also corrupt kernel memory.
This was fixed upstream by commits f0d1bec9d58d ("new helper:
copy_page_from_iter()") and 637b58c2887e ("switch pipe_read() to
copy_page_to_iter()"), but those aren't suitable for stable. This fix
for older kernel versions was made by Seth Jennings for RHEL and I
have extracted it from their update.
CVE-2015-1805
References: https://bugzilla.redhat.com/show_bug.cgi?id=1202855
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 23b133bdc452aa441fcb9b82cbf6dd05cfd342d0 upstream.
Check length of extended attributes and allocation descriptors when
loading inodes from disk. Otherwise corrupted filesystems could confuse
the code and make the kernel oops.
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: Jan Kara <jack@suse.cz>
[lizf: Backported to 3.4:
- call make_bad_inode() and then return
- relace bs with inode->i_sb->s_blocksize]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 161f873b89136eb1e69477c847d5a5033239d9ba upstream.
We used to read file_handle twice. Once to get the amount of extra
bytes, and once to fetch the entire structure.
This may be problematic since we do size verifications only after the
first read, so if the number of extra bytes changes in userspace between
the first and second calls, we'll have an incoherent view of
file_handle.
Instead, read the constant size once, and copy that over to the final
structure without having to re-read it again.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|