path: root/fs
Age  Commit message  Author  Files  Lines
2016-03-04bio: return EINTR if copying to user space got interruptedHannes Reinecke1-4/+8
commit 2d99b55d378c996b9692a0c93dd25f4ed5d58934 upstream. Commit 35dc248383bbab0a7203fca4d722875bc81ef091 introduced a check for current->mm to see if we have a user space context and only copies data if we do. Now if an IO gets interrupted by a signal data isn't copied into user space any more (as we don't have a user space context) but user space isn't notified about it. This patch modifies the behaviour to return -EINTR from bio_uncopy_user() to notify userland that a signal has interrupted the syscall, otherwise it could lead to a situation where the caller may get a buffer with no data returned. This can be reproduced by issuing SG_IO ioctl()s in one thread while constantly sending signals to it. [js] backport to 3.12 Fixes: 35dc248 [SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
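To illustrate what the new return value means for callers, here is a minimal userspace sketch (not part of the patch) that issues an SG_IO request and simply retries when the ioctl is interrupted by a signal; the SCSI command and field values are only examples.

    #include <errno.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <scsi/sg.h>

    /* Issue a SCSI INQUIRY through SG_IO and retry on EINTR. */
    static int inquiry(int fd, unsigned char *buf, unsigned char len)
    {
            unsigned char cdb[6] = { 0x12, 0, 0, 0, len, 0 };
            struct sg_io_hdr io;
            int ret;

            memset(&io, 0, sizeof(io));
            io.interface_id    = 'S';
            io.dxfer_direction = SG_DXFER_FROM_DEV;
            io.cmdp            = cdb;
            io.cmd_len         = sizeof(cdb);
            io.dxferp          = buf;
            io.dxfer_len       = len;
            io.timeout         = 5000;      /* milliseconds */

            do {
                    ret = ioctl(fd, SG_IO, &io);
            } while (ret < 0 && errno == EINTR);    /* the case this patch now reports */

            return ret;
    }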
2016-03-03locks: fix unlock when fcntl_setlk races with a closeJeff Layton1-21/+30
commit 7f3697e24dc3820b10f445a4a7d914fc356012d1 upstream. Dmitry reported that he was able to reproduce the WARN_ON_ONCE that fires in locks_free_lock_context when the flc_posix list isn't empty. The problem turns out to be that we're basically rebuilding the file_lock from scratch in fcntl_setlk when we discover that the setlk has raced with a close. If the l_whence field is SEEK_CUR or SEEK_END, then we may end up with fl_start and fl_end values that differ from when the lock was initially set, if the file position or length of the file has changed in the interim. Fix this by just reusing the same lock request structure, and simply override fl_type value with F_UNLCK as appropriate. That ensures that we really are unlocking the lock that was initially set. While we're there, make sure that we do pop a WARN_ON_ONCE if the removal ever fails. Also return -EBADF in this event, since that's what we would have returned if the close had happened earlier. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Fixes: c293621bbf67 (stale POSIX lock handling) Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
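To see why rebuilding the unlock request from the caller's struct flock is fragile, consider a relative lock taken from userspace. This is only an illustrative sketch of the caller side, not the kernel fix (which reuses the already-resolved file_lock and merely flips its type to F_UNLCK):

    #include <fcntl.h>
    #include <unistd.h>

    /* Take a write lock on the next 100 bytes, relative to the current
     * file offset.  The absolute range depends on where the offset is
     * at the time the request is interpreted. */
    static int lock_next_100_bytes(int fd)
    {
            struct flock fl = {
                    .l_type   = F_WRLCK,
                    .l_whence = SEEK_CUR,
                    .l_start  = 0,
                    .l_len    = 100,
            };

            return fcntl(fd, F_SETLK, &fl);
            /* If the offset later moves (read/write/lseek), an unlock rebuilt
             * from this same struct flock would resolve to a different range,
             * which is exactly the mismatch the patch avoids. */
    }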
2016-03-03do_last(): don't let a bogus return value from ->open() et.al. to confuse usAl Viro1-0/+4
commit c80567c82ae4814a41287618e315a60ecf513be6 upstream. ... into returning a positive to path_openat(), which would interpret that as "symlink had been encountered" and proceed to corrupt memory, etc. It can only happen due to a bug in some ->open() instance or in some LSM hook, etc., so we report any such event *and* make sure it doesn't trick us into further unpleasantness. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-03NFSv4: Fix a dentry leak on alias useBenjamin Coddington1-2/+2
commit d9dfd8d741683347ee159d25f5b50c346a0df557 upstream. In the case where d_add_unique() finds an appropriate alias to use it will have already incremented the reference count. An additional dget() to swap the open context's dentry is unnecessary and will leak a reference. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Fixes: 275bb307865a3 ("NFSv4: Move dentry instantiation into the NFSv4-...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-03uml: fix hostfs mknod()Vegard Nossum1-3/+1
commit 9f2dfda2f2f1c6181c3732c16b85c59ab2d195e0 upstream. An inverted return value check in hostfs_mknod() caused the function to return success after handling it as an error (and cleaning up). It resulted in the following segfault when trying to bind() a named unix socket: Pid: 198, comm: a.out Not tainted 4.4.0-rc4 RIP: 0033:[<0000000061077df6>] RSP: 00000000daae5d60 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 000000006092a460 RCX: 00000000dfc54208 RDX: 0000000061073ef1 RSI: 0000000000000070 RDI: 00000000e027d600 RBP: 00000000daae5de0 R08: 00000000da980ac0 R09: 0000000000000000 R10: 0000000000000003 R11: 00007fb1ae08f72a R12: 0000000000000000 R13: 000000006092a460 R14: 00000000daaa97c0 R15: 00000000daaa9a88 Kernel panic - not syncing: Kernel mode fault at addr 0x40, ip 0x61077df6 CPU: 0 PID: 198 Comm: a.out Not tainted 4.4.0-rc4 #1 Stack: e027d620 dfc54208 0000006f da981398 61bee000 0000c1ed daae5de0 0000006e e027d620 dfcd4208 00000005 6092a460 Call Trace: [<60dedc67>] SyS_bind+0xf7/0x110 [<600587be>] handle_syscall+0x7e/0x80 [<60066ad7>] userspace+0x3e7/0x4e0 [<6006321f>] ? save_registers+0x1f/0x40 [<6006c88e>] ? arch_prctl+0x1be/0x1f0 [<60054985>] fork_handler+0x85/0x90 Let's also get rid of the "cosmic ray protection" while we're at it. Fixes: e9193059b1b3 "hostfs: fix races in dentry_name() and inode_name()" Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-03Btrfs: fix number of transaction units required to create symlinkFilipe Manana1-1/+3
commit 9269d12b2d57d9e3d13036bb750762d1110d425c upstream. We weren't accounting for the insertion of an inline extent item for the symlink inode nor that we need to update the parent inode item (through the call to btrfs_add_nondir()). So fix this by including two more transaction units. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-03Btrfs: send, don't BUG_ON() when an empty symlink is foundFilipe Manana1-1/+15
commit a879719b8c90e15c9e7fa7266d5e3c0ca962f9df upstream. When a symlink is successfully created it always has an inline extent containing the source path. However if an error happens when creating the symlink, we can leave in the subvolume's tree a symlink inode without any such inline extent item - this happens if after btrfs_symlink() calls btrfs_end_transaction() and before it calls the inode eviction handler (through the final iput() call), the transaction gets committed and a crash happens before the eviction handler gets called, or if a snapshot of the subvolume is made before the eviction handler gets called. Sadly we can't just avoid this by making btrfs_symlink() call btrfs_end_transaction() after it calls the eviction handler, because the latter can commit the current transaction before it removes any items from the subvolume tree (if it encounters ENOSPC errors while reserving space for removing all the items). So make send fail more gracefully, with an -EIO error, and print a message to dmesg/syslog informing that there's an empty symlink inode, so that the user can delete the empty symlink or do something else about it. Reported-by: Stephen R. van den Berg <srb@cuci.nl> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-03Btrfs: igrab inode in writepageJosef Bacik1-2/+15
commit be7bd730841e69fe8f70120098596f648cd1f3ff upstream. We hit this panic on a few of our boxes this week where we have an ordered_extent with a NULL inode. We do an igrab() of the inode in writepages, but weren't doing it in writepage, which can be called directly from the VM on dirty pages. If the inode has been unlinked then we could have I_FREEING set, which means igrab() would return NULL and we get this panic. Fix this by trying to igrab in btrfs_writepage, and if it returns NULL then just redirty the page and return AOP_WRITEPAGE_ACTIVATE so the VM knows it wasn't successful. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-03Btrfs: add missing brelse when superblock checksum failsAnand Jain1-0/+1
commit b2acdddfad13c38a1e8b927d83c3cf321f63601a upstream. Looks like oversight, call brelse() when checksum fails. Further down the code, in the non error path, we do call brelse() and so we don't see brelse() in the goto error paths. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-03vfs: Avoid softlockups with sendfile(2)Jan Kara1-0/+1
commit c2489e07c0a71a56fb2c84bc0ee66cddfca7d068 upstream. The following test program from Dmitry can cause softlockups or RCU stalls as it copies 1GB from tmpfs into eventfd and we don't have any scheduling point at that path in sendfile(2) implementation:

    int r1 = eventfd(0, 0);
    int r2 = memfd_create("", 0);
    unsigned long n = 1<<30;
    fallocate(r2, 0, 0, n);
    sendfile(r1, r2, 0, n);

Add cond_resched() into __splice_from_pipe() to fix the problem. CC: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-03lockd: create NSM handles per net namespaceAndrey Ryabinin6-19/+30
commit 0ad95472bf169a3501991f8f33f5147f792a8116 upstream. Commit cb7323fffa85 ("lockd: create and use per-net NSM RPC clients on MON/UNMON requests") introduced per-net NSM RPC clients. Unfortunately this doesn't make any sense without a per-net nsm_handle. E.g. the following scenario could happen: two hosts (X and Y) in different namespaces (A and B) share the same nsm struct.

    1. nsm_monitor(host_X) called => NSM rpc client created,
       nsm->sm_monitored bit set.
    2. nsm_monitor(host_Y) called => nsm->sm_monitored already set,
       we just exit. Thus in namespace B ln->nsm_clnt == NULL.
    3. host X destroyed => nsm->sm_count decremented to 1
    4. host Y destroyed => nsm_unmonitor() => nsm_mon_unmon()
       => NULL-ptr dereference of *ln->nsm_clnt

So this can be fixed by making the nsm_handles list per-net instead of global. Thus different net namespaces will not be able to share the same nsm_handle. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-03Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mountOlga Kornievskaia1-1/+1
commit a41cbe86df3afbc82311a1640e20858c0cd7e065 upstream. A test case is as the description says:

    open(foobar, O_WRONLY);
    sleep()   --> reboot the server
    close(foobar)

The bug is that in nfs4state.c, in nfs4_reclaim_open_state(), a few lines before going to restart, there is clear_bit(NFS4CLNT_RECLAIM_NOGRACE, &state->flags). NFS4CLNT_RECLAIM_NOGRACE is a flag for the client states, not open owner states. The value of NFS4CLNT_RECLAIM_NOGRACE is 4, which is the value of NFS_O_WRONLY_STATE in nfs4_state->flags. So clearing it wipes out state, and when we go to close it, “call_close” doesn’t get set as the state flag is not set and the CLOSE doesn’t go on the wire. Signed-off-by: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-03splice: sendfile() at once fails for big filesChristophe Leroy1-1/+11
commit 0ff28d9f4674d781e492bcff6f32f0fe48cf0fed upstream. Using sendfile with the small program below to get MD5 sums of some files, it appears that big files (over 64kbytes with a 4k pages system) get a wrong MD5 sum while small files get the correct sum. This program uses sendfile() to send a file to an AF_ALG socket for hashing.

    /* md5sum2.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <string.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <linux/if_alg.h>

    int main(int argc, char **argv)
    {
            int sk = socket(AF_ALG, SOCK_SEQPACKET, 0);
            struct stat st;
            struct sockaddr_alg sa = {
                    .salg_family = AF_ALG,
                    .salg_type = "hash",
                    .salg_name = "md5",
            };
            int n;

            bind(sk, (struct sockaddr*)&sa, sizeof(sa));
            for (n = 1; n < argc; n++) {
                    int size;
                    int offset = 0;
                    char buf[4096];
                    int fd;
                    int sko;
                    int i;

                    fd = open(argv[n], O_RDONLY);
                    sko = accept(sk, NULL, 0);
                    fstat(fd, &st);
                    size = st.st_size;
                    sendfile(sko, fd, &offset, size);
                    size = read(sko, buf, sizeof(buf));
                    for (i = 0; i < size; i++)
                            printf("%2.2x", buf[i]);
                    printf(" %s\n", argv[n]);
                    close(fd);
                    close(sko);
            }
            exit(0);
    }

Test below is done using official linux patch files. First result is with a software based md5sum. Second result is with the program above.

    root@vgoip:~# ls -l patch-3.6.*
    -rw-r--r-- 1 root root 64011 Aug 24 12:01 patch-3.6.2.gz
    -rw-r--r-- 1 root root 94131 Aug 24 12:01 patch-3.6.3.gz
    root@vgoip:~# md5sum patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz
    root@vgoip:~# ./md5sum2 patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    5fd77b24e68bb24dcc72d6e57c64790e patch-3.6.3.gz

After investigation, it appears that sendfile() sends the files by blocks of 64kbytes (16 times PAGE_SIZE). The problem is that at the end of each block, the SPLICE_F_MORE flag is missing, therefore the hashing operation is reset as if it was the end of the file. This patch adds SPLICE_F_MORE to the flags when more data is pending. With the patch applied, we get the correct sums:

    root@vgoip:~# md5sum patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz
    root@vgoip:~# ./md5sum2 patch-3.6.*
    b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
    c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Jens Axboe <axboe@fb.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-03-02proc: Fix ptrace-based permission checks for accessing task mapsCorey Wright2-3/+3
Modify mm_access() calls in fs/proc/task_mmu.c and fs/proc/task_nommu.c to have the mode include PTRACE_MODE_FSCREDS so accessing /proc/pid/maps and /proc/pid/pagemap is not denied to all users. In backporting upstream commit caaee623 to pre-3.18 kernel versions it was overlooked that mm_access() is used in fs/proc/task_*mmu.c as those calls were removed in 3.18 (by upstream commit 29a40ace) and did not exist at the time of the original commit. Fixes: caaee6234d ("ptrace: use fsuid, fsgid, effective creds for fs access checks") Signed-off-by: Corey Wright <undefined@pobox.com> Acked-by: Jann Horn <jann@thejh.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-25xfs: inode recovery readahead can race with inode buffer creationDave Chinner2-5/+14
commit b79f4a1c68bb99152d0785ee4ea3ab4396cdacc6 upstream. When we do inode readahead in log recovery, we do can do the readahead before we've replayed the icreate transaction that stamps the buffer with inode cores. The inode readahead verifier catches this and marks the buffer as !done to indicate that it doesn't yet contain valid inodes. In adding buffer error notification (i.e. setting b_error = -EIO at the same time as as we clear the done flag) to such a readahead verifier failure, we can then get subsequent inode recovery failing with this error: XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32 This occurs when readahead completion races with icreate item replay such as: inode readahead find buffer lock buffer submit RA io .... icreate recovery xfs_trans_get_buffer find buffer lock buffer <blocks on RA completion> ..... <ra completion> fails verifier clear XBF_DONE set bp->b_error = -EIO release and unlock buffer <icreate gains lock> icreate initialises buffer marks buffer as done adds buffer to delayed write queue releases buffer At this point, we have an initialised inode buffer that is up to date but has an -EIO state registered against it. When we finally get to recovering an inode in that buffer: inode item recovery xfs_trans_read_buffer find buffer lock buffer sees XBF_DONE is set, returns buffer sees bp->b_error is set fail log recovery! Essentially, we need xfs_trans_get_buf_map() to clear the error status of the buffer when doing a lookup. This function returns uninitialised buffers, so the buffer returned can not be in an error state and none of the code that uses this function expects b_error to be set on return. Indeed, there is an ASSERT(!bp->b_error); in the transaction case in xfs_trans_get_buf_map() that would have caught this if log recovery used transactions.... This patch firstly changes the inode readahead failure to set -EIO on the buffer, and secondly changes xfs_buf_get_map() to never return a buffer with an error state set so this first change doesn't cause unexpected log recovery failures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-25libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correctDarrick J. Wong1-1/+1
commit 96f859d52bcb1c6ea6f3388d39862bf7143e2f30 upstream. Because struct xfs_agfl is 36 bytes long and has a 64-bit integer inside it, gcc will quietly round the structure size up to the nearest 64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE macro returning incorrect results for v5 filesystems on 64-bit machines (118 items instead of 119). As a result, a 32-bit xfs_repair will see garbage in AGFL item 119 and complain. Therefore, tell gcc not to pad the structure so that the AGFL size calculation is correct. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24fuse: break infinite loop in fuse_fill_write_pages()Roman Gushchin1-1/+1
commit 3ca8138f014a913f98e6ef40e939868e1e9ea876 upstream. I got a report about an unkillable task eating CPU. Further investigation shows that the problem is in the fuse_fill_write_pages() function. If the iov's first segment has zero length, we get an infinite loop, because we never reach the iov_iter_advance() call. Fix this by calling iov_iter_advance() before repeating an attempt to copy data from userspace. A similar problem is described in 124d3b7041f ("fix writev regression: pan hanging unkillable and un-straceable"). If a zero-length segment is followed by a segment with an invalid address, iov_iter_fault_in_readable() checks only the first segment (zero-length), iov_iter_copy_from_user_atomic() skips it, fails at the second and returns zero -> goto again without skipping the zero-length segment. The patch calls iov_iter_advance() before the goto again: we'll skip the zero-length segment at the second iteration and iov_iter_fault_in_readable() will detect the invalid address. Special thanks to Konstantin Khlebnikov, who helped a lot with the commit description. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Maxim Patlasov <mpatlasov@parallels.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: ea9b9907b82a ("fuse: implement perform_write") Signed-off-by: Jiri Slaby <jslaby@suse.cz>
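The trigger described above can be written as a tiny userspace program. This is only a sketch based on the description (the hang only occurred for files on a fuse mount, and the path below is made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
            char page[4096];
            struct iovec iov[2] = {
                    { .iov_base = page,      .iov_len = 0    },  /* zero-length first segment */
                    { .iov_base = (void *)1, .iov_len = 4096 },  /* bogus address */
            };
            int fd = open("/mnt/fuse/testfile", O_WRONLY | O_CREAT, 0600);

            if (fd < 0)
                    return 1;
            if (writev(fd, iov, 2) < 0)
                    perror("writev");       /* with the fix: fails with EFAULT */
            close(fd);
            return 0;
    }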
2016-02-24udf: Check output buffer length when converting name to CS0Andrew Gabbasov1-2/+13
commit bb00c898ad1ce40c4bb422a8207ae562e9aea7ae upstream. If a name contains at least some characters with Unicode values exceeding single byte, the CS0 output should have 2 bytes per character. And if other input characters have single byte Unicode values, then the single input byte is converted to 2 output bytes, and the length of output becomes larger than the length of input. And if the input name is long enough, the output length may exceed the allocated buffer length. All this means that conversion from UTF8 or NLS to CS0 requires checking of output length in order to stop when it exceeds the given output buffer size. [JK: Make code return -ENAMETOOLONG instead of silently truncating the name] Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
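The shape of the problem and of the check can be seen in a small standalone function. This is only a sketch of the bounds-checking idea (ASCII input assumed), not the kernel's UDF conversion code:

    #include <errno.h>
    #include <stddef.h>

    /* Widen an 8-bit name to a 2-bytes-per-character output, refusing to
     * write past 'outlen'.  The real fix makes the conversion fail with
     * -ENAMETOOLONG instead of silently truncating or overrunning. */
    static int widen_name(const unsigned char *in, size_t inlen,
                          unsigned char *out, size_t outlen)
    {
            size_t o = 0;

            for (size_t i = 0; i < inlen; i++) {
                    if (outlen - o < 2)             /* room for both bytes? */
                            return -ENAMETOOLONG;
                    out[o++] = 0;                   /* high byte */
                    out[o++] = in[i];               /* low byte */
            }
            return (int)o;                          /* bytes written */
    }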
2016-02-24udf: Prevent buffer overrun with multi-byte charactersAndrew Gabbasov1-1/+5
commit ad402b265ecf6fa22d04043b41444cdfcdf4f52d upstream. The udf_CS0toUTF8 function stops the conversion when the output buffer length reaches UDF_NAME_LEN-2, which is the correct maximum name length, but the check leaves space for only a single byte, while multi-byte output characters can take more space, causing a buffer overflow. A similar error exists in the udf_CS0toNLS function, which restricts the output length to UDF_NAME_LEN, while the actual maximum allowed length is UDF_NAME_LEN-2. In these cases the output can overwrite not only the current buffer length field, corrupting the name buffer itself, but also the following allocation structures, causing a kernel crash. Adjust the output length checks in both functions to prevent buffer overruns in case of multi-byte UTF8 or NLS characters. Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24udf: limit the maximum number of indirect extents in a rowVegard Nossum1-0/+15
commit b0918d9f476a8434b055e362b83fa4fd1d462c3f upstream. udf_next_aext() just follows extent pointers while extents are marked as indirect. This can loop forever on a corrupted filesystem. Limit the number of indirect extents we are willing to follow in a row. [JK: Updated changelog, limit, style] Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Jan Kara <jack@suse.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
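The idea of the fix is simply a bounded walk. A self-contained toy version (the structure and the limit are made up for illustration; the real udf code and limit differ) looks like this:

    #include <errno.h>
    #include <stddef.h>

    struct toy_extent {
            int indirect;                   /* nonzero: points at another extent */
            struct toy_extent *next;
    };

    #define TOY_MAX_INDIRECT_EXTENTS 30     /* arbitrary cap for the sketch */

    /* Follow indirect extents, but refuse to loop forever on a corrupted chain. */
    static struct toy_extent *follow_extents(struct toy_extent *e, int *err)
    {
            int hops = 0;

            *err = 0;
            while (e && e->indirect) {
                    if (++hops > TOY_MAX_INDIRECT_EXTENTS) {
                            *err = -EIO;    /* corrupted filesystem */
                            return NULL;
                    }
                    e = e->next;
            }
            return e;
    }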
2016-02-24nfs: Fix race in __update_open_stateid()Andrew Elble1-1/+1
commit 361cad3c89070aeb37560860ea8bfc092d545adc upstream. We've seen this in a packet capture - I've intermixed what I think was going on. The fix here is to grab the so_lock sooner. 1964379 -> #1 open (for write) reply seqid=1 1964393 -> #2 open (for read) reply seqid=2 __nfs4_close(), state->n_wronly-- nfs4_state_set_mode_locked(), changes state->state = [R] state->flags is [RW] state->state is [R], state->n_wronly == 0, state->n_rdonly == 1 1964398 -> #3 open (for write) call -> because close is already running 1964399 -> downgrade (to read) call seqid=2 (close of #1) 1964402 -> #3 open (for write) reply seqid=3 __update_open_stateid() nfs_set_open_stateid_locked(), changes state->flags state->flags is [RW] state->state is [R], state->n_wronly == 0, state->n_rdonly == 1 new sequence number is exposed now via nfs4_stateid_copy() next step would be update_open_stateflags(), pending so_lock 1964403 -> downgrade reply seqid=2, fails with OLD_STATEID (close of #1) nfs4_close_prepare() gets so_lock and recalcs flags -> send close 1964405 -> downgrade (to read) call seqid=3 (close of #1 retry) __update_open_stateid() gets so_lock * update_open_stateflags() updates state->n_wronly. nfs4_state_set_mode_locked() updates state->state state->flags is [RW] state->state is [RW], state->n_wronly == 1, state->n_rdonly == 1 * should have suppressed the preceding nfs4_close_prepare() from sending open_downgrade 1964406 -> write call 1964408 -> downgrade (to read) reply seqid=4 (close of #1 retry) nfs_clear_open_stateid_locked() state->flags is [R] state->state is [RW], state->n_wronly == 1, state->n_rdonly == 1 1964409 -> write reply (fails, openmode) Signed-off-by: Andrew Elble <aweits@rit.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24cifs: fix erroneous return valueAnton Protopopov1-1/+1
commit 4b550af519854421dfec9f7732cdddeb057134b2 upstream. The setup_ntlmv2_rsp() function may return the positive value ENOMEM instead of -ENOMEM in case of a kmalloc failure. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
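The convention being restored here is the usual kernel one: failure paths return a negative errno, because callers only test for rc < 0. A tiny standalone illustration (toy names, not the cifs code):

    #include <errno.h>
    #include <stdlib.h>

    struct resp { void *blob; };            /* toy stand-in for the real response */

    static int setup_resp(struct resp *r, size_t len)
    {
            r->blob = malloc(len);
            if (!r->blob)
                    return -ENOMEM;         /* not ENOMEM: a positive value would
                                             * look like success to the caller */
            return 0;
    }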
2016-02-24cifs_dbg() outputs an uninitialized buffer in cifs_readdir()Vasily Averin1-0/+1
commit 01b9b0b28626db4a47d7f48744d70abca9914ef1 upstream. In some cases tmp_buf may not be filled in cifs_filldir and stays uninitialized, therefore printing it with the "%s" modifier can leak content of kernelspace memory. If the old content of this buffer does not contain '\0', the access beyond the end of the allocated object can crash the host. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Steve French <steve.french@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24ptrace: use fsuid, fsgid, effective creds for fs access checksJann Horn3-13/+13
commit caaee6234d05a58c5b4d05e7bf766131b810a657 upstream. By checking the effective credentials instead of the real UID / permitted capabilities, ensure that the calling process actually intended to use its credentials. To ensure that all ptrace checks use the correct caller credentials (e.g. in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS flag), use two new flags and require one of them to be set. The problem was that when a privileged task had temporarily dropped its privileges, e.g. by calling setreuid(0, user_uid), with the intent to perform following syscalls with the credentials of a user, it still passed ptrace access checks that the user would not be able to pass. While an attacker should not be able to convince the privileged task to perform a ptrace() syscall, this is a problem because the ptrace access check is reused for things in procfs. In particular, the following somewhat interesting procfs entries only rely on ptrace access checks: /proc/$pid/stat - uses the check for determining whether pointers should be visible, useful for bypassing ASLR /proc/$pid/maps - also useful for bypassing ASLR /proc/$pid/cwd - useful for gaining access to restricted directories that contain files with lax permissions, e.g. in this scenario: lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar drwx------ root root /root drwxr-xr-x root root /root/foobar -rw-r--r-- root root /root/foobar/secret Therefore, on a system where a root-owned mode 6755 binary changes its effective credentials as described and then dumps a user-specified file, this could be used by an attacker to reveal the memory layout of root's processes or reveal the contents of files he is not allowed to access (through /proc/$pid/cwd). [akpm@linux-foundation.org: fix warning] Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctlFilipe Manana1-4/+6
commit 0c0fe3b0fa45082cd752553fdb3a4b42503a118e upstream. While doing some tests I ran into an hang on an extent buffer's rwlock that produced the following trace: [39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166] [39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165] [39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800016] irq event stamp: 0 [39389.800016] hardirqs last enabled at (0): [< (null)>] (null) [39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800016] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800016] softirqs last disabled at (0): [< (null)>] (null) [39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000 [39389.800016] RIP: 0010:[<ffffffff810902af>] [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158 [39389.800016] RSP: 0018:ffff8800a185fb80 EFLAGS: 00000202 [39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101 [39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001 [39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000 [39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98 [39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40 [39389.800016] FS: 00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000 [39389.800016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800016] Stack: [39389.800016] ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0 [39389.800016] ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895 [39389.800016] ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c [39389.800016] Call Trace: [39389.800016] [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60 [39389.800016] [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41 [39389.800016] [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44 [39389.800016] [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs] [39389.800016] [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs] [39389.800016] [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs] [39389.800016] [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs] [39389.800016] [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs] [39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800016] [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15 [39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800016] [<ffffffff8118b3d4>] ? 
rcu_read_unlock+0x3e/0x5d [39389.800016] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [39389.800016] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [39389.800016] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [39389.800016] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8 [39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800012] irq event stamp: 0 [39389.800012] hardirqs last enabled at (0): [< (null)>] (null) [39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800012] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800012] softirqs last disabled at (0): [< (null)>] (null) [39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G L 4.4.0-rc6-btrfs-next-18+ #1 [39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000 [39389.800012] RIP: 0010:[<ffffffff81091e8d>] [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72 [39389.800012] RSP: 0018:ffff880034a639f0 EFLAGS: 00000206 [39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000 [39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c [39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000 [39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98 [39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00 [39389.800012] FS: 00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000 [39389.800012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800012] Stack: [39389.800012] ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98 [39389.800012] ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00 [39389.800012] ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58 [39389.800012] Call Trace: [39389.800012] [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c [39389.800012] [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41 [39389.800012] [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs] [39389.800012] [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs] [39389.800012] [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs] [39389.800012] [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs] [39389.800012] [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28 [39389.800012] [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs] [39389.800012] [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf [39389.800012] [<ffffffff8148482b>] ? 
mutex_lock_nested+0x397/0x3bc [39389.800012] [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs] [39389.800012] [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs] [39389.800012] [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44 [39389.800012] [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs] [39389.800012] [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs] [39389.800012] [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs] [39389.800012] [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs] [39389.800012] [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs] [39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7 [39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7 [39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800012] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [39389.800012] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [39389.800012] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [39389.800012] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [39389.800012] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00 This happens because in the code path executed by the inode_paths ioctl we end up nesting two calls to read lock a leaf's rwlock when after the first call to read_lock() and before the second call to read_lock(), another task (running the delayed items as part of a transaction commit) has already called write_lock() against the leaf's rwlock. This situation is illustrated by the following diagram: Task A Task B btrfs_ref_to_path() btrfs_commit_transaction() read_lock(&eb->lock); btrfs_run_delayed_items() __btrfs_commit_inode_delayed_items() __btrfs_update_delayed_inode() btrfs_lookup_inode() write_lock(&eb->lock); --> task waits for lock read_lock(&eb->lock); --> makes this task hang forever (and task B too of course) So fix this by avoiding doing the nested read lock, which is easily avoidable. This issue does not happen if task B calls write_lock() after task A does the second call to read_lock(), however there does not seem to exist anything in the documentation that mentions what is the expected behaviour for recursive locking of rwlocks (leaving the idea that doing so is not a good usage of rwlocks). Also, as a side effect necessary for this fix, make sure we do not needlessly read lock extent buffers when the input path has skip_locking set (used when called from send). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24btrfs: properly set the termination value of ctx->pos in readdirDavid Sterba3-3/+16
commit bc4ef7592f657ae81b017207a1098817126ad4cb upstream. The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [<ffffffff81742f0f>] dump_stack+0x4c/0x7f [<ffffffff811cb706>] report_size_overflow+0x36/0x40 [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0 [<ffffffff811dafc8>] iterate_dir+0xa8/0x150 [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70 [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0 Overflow: 1a [<ffffffff811db070>] ? iterate_dir+0x150/0x150 [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 Reported-by: Victor <services@swwu.com> Signed-off-by: David Sterba <dsterba@suse.com> Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24ext4: fix potential integer overflowInsu Yun1-1/+1
commit 46901760b46064964b41015d00c140c83aa05bcf upstream. Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data), an integer overflow could happen. Therefore, the integer overflow sanitization needs to be fixed. Signed-off-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
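The general pattern for this kind of sanitization is to bound the element count against the size of the element actually being allocated before multiplying. A minimal generic sketch (not the ext4 resize code):

    #include <stdint.h>
    #include <stdlib.h>

    struct item { uint64_t a, b, c; };      /* toy element */

    static struct item *alloc_items(size_t n)
    {
            if (n > SIZE_MAX / sizeof(struct item))
                    return NULL;            /* n * sizeof(struct item) would wrap */
            return malloc(n * sizeof(struct item));
    }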
2016-02-24AIO: properly check iovec sizesGreg Kroah-Hartman1-2/+7
In Linus's tree, the iovec code has been reworked massively, but in older kernels the AIO layer should be checking this before passing the request on to other layers. Many thanks to Ben Hawkes of Google Project Zero for pointing out the issue. Reported-by: Ben Hawkes <hawkes@google.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
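A sketch of the kind of validation meant here, checking that the per-segment lengths and their running total stay within a signed byte count; this is a generic userspace illustration, not the aio.c change itself:

    #include <limits.h>
    #include <stddef.h>
    #include <sys/uio.h>

    /* Return 0 if the iovec array describes a sane, non-overflowing total. */
    static int check_iovecs(const struct iovec *iov, unsigned int nr)
    {
            size_t total = 0;
            unsigned int i;

            for (i = 0; i < nr; i++) {
                    if (iov[i].iov_len > (size_t)SSIZE_MAX - total)
                            return -1;      /* would overflow the signed total */
                    total += iov[i].iov_len;
            }
            return 0;
    }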
2016-02-24pty: make sure super_block is still valid in final /dev/tty closeHerton R. Krzesinski1-0/+20
commit 1f55c718c290616889c04946864a13ef30f64929 upstream. Considering current pty code and multiple devpts instances, it's possible to umount a devpts file system while a program still has /dev/tty open, pointing to a previously closed pty pair in that instance. Once all ptmx and pts/N files are closed, the umount can be done. If the program closes /dev/tty after the umount is done, devpts_kill_index will now use an invalid super_block, which was already destroyed in the umount operation after running ->kill_sb. This is another "use after free" type of issue, but now related to the allocated super_block instance. To avoid the problem (warning at ida_remove and potential crashes) for this specific case, I added two functions in devpts which grab additional references to the super_block, which the pty code now uses, so it makes sure the super block structure is still valid until pty shutdown is done. I also moved the additional inode references to the same functions, which also covers a similar case with the inode being freed before the final /dev/tty close/shutdown. Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Reviewed-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24binfmt_elf: Don't clobber passed executable's file headerMaciej W. Rozycki1-5/+5
commit b582ef5c53040c5feef4c96a8f9585b6831e2441 upstream. Do not clobber the buffer space passed from `search_binary_handler' and originally preloaded by `prepare_binprm' with the executable's file header by overwriting it with its interpreter's file header. Instead keep the buffer space intact and directly use the data structure locally allocated for the interpreter's file header, fixing a bug introduced in 2.1.14 with loadable module support (linux-mips.org commit beb11695 [Import of Linux/MIPS 2.1.14], predating kernel.org repo's history). Adjust the amount of data read from the interpreter's file accordingly. This was not an issue before loadable module support, because back then `load_elf_binary' was executed only once for a given ELF executable, whether the function succeeded or failed. With loadable module support supported and enabled, upon a failure of `load_elf_binary' -- which may for example be caused by architecture code rejecting an executable due to a missing hardware feature requested in the file header -- a module load is attempted and then the function reexecuted by `search_binary_handler'. With the executable's file header replaced with its interpreter's file header the executable can then be erroneously accepted in this subsequent attempt. Signed-off-by: Maciej W. Rozycki <macro@imgtec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24FS-Cache: Don't override netfs's primary_index if registering failedKinglong Mee1-17/+16
commit b130ed5998e62879a66bad08931a2b5e832da95c upstream. Only override netfs->primary_index when registration succeeds. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-24FS-Cache: Increase reference of parent after registering, netfs successKinglong Mee1-5/+4
commit 86108c2e34a26e4bec3c6ddb23390bf8cedcf391 upstream. If the netfs already exists, fscache should not increase the parent's usage and n_children references, since otherwise they are never decreased. v2: following David's suggestion, move increasing the parent's references to the success path and use kmem_cache_free() to free primary_index directly. v3: don't move "netfs->primary_index->parent = &fscache_fsdef_index;" Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-23ext4: Fix handling of extended tv_secDavid Turner1-7/+44
commit a4dad1ae24f850410c4e60f22823cba1289b8d52 upstream. In ext4, the bottom two bits of {a,c,m}time_extra are used to extend the {a,c,m}time fields, deferring the year 2038 problem to the year 2446. When decoding these extended fields, for times whose bottom 32 bits would represent a negative number, sign extension causes the 64-bit extended timestamp to be negative as well, which is not what's intended. This patch corrects that issue, so that the only negative {a,c,m}times are those between 1901 and 1970 (as per 32-bit signed timestamps). Some older kernels might have written pre-1970 dates with 1,1 in the extra bits. This patch treats those incorrectly-encoded dates as pre-1970, instead of post-2311, until kernel 4.20 is released. Hopefully by then e2fsck will have fixed up the bad data. Also add a comment explaining the encoding of ext4's extra {a,c,m}time bits. Signed-off-by: David Turner <novalis@novalis.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Mark Harris <mh8928@yahoo.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=23732 Signed-off-by: Jiri Slaby <jslaby@suse.cz>
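The encoding described above can be played with in plain userspace C. This worked example follows the commit description (the epoch bits extend the signed 32-bit seconds, and the legacy "epoch bits 1,1 with a pre-1970 value" case is treated as pre-1970); it is an illustration, not code copied from fs/ext4:

    #include <stdint.h>
    #include <stdio.h>

    #define EPOCH_MASK 0x3u                 /* low two bits of the *_extra field */

    static int64_t decode_seconds(int32_t raw, uint32_t extra)
    {
            uint64_t epoch = extra & EPOCH_MASK;

            if (epoch == 3 && raw < 0)      /* legacy mis-encoded pre-1970 date */
                    epoch = 0;
            return (int64_t)raw + ((int64_t)epoch << 32);
    }

    int main(void)
    {
            printf("%lld\n", (long long)decode_seconds(-1, 0));        /* 1969-12-31 23:59:59 */
            printf("%lld\n", (long long)decode_seconds(INT32_MAX, 0)); /* 2038-01-19 */
            printf("%lld\n", (long long)decode_seconds(INT32_MIN, 1)); /* first second past 2038 */
            printf("%lld\n", (long long)decode_seconds(INT32_MAX, 3)); /* roughly the year 2446 */
            return 0;
    }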
2016-02-15fix sysvfs symlinksAl Viro1-8/+2
commit 0ebf7f10d67a70e120f365018f1c5fce9ddc567d upstream. The thing got broken back in 2002 - sysvfs does *not* have inline symlinks; even short ones have their bodies stored in the first block of the file. sysv_symlink() handles that correctly; unfortunately, attempting to look an existing symlink up will end up mistaking it for an inline symlink and interpreting the block number containing the body as the body itself. Nobody has noticed until now, which says something about the level of testing sysvfs gets ;-/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-15fix calculation of meta_bg descriptor backupsAndy Leiserson1-2/+2
commit 904dad4742d211b7a8910e92695c0fa957483836 upstream. "group" is the group where the backup will be placed, and is initialized to zero in the declaration. This meant that backups for meta_bg descriptors were erroneously written to the backup block group descriptors in groups 1 and (desc_per_block-1). Reproduction information:

    mke2fs -Fq -t ext4 -b 1024 -O ^resize_inode /tmp/foo.img 16G
    truncate -s 24G /tmp/foo.img
    losetup /dev/loop0 /tmp/foo.img
    mount /dev/loop0 /mnt
    resize2fs /dev/loop0
    umount /dev/loop0
    dd if=/dev/zero of=/dev/loop0 bs=1024 count=2
    e2fsck -fy /dev/loop0
    losetup -d /dev/loop0

Signed-off-by: Andy Leiserson <andy@leiserson.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-15jbd2: Fix unreclaimed pages after truncate in data=journal modeJan Kara1-0/+2
commit bc23f0c8d7ccd8d924c4e70ce311288cb3e61ea8 upstream. Ted and Namjae have reported that truncated pages don't get timely reclaimed after being truncated in data=journal mode. The following test triggers the issue easily:

    for (i = 0; i < 1000; i++) {
            pwrite(fd, buf, 1024*1024, 0);
            fsync(fd);
            fsync(fd);
            ftruncate(fd, 0);
    }

The reason is that journal_unmap_buffer() finds that truncated buffers are not journalled (jh->b_transaction == NULL), they are part of checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have been already written out (!buffer_dirty(bh)). We clean such buffers but we leave them in the checkpoint list. Since checkpoint transaction holds a reference to the journal head, these buffers cannot be released until the checkpoint transaction is cleaned up. And at that point we don't call release_buffer_page() anymore so pages detached from mapping are lingering in the system waiting for reclaim to find them and free them. Fix the problem by removing buffers from transaction checkpoint lists when journal_unmap_buffer() finds out they don't have to be there anymore. Reported-and-tested-by: Namjae Jeon <namjae.jeon@samsung.com> Fixes: de1b794130b130e77ffa975bb58cb843744f9ae5 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-15ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanupxuejiufei1-0/+2
commit c95a51807b730e4681e2ecbdfd669ca52601959e upstream. When the recovery master goes down, dlm_do_local_recovery_cleanup() only removes the $RECOVERY lock owned by the dead node, but does not clear the refmap bit. This leaves the umount thread stuck in a dead loop, migrating $RECOVERY to the dead node. Signed-off-by: xuejiufei <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-15ocfs2/dlm: ignore cleaning the migration mle that is inusexuejiufei1-11/+15
commit bef5502de074b6f6fa647b94b73155d675694420 upstream. We have found that the migration source triggers a BUG because the refcount of the mle is already zero before the put, when the target goes down during migration. The situation is as follows:

    dlm_migrate_lockres
      dlm_add_migration_mle
      dlm_mark_lockres_migrating
      dlm_get_mle_inuse
        <<<<<< Now the refcount of the mle is 2.
      dlm_send_one_lockres and wait for the target to become the new master.
        <<<<<< o2hb detects that the target is down and cleans the migration
               mle. Now the refcount is 1.

dlm_migrate_lockres is woken, and puts the mle twice when it finds that the target has gone down, which triggers the BUG with the following message: "ERROR: bad mle: ". Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-15fat: fix fake_offset handling on error pathOGAWA Hirofumi1-5/+11
commit 928a477102c4fc6739883415b66987207e3502f4 upstream. For the root directory, . and .. are faked (using dir_emit_dots()) and ctx->pos is reset from 2 to 0. A corrupted root directory could cause fat_get_entry() to fail, but ->iterate() (fat_readdir()) reports progress to the VFS (with ctx->pos rewound to 0), so any following calls to ->iterate() continue to return the same entries again and again. The result is that userspace will never see the end of the directory, causing e.g. 'ls' to hang in a getdents() loop. [hirofumi@mail.parknet.co.jp: cleanup and make sure to correct fake_offset] Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Richard Weinberger <richard.weinberger@gmail.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-15proc: actually make proc_fd_permission() thread-friendlyOleg Nesterov1-3/+11
commit 54708d2858e79a2bdda10bf8a20c80eb96c20613 upstream. The commit 96d0df79f264 ("proc: make proc_fd_permission() thread-friendly") fixed the access to /proc/self/fd from sub-threads, but introduced another problem: a sub-thread can't access /proc/<tid>/fd/ or /proc/thread-self/fd if generic_permission() fails. Change proc_fd_permission() to check same_thread_group(pid_task(), current). Fixes: 96d0df79f264 ("proc: make proc_fd_permission() thread-friendly") Reported-by: "Jin, Yihua" <yihua.jin@intel.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-15Revert "ocfs2: fix umask ignored issue"Jiri Slaby1-2/+0
This reverts commit e1f20b83cc2d70aafb060e302564f968d2da9d43, upstream commit 8f1eb48758aacf6c1ffce18179295adbf3bd7640. This commit fixes 702e5bc ("ocfs2: use generic posix ACL infrastructure"), which is only in 3.14. So this commit should have never been applied to 3.12 and it can cause sgid inheritance issues. Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-02-15pipe: Fix buffer offset after partially failed readBen Hutchings1-1/+4
Quoting the RHEL advisory:

> It was found that the fix for CVE-2015-1805 incorrectly kept buffer
> offset and buffer length in sync on a failed atomic read, potentially
> resulting in a pipe buffer state corruption. A local, unprivileged user
> could use this flaw to crash the system or leak kernel memory to user
> space. (CVE-2016-0774, Moderate)

The same flawed fix was applied to stable branches from 2.6.32.y to 3.14.y inclusive, and I was able to reproduce the issue on 3.2.y. We need to give pipe_iov_copy_to_user() a separate offset variable and only update the buffer offset if it succeeds. References: https://rhn.redhat.com/errata/RHSA-2016-0103.html Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
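The fix follows a common pattern: compute into locals and commit shared state only after the fallible copy has fully succeeded. A self-contained toy illustration (field and function names invented for the sketch, not the pipe code):

    #include <errno.h>
    #include <stddef.h>
    #include <string.h>

    struct toy_buf {
            const char *data;
            size_t offset;
            size_t len;
    };

    /* Stand-in for a copy that may fail partway (think copy_to_user()). */
    static int copy_out(char *dst, const char *src, size_t n)
    {
            memcpy(dst, src, n);
            return 0;
    }

    static int drain(struct toy_buf *b, char *dst, size_t want)
    {
            size_t off = b->offset;                 /* work on a local copy */
            size_t n = want < b->len ? want : b->len;

            if (copy_out(dst, b->data + off, n))
                    return -EFAULT;                 /* offset/len left untouched */
            b->offset = off + n;                    /* commit only on success */
            b->len -= n;
            return (int)n;
    }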
2016-02-12dcache: use IS_ROOT to decide where dentry is hashedJ. Bruce Fields1-1/+6
commit 7632e465feb182cadc3c9aa1282a057201818a8c upstream. Every hashed dentry is either hashed in the dentry_hashtable, or a superblock's s_anon list. __d_drop() assumes it can determine which is the case by checking DCACHE_DISCONNECTED; this is not true. It is true that when DCACHE_DISCONNECTED is cleared, the dentry is not only hashed on dentry_hashtable, but is fully connected to its parents back to the root. But the converse is *not* true: fs/exportfs/expfs.c:reconnect_path() attempts to connect a directory (found by filehandle lookup) back to root by ascending to parents and performing lookups one at a time. It does not clear DCACHE_DISCONNECTED until it's done, and that is not at all an atomic process. In particular, it is possible for DCACHE_DISCONNECTED to be set on a dentry which is hashed on the dentry_hashtable. Instead, use IS_ROOT() to check which hash chain a dentry is on. This *does* work: Dentries are hashed only by: - d_obtain_alias, which adds an IS_ROOT() dentry to sb_anon. - __d_rehash, called by _d_rehash: hashes to the dentry's parent, and all callers of _d_rehash appear to have d_parent set to a "real" parent. - __d_rehash, called by __d_move: rehashes the moved dentry to hash chain determined by target, and assigns target's d_parent to its d_parent, before dropping the dentry's d_lock. Therefore I believe it's safe for a holder of a dentry's d_lock to assume that it is hashed on sb_anon if and only if IS_ROOT(dentry) is true. I believe the incorrect assumption about DCACHE_DISCONNECTED was originally introduced by ceb5bdc2d246 "fs: dcache per-bucket dcache hash locking". Also add a comment while we're here. Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-01-09dlm: make posix locks interruptibleEric Ren1-1/+1
commit a6b1533e9a57d76cd3d9b7649d29ac604b1874b8 upstream. Replace wait_event_killable with wait_event_interruptible so that a program waiting for a posix lock can be interrupted by a signal. With the killable version, a program was not interruptible by a signal if it had a signal handler set for it, overriding the default action of terminating the process. Signed-off-by: Eric Ren <zren@suse.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-01-05ocfs2: fix umask ignored issueJunxiao Bi1-0/+2
commit 8f1eb48758aacf6c1ffce18179295adbf3bd7640 upstream. New created file's mode is not masked with umask, and this makes umask not work for ocfs2 volume. Fixes: 702e5bc ("ocfs2: use generic posix ACL infrastructure") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-01-05nfs: if we have no valid attrs, then don't declare the attribute cache validJeff Layton1-1/+5
commit c812012f9ca7cf89c9e1a1cd512e6c3b5be04b85 upstream. If we pass in an empty nfs_fattr struct to nfs_update_inode, it will (correctly) not update any of the attributes, but it then clears the NFS_INO_INVALID_ATTR flag, which indicates that the attributes are up to date. Don't clear the flag if the fattr struct has no valid attrs to apply. Reviewed-by: Steve French <steve.french@primarydata.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-01-05nfs4: start callback_ident at idr 1Benjamin Coddington1-1/+1
commit c68a027c05709330fe5b2f50c50d5fa02124b5d8 upstream. If clp->cl_cb_ident is zero, then nfs_cb_idr_remove_locked() skips removing it when the nfs_client is freed. A decoding or server bug can then find and try to put that first nfs_client which would lead to a crash. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Fixes: d6870312659d ("nfs4client: convert to idr_alloc()") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-01-05ext4, jbd2: ensure entering into panic after recording an error in superblockDaeho Jeong2-3/+15
commit 4327ba52afd03fc4b5afa0ee1d774c9c5b0e85c5 upstream. If an EXT4 filesystem utilizes JBD2 journaling and an error occurs, the journaling will be aborted first and the error number will be recorded into the JBD2 superblock and, finally, the system will enter the panic state with the "errors=panic" option. But, in a rare case, this sequence is a little twisted like the figure below, and it can happen that the system enters the panic state, which means a system reset in a mobile environment, before the error has been recorded in the journal superblock. In this case, e2fsck cannot recognize that the filesystem failure occurred in the previous run and the corruption wouldn't be fixed.

    Task A                                   Task B
    ext4_handle_error()
    -> jbd2_journal_abort()
      -> __journal_abort_soft()
        -> __jbd2_journal_abort_hard()
        |  -> journal->j_flags |= JBD2_ABORT;
        |
        |                                    __ext4_abort()
        |                                    -> jbd2_journal_abort()
        |                                    |  -> __journal_abort_soft()
        |                                    |     -> if (journal->j_flags & JBD2_ABORT)
        |                                    |            return;
        |                                    -> panic()
        |
        -> jbd2_journal_update_sb_errno()

Tested-by: Hobin Woo <hobin.woo@samsung.com> Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2016-01-05ext4: fix potential use after free in __ext4_journal_stopLukas Czerner1-3/+3
commit 6934da9238da947628be83635e365df41064b09b upstream. There is a use-after-free possibility in __ext4_journal_stop() in the case that we free the handle in the first jbd2_journal_stop() because we're referencing handle->h_err afterwards. This was introduced in 9705acd63b125dee8b15c705216d7186daea4625 and it is wrong. Fix it by storing the handle->h_err value beforehand and avoid referencing potentially freed handle. Fixes: 9705acd63b125dee8b15c705216d7186daea4625 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
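The shape of the fix is the classic "read it before you free it" pattern. A standalone toy version (names invented, not the ext4 code):

    #include <stdlib.h>

    struct toy_handle { int h_err; };

    /* Stand-in for jbd2_journal_stop(): may free the handle. */
    static int toy_stop(struct toy_handle *h)
    {
            int rc = h->h_err;

            free(h);
            return rc;
    }

    static int stop_and_report(struct toy_handle *h)
    {
            int saved_err = h->h_err;       /* read the field first */
            int rc = toy_stop(h);           /* h may be freed in here */

            return saved_err ? saved_err : rc;      /* never touch h-> again */
    }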
2016-01-05Btrfs: fix race leading to BUG_ON when running delalloc for nodatacowFilipe Manana1-2/+8
commit 1d512cb77bdbda80f0dd0620a3b260d697fd581d upstream. If we are using the NO_HOLES feature, we have a tiny time window when running delalloc for a nodatacow inode where we can race with a concurrent link or xattr add operation leading to a BUG_ON. This happens because at run_delalloc_nocow() we end up casting a leaf item of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a file extent item (struct btrfs_file_extent_item) and then analyse its extent type field, which won't match any of the expected extent types (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an explicit BUG_ON(1). The following sequence diagram shows how the race happens when running a no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 (Note the implicit hole for inode 257 regarding the [0, 8K[ range) CPU 1 CPU 2 run_dealloc_nocow() btrfs_lookup_file_extent() --> searches for a key with value (257 EXTENT_DATA 4096) in the fs/subvol tree --> returns us a path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() --> releases the path hard link added to our inode, with key (257 INODE_REF 500) added to the end of leaf X, so leaf X now has N + 1 keys --> searches for the key (257 INODE_REF 256), because it was the last key in leaf X before it released the path, with path->keep_locks set to 1 --> ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in the leaf, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 500) the loop iteration of run_dealloc_nocow() does not break out the loop and continues because the key referenced in the path at path->nodes[0] and path->slots[0] is for inode 257, its type is < BTRFS_EXTENT_DATA_KEY and its offset (500) is less then our delalloc range's end (8192) the item pointed by the path, an inode reference item, is (incorrectly) interpreted as a file extent item and we get an invalid extent type, leading to the BUG_ON(1): if (extent_type == BTRFS_FILE_EXTENT_REG || extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) } else { BUG_ON(1) } The same can happen if a xattr is added concurrently and ends up having a key with an offset smaller then the delalloc's range end. So fix this by skipping keys with a type smaller than BTRFS_EXTENT_DATA_KEY. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>