summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2017-09-05media: get rid of removed DMX_GET_CAPS and DMX_SET_SOURCE leftoversMauro Carvalho Chehab1-2/+0
Those two ioctls were never used within the Kernel. Still, there used to have compat32 code there (and an if #0 block at the core). Get rid of them. Fixes: 286fe1ca3fa1 ("media: dmx.h: get rid of DMX_GET_CAPS") Fixes: 13adefbe9e56 ("media: dmx.h: get rid of DMX_SET_SOURCE") Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05ovl: don't allow writing ioctl on lower layerMiklos Szeredi4-9/+70
Problem with ioctl() is that it's a file operation, yet often used as an inode operation (i.e. modify the inode despite the file being opened for read-only). mnt_want_write_file() is used by filesystems in such cases to get write access on an arbitrary open file. Since overlayfs lets filesystems do all file operations, including ioctl, this can lead to mnt_want_write_file() returning OK for a lower file and modification of that lower file. This patch prevents modification by checking if the file is from an overlayfs lower layer and returning EPERM in that case. Need to introduce a mnt_want_write_file_path() variant that still does the old thing for inode operations that can do the copy up + modification correctly in such cases (fchown, fsetxattr, fremovexattr). This does not address the correctness of such ioctls on overlayfs (the correct way would be to copy up and attempt to perform ioctl on upper file). In theory this could be a regression. We very much hope that nobody is relying on such a hack in any sane setup. While this patch meddles in VFS code, it has no effect on non-overlayfs filesystems. Reported-by: "zhangyi (F)" <yi.zhang@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-05ovl: fix relatime for directoriesMiklos Szeredi2-4/+20
Need to treat non-regular overlayfs files the same as regular files when checking for an atime update. Add a d_real() flag to make it return the upper dentry for all file types. Reported-by: "zhangyi (F)" <yi.zhang@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-05cifs: Check for timeout on Negotiate stageSamuel Cabrero3-1/+26
Some servers seem to accept connections while booting but never send the SMBNegotiate response neither close the connection, causing all processes accessing the share hang on uninterruptible sleep state. This happens when the cifs_demultiplex_thread detects the server is unresponsive so releases the socket and start trying to reconnect. At some point, the faulty server will accept the socket and the TCP status will be set to NeedNegotiate. The first issued command accessing the share will start the negotiation (pid 5828 below), but the response will never arrive so other commands will be blocked waiting on the mutex (pid 55352). This patch checks for unresponsive servers also on the negotiate stage releasing the socket and reconnecting if the response is not received and checking again the tcp state when the mutex is acquired. PID: 55352 TASK: ffff880fd6cc02c0 CPU: 0 COMMAND: "ls" #0 [ffff880fd9add9f0] schedule at ffffffff81467eb9 #1 [ffff880fd9addb38] __mutex_lock_slowpath at ffffffff81468fe0 #2 [ffff880fd9addba8] mutex_lock at ffffffff81468b1a #3 [ffff880fd9addbc0] cifs_reconnect_tcon at ffffffffa042f905 [cifs] #4 [ffff880fd9addc60] smb_init at ffffffffa042faeb [cifs] #5 [ffff880fd9addca0] CIFSSMBQPathInfo at ffffffffa04360b5 [cifs] .... Which is waiting a mutex owned by: PID: 5828 TASK: ffff880fcc55e400 CPU: 0 COMMAND: "xxxx" #0 [ffff880fbfdc19b8] schedule at ffffffff81467eb9 #1 [ffff880fbfdc1b00] wait_for_response at ffffffffa044f96d [cifs] #2 [ffff880fbfdc1b60] SendReceive at ffffffffa04505ce [cifs] #3 [ffff880fbfdc1bb0] CIFSSMBNegotiate at ffffffffa0438d79 [cifs] #4 [ffff880fbfdc1c50] cifs_negotiate_protocol at ffffffffa043b383 [cifs] #5 [ffff880fbfdc1c80] cifs_reconnect_tcon at ffffffffa042f911 [cifs] #6 [ffff880fbfdc1d20] smb_init at ffffffffa042faeb [cifs] #7 [ffff880fbfdc1d60] CIFSSMBQFSInfo at ffffffffa0434eb0 [cifs] .... Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurélien Aptel <aaptel@suse.de> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2017-09-05fs: unexport vfs_readv and vfs_writevChristoph Hellwig1-3/+1
We've got no modular users left, and any potential modular user is better of with iov_iter based variants. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05fs: unexport vfs_read and vfs_writeChristoph Hellwig1-4/+0
No modular users left. Given that they take user pointers there is no good reason to export it to drivers to start with. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05fs: unexport __vfs_read/__vfs_writeChristoph Hellwig1-2/+0
No modular users left, and any new ones should use kernel_read/write or iov_iter variants instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05btrfs: switch write_buf to kernel_writeChristoph Hellwig1-14/+4
Instead of playing with the addressing limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05fs: make the buf argument to __kernel_write a void pointerChristoph Hellwig1-1/+1
This matches kernel_read and kernel_write and avoids any need for casts in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05fs: fix kernel_write prototypeChristoph Hellwig2-4/+4
Make the position an in/out argument like all the other read/write helpers and and make the buf argument a void pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05fs: fix kernel_read prototypeChristoph Hellwig9-45/+43
Use proper ssize_t and size_t types for the return value and count argument, move the offset last and make it an in/out argument like all other read/write helpers, and make the buf argument a void pointer to get rid of lots of casts in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05fs: move kernel_read to fs/read_write.cChristoph Hellwig2-17/+16
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05fs: move kernel_write to fs/read_write.cChristoph Hellwig2-17/+16
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05autofs4: switch autofs4_write to __kernel_writeChristoph Hellwig1-8/+1
Instead of playing games with the address limit.. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05block_dev: support RFW_NOWAIT on block device nodesChristoph Hellwig1-0/+5
All support is already there in the generic code, we just need to wire it up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-05fs: support RWF_NOWAIT for buffered readsChristoph Hellwig4-12/+17
This is based on the old idea and code from Milosz Tanski. With the aio nowait code it becomes mostly trivial now. Buffered writes continue to return -EOPNOTSUPP if RWF_NOWAIT is passed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04vfs: add flags to d_real()Miklos Szeredi2-4/+4
Add a separate flags argument (in addition to the open flags) to control the behavior of d_real(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-04cifs: Add support for writing attributes on SMB2+Ronnie Sahlberg7-4/+79
This adds support for writing extended attributes on SMB2+ shares. Attributes can be written using the setfattr command. RH-bz: 1110709 Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-04cifs: Add support for reading attributes on SMB2+Ronnie Sahlberg4-0/+169
SMB1 already has support to read attributes. This adds similar support to SMB2+. With this patch, tools such as 'getfattr' will now work with SMB2+ shares. RH-bz: 1110709 Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-04Merge branch 'locking-core-for-linus' of ↵Linus Torvalds2-16/+13
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Add 'cross-release' support to lockdep, which allows APIs like completions, where it's not the 'owner' who releases the lock, to be tracked. It's all activated automatically under CONFIG_PROVE_LOCKING=y. - Clean up (restructure) the x86 atomics op implementation to be more readable, in preparation of KASAN annotations. (Dmitry Vyukov) - Fix static keys (Paolo Bonzini) - Add killable versions of down_read() et al (Kirill Tkhai) - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini) - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra) - Remove smp_mb__before_spinlock() and convert its usages, introduce smp_mb__after_spinlock() (Peter Zijlstra) * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits) locking/lockdep/selftests: Fix mixed read-write ABBA tests sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK() acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures smp: Avoid using two cache lines for struct call_single_data locking/lockdep: Untangle xhlock history save/restore from task independence locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being futex: Remove duplicated code and fix undefined behaviour Documentation/locking/atomic: Finish the document... locking/lockdep: Fix workqueue crossrelease annotation workqueue/lockdep: 'Fix' flush_work() annotation locking/lockdep/selftests: Add mixed read-write ABBA tests mm, locking/barriers: Clarify tlb_flush_pending() barriers locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive locking/lockdep: Explicitly initialize wq_barrier::done::map locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING locking/refcounts, x86/asm: Implement fast refcount overflow protection locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease ...
2017-09-04Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - fix affine wakeups (Peter Zijlstra) - improve CPU onlining (and general bootup) scalability on systems with ridiculous number (thousands) of CPUs (Peter Zijlstra) - sched/numa updates (Rik van Riel) - sched/deadline updates (Byungchul Park) - sched/cpufreq enhancements and related cleanups (Viresh Kumar) - sched/debug enhancements (Xie XiuQi) - various fixes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) sched/debug: Optimize sched_domain sysctl generation sched/topology: Avoid pointless rebuild sched/topology, cpuset: Avoid spurious/wrong domain rebuilds sched/topology: Improve comments sched/topology: Fix memory leak in __sdt_alloc() sched/completion: Document that reinit_completion() must be called after complete_all() sched/autogroup: Fix error reporting printk text in autogroup_create() sched/fair: Fix wake_affine() for !NUMA_BALANCING sched/debug: Intruduce task_state_to_char() helper function sched/debug: Show task state in /proc/sched_debug sched/debug: Use task_pid_nr_ns in /proc/$pid/sched sched/core: Remove unnecessary initialization init_idle_bootup_task() sched/deadline: Change return value of cpudl_find() sched/deadline: Make find_later_rq() choose a closer CPU in topology sched/numa: Scale scan period with tasks in group and shared/private sched/numa: Slow down scan rate if shared faults dominate sched/pelt: Fix false running accounting sched: Mark pick_next_task_dl() and build_sched_domain() as static sched/cpupri: Don't re-initialize 'struct cpupri' sched/deadline: Don't re-initialize 'struct cpudl' ...
2017-09-04ovl: cleanup d_real for negativeMiklos Szeredi1-3/+0
d_real() is never called with a negative dentry. So remove the d_is_negative() check (which would never trigger anyway, since d_is_reg() returns false for a negative dentry). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-04Merge branch 'linus' into locking/core, to fix up conflictsIngo Molnar13-98/+140
Conflicts: mm/page_alloc.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-04utimes: Make utimes y2038 safeDeepa Dinamani1-11/+12
struct timespec is not y2038 safe on 32 bit machines. Replace timespec with y2038 safe struct timespec64. Note that the patch only changes the internals without modifying the syscall interfaces. This will be part of a separate series. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04Merge branch 'for-linus' of ↵Linus Torvalds2-26/+26
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc fixes from Al Viro: "Loose ends and regressions from the last merge window. Strictly speaking, only binfmt_flat thing is a build regression per se - the rest is 'only sparse cares about that' stuff" [ This came in before the 4.13 release and could have gone there, but it was late in the release and nothing seemed critical enough to care, so I'm pulling it in the 4.14 merge window instead - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: binfmt_flat: fix arch/m32r and arch/microblaze flat_put_addr_at_rp() compat_hdio_ioctl: Fix a declaration <linux/uaccess.h>: Fix copy_in_user() declaration annotate RWF_... flags teach SYSCALL_DEFINE/COMPAT_SYSCALL_DEFINE to handle __bitwise arguments
2017-09-03xfs: use kmem_free to free return value of kmem_zallocPan Bian1-1/+1
In function xfs_test_remount_options(), kfree() is used to free memory allocated by kmem_zalloc(). But it is better to use kmem_free(). Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-03xfs: open code end_buffer_async_write in xfs_finish_page_writebackChristoph Hellwig1-24/+47
Our loop in xfs_finish_page_writeback, which iterates over all buffer heads in a page and then calls end_buffer_async_write, which also iterates over all buffers in the page to check if any I/O is in flight is not only inefficient, but also potentially dangerous as end_buffer_async_write can cause the page and all buffers to be freed. Replace it with a single loop that does the work of end_buffer_async_write on a per-page basis. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-02xfs: don't set v3 xflags for v2 inodesChristoph Hellwig1-13/+25
Reject attempts to set XFLAGS that correspond to di_flags2 inode flags if the inode isn't a v3 inode, because di_flags2 only exists on v3. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-02xfs: fix compiler warningsDarrick J. Wong5-8/+11
Fix up all the compiler warnings that have crept in. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-09-02Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2-2/+27
Pull cifs version warning fix from Steve French: "As requested, additional kernel warning messages to clarify the default dialect changes" [ There is still some discussion about exactly which version should be the new default. Longer-term we have auto-negotiation coming, but that's not there yet.. - Linus ] * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: Fix warning messages when mounting to older servers
2017-09-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller18-153/+210
Three cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01xfs: simplify the rmap code in xfs_bmse_mergeDarrick J. Wong1-4/+3
In Christoph's patch to refactor xfs_bmse_merge, the updated rmap code does more work than it needs to (because map-extent auto-merges records). Remove the unnecessary unmap and save ourselves a deferred op. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-09-01xfs: remove unused flags arg from xfs_file_iomap_begin_delayEric Sandeen1-3/+1
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: fix incorrect log_flushed on fsyncAmir Goldstein1-7/+0
When calling into _xfs_log_force{,_lsn}() with a pointer to log_flushed variable, log_flushed will be set to 1 if: 1. xlog_sync() is called to flush the active log buffer AND/OR 2. xlog_wait() is called to wait on a syncing log buffers xfs_file_fsync() checks the value of log_flushed after _xfs_log_force_lsn() call to optimize away an explicit PREFLUSH request to the data block device after writing out all the file's pages to disk. This optimization is incorrect in the following sequence of events: Task A Task B ------------------------------------------------------- xfs_file_fsync() _xfs_log_force_lsn() xlog_sync() [submit PREFLUSH] xfs_file_fsync() file_write_and_wait_range() [submit WRITE X] [endio WRITE X] _xfs_log_force_lsn() xlog_wait() [endio PREFLUSH] The write X is not guarantied to be on persistent storage when PREFLUSH request in completed, because write A was submitted after the PREFLUSH request, but xfs_file_fsync() of task A will be notified of log_flushed=1 and will skip explicit flush. If the system crashes after fsync of task A, write X may not be present on disk after reboot. This bug was discovered and demonstrated using Josef Bacik's dm-log-writes target, which can be used to record block io operations and then replay a subset of these operations onto the target device. The test goes something like this: - Use fsx to execute ops of a file and record ops on log device - Every now and then fsync the file, store md5 of file and mark the location in the log - Then replay log onto device for each mark, mount fs and compare md5 of file to stored value Cc: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <jbacik@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: disable per-inode DAX flagChristoph Hellwig1-1/+2
Currently flag switching can be used to easily crash the kernel. Disable the per-inode DAX flag until that is sorted out. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: replace xfs_qm_get_rtblks with a direct call to xfs_bmap_count_leavesChristoph Hellwig1-32/+12
Use the existing functionality instead of directly poking into the extent list. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: rewrite xfs_bmap_count_leaves using xfs_iext_get_extentChristoph Hellwig2-11/+11
This avoids poking into the internals of the extent list. Also return the number of extents as the return value instead of an additional by reference argument, and make it available to callers outside of xfs_bmap_util.c Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: use xfs_iext_*_extent helpers in xfs_bmap_split_extent_atChristoph Hellwig1-16/+4
This abstracts the function away from details of the low-level extent list implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: use xfs_iext_*_extent helpers in xfs_bmap_shift_extentsChristoph Hellwig1-92/+88
This abstracts the function away from details of the low-level extent list implementation. Note that it seems like the previous implementation of rmap for the merge case was completely broken, but it no seems appear to trigger that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: move some code around inside xfs_bmap_shift_extentsChristoph Hellwig1-25/+29
For the first right move we need to look up next_fsb. That means our last fsb that contains next_fsb must also be the current extent, so take advantage of that by moving the code around a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove()Oleg Nesterov1-16/+26
The race was introduced by me in commit 971316f0503a ("epoll: ep_unregister_pollwait() can use the freed pwq->whead"). I did not realize that nothing can protect eventpoll after ep_poll_callback() sets ->whead = NULL, only whead->lock can save us from the race with ep_free() or ep_remove(). Move ->whead = NULL to the end of ep_poll_callback() and add the necessary barriers. TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even before this patch. Hopefully this explains use-after-free reported by syzcaller: BUG: KASAN: use-after-free in debug_spin_lock_before ... _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148 this is spin_lock(eventpoll->lock), ... Freed by task 17774: ... kfree+0xe8/0x2c0 mm/slub.c:3883 ep_free+0x22c/0x2a0 fs/eventpoll.c:865 Fixes: 971316f0503a ("epoll: ep_unregister_pollwait() can use the freed pwq->whead") Reported-by: 范龙飞 <long7573@126.com> Cc: stable@vger.kernel.org Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-01Introduce v3 namespaced file capabilitiesSerge E. Hallyn1-0/+6
Root in a non-initial user ns cannot be trusted to write a traditional security.capability xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the xattr, and execute the file with privilege on the host. However supporting file capabilities in a user namespace is very desirable. Not doing so means that any programs designed to run with limited privilege must continue to support other methods of gaining and dropping privilege. For instance a program installer must detect whether file capabilities can be assigned, and assign them if so but set setuid-root otherwise. The program in turn must know how to drop partial capabilities, and do so only if setuid-root. This patch introduces v3 of the security.capability xattr. It builds a vfs_ns_cap_data struct by appending a uid_t rootid to struct vfs_cap_data. This is the absolute uid_t (that is, the uid_t in user namespace which mounted the filesystem, usually init_user_ns) of the root id in whose namespaces the file capabilities may take effect. When a task asks to write a v2 security.capability xattr, if it is privileged with respect to the userns which mounted the filesystem, then nothing should change. Otherwise, the kernel will transparently rewrite the xattr as a v3 with the appropriate rootid. This is done during the execution of setxattr() to catch user-space-initiated capability writes. Subsequently, any task executing the file which has the noted kuid as its root uid, or which is in a descendent user_ns of such a user_ns, will run the file with capabilities. Similarly when asking to read file capabilities, a v3 capability will be presented as v2 if it applies to the caller's namespace. If a task writes a v3 security.capability, then it can provide a uid for the xattr so long as the uid is valid in its own user namespace, and it is privileged with CAP_SETFCAP over its namespace. The kernel will translate that rootid to an absolute uid, and write that to disk. After this, a task in the writer's namespace will not be able to use those capabilities (unless rootid was 0), but a task in a namespace where the given uid is root will. Only a single security.capability xattr may exist at a time for a given file. A task may overwrite an existing xattr so long as it is privileged over the inode. Note this is a departure from previous semantics, which required privilege to remove a security.capability xattr. This check can be re-added if deemed useful. This allows a simple setxattr to work, allows tar/untar to work, and allows us to tar in one namespace and untar in another while preserving the capability, without risking leaking privilege into a parent namespace. Example using tar: $ cp /bin/sleep sleepx $ mkdir b1 b2 $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1 $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2 $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx b2/sleepx = cap_sys_admin+ep # /opt/ltp/testcases/bin/getv3xattr b2/sleepx v3 xattr, rootid is 100001 A patch to linux-test-project adding a new set of tests for this functionality is in the nsfscaps branch at github.com/hallyn/ltp Changelog: Nov 02 2016: fix invalid check at refuse_fcap_overwrite() Nov 07 2016: convert rootid from and to fs user_ns (From ebiederm: mar 28 2017) commoncap.c: fix typos - s/v4/v3 get_vfs_caps_from_disk: clarify the fs_ns root access check nsfscaps: change the code split for cap_inode_setxattr() Apr 09 2017: don't return v3 cap for caps owned by current root. return a v2 cap for a true v2 cap in non-init ns Apr 18 2017: . Change the flow of fscap writing to support s_user_ns writing. . Remove refuse_fcap_overwrite(). The value of the previous xattr doesn't matter. Apr 24 2017: . incorporate Eric's incremental diff . move cap_convert_nscap to setxattr and simplify its usage May 8, 2017: . fix leaking dentry refcount in cap_inode_getsecurity Signed-off-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-09-01Merge tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-clientLinus Torvalds2-18/+18
Pull ceph fix from Ilya Dryomov: "ceph fscache page locking fix from Zheng, marked for stable" * tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client: ceph: fix readpage from fscache
2017-09-01xfs: use xfs_iext_get_extent in xfs_bmap_first_unusedChristoph Hellwig1-5/+7
Use the bmap abstraction instead of open-coding bmbt details here. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: switch xfs_bmap_local_to_extents to use xfs_iext_insertChristoph Hellwig1-4/+7
Use the helper instead of open coding it, to provide a better abstraction for the scalable extent list work. This also gets an additional assert and trace point for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: add a xfs_iext_update_extent helperChristoph Hellwig3-6/+20
This helper is used to update an extent record based on the extent index, and can be used to provide a level of abstractions between callers that want to modify in-core extent records and the details of the extent list implementation. Also switch all users of the xfs_bmbt_set_all(xfs_iext_get_ext(...)) pattern to this new helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: consolidate the various page fault handlersChristoph Hellwig2-65/+60
Add a new __xfs_filemap_fault helper that implements all four page fault callouts, and make these methods themselves small stubs that set the correct write_fault flag, and exit early for the non-DAX case for the hugepage related ones. Also remove the extra size checking in the pfn_fault path, which is now handled in the core DAX code. Life would be so much simpler if we only had one method for all this. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01iomap: return VM_FAULT_* codes from iomap_page_mkwriteChristoph Hellwig2-3/+2
All callers will need the VM_FAULT_* flags, so convert in the helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: relog dirty buffers during swapext bmbt owner changeBrian Foster2-18/+65
The owner change bmbt scan that occurs during extent swap operations does not handle ordered buffer failures. Buffers that cannot be marked ordered must be physically logged so previously dirty ranges of the buffer can be relogged in the transaction. Since the bmbt scan may need to process and potentially log a large number of blocks, we can't expect to complete this operation in a single transaction. Update extent swap to use a permanent transaction with enough log reservation to physically log a buffer. Update the bmbt scan to physically log any buffers that cannot be ordered and to terminate the scan with -EAGAIN. On -EAGAIN, the caller rolls the transaction and restarts the scan. Finally, update the bmbt scan helper function to skip bmbt blocks that already match the expected owner so they are not reprocessed after scan restarts. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [darrick: fix the xfs_trans_roll call] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: disallow marking previously dirty buffers as orderedBrian Foster2-3/+6
Ordered buffers are used in situations where the buffer is not physically logged but must pass through the transaction/logging pipeline for a particular transaction. As a result, ordered buffers are not unpinned and written back until the transaction commits to the log. Ordered buffers have a strict requirement that the target buffer must not be currently dirty and resident in the log pipeline at the time it is marked ordered. If a dirty+ordered buffer is committed, the buffer is reinserted to the AIL but not physically relogged at the LSN of the associated checkpoint. The buffer log item is assigned the LSN of the latest checkpoint and the AIL effectively releases the previously logged buffer content from the active log before the buffer has been written back. If the tail pushes forward and a filesystem crash occurs while in this state, an inconsistent filesystem could result. It is currently the caller responsibility to ensure an ordered buffer is not already dirty from a previous modification. This is unclear and error prone when not used in situations where it is guaranteed a buffer has not been previously modified (such as new metadata allocations). To facilitate general purpose use of ordered buffers, update xfs_trans_ordered_buf() to conditionally order the buffer based on state of the log item and return the status of the result. If the bli is dirty, do not order the buffer and return false. The caller must either physically log the buffer (having acquired the appropriate log reservation) or push it from the AIL to clean it before it can be marked ordered in the current transaction. Note that ordered buffers are currently only used in two situations: 1.) inode chunk allocation where previously logged buffers are not possible and 2.) extent swap which will be updated to handle ordered buffer failures in a separate patch. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>