Age | Commit message (Collapse) | Author | Files | Lines |
|
When a device gets removed, block device inode unhashed so that it is not
used anymore (bdget() will not find it anymore). Later when a new device
gets created with the same device number, we create new block device
inode. However there may be file system device inodes whose i_bdev still
points to the original block device inode and thus we get two active
block device inodes for the same device. They will share the same
gendisk so the only visible differences will be that page caches will
not be coherent and BDIs will be different (the old block device inode
still points to unregistered BDI).
Fix the problem by checking in bd_acquire() whether i_bdev still points
to active block device inode and re-lookup the block device if not. That
way any open of a block device happening after the old device has been
removed will get correct block device inode.
Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Pull block layer updates from Jens Axboe:
- blk-mq scheduling framework from me and Omar, with a port of the
deadline scheduler for this framework. A port of BFQ from Paolo is in
the works, and should be ready for 4.12.
- Various fixups and improvements to the above scheduling framework
from Omar, Paolo, Bart, me, others.
- Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
This allows us to export more information that helps debug hangs or
performance issues, without cluttering or abusing the sysfs API.
- Fixes for the sbitmap code, the scalable bitmap code that was
migrated from blk-mq, from Omar.
- Removal of the BLOCK_PC support in struct request, and refactoring of
carrying SCSI payloads in the block layer. This cleans up the code
nicely, and enables us to kill the SCSI specific parts of struct
request, shrinking it down nicely. From Christoph mainly, with help
from Hannes.
- Support for ranged discard requests and discard merging, also from
Christoph.
- Support for OPAL in the block layer, and for NVMe as well. Mainly
from Scott Bauer, with fixes/updates from various others folks.
- Error code fixup for gdrom from Christophe.
- cciss pci irq allocation cleanup from Christoph.
- Making the cdrom device operations read only, from Kees Cook.
- Fixes for duplicate bdi registrations and bdi/queue life time
problems from Jan and Dan.
- Set of fixes and updates for lightnvm, from Matias and Javier.
- A few fixes for nbd from Josef, using idr to name devices and a
workqueue deadlock fix on receive. Also marks Josef as the current
maintainer of nbd.
- Fix from Josef, overwriting queue settings when the number of
hardware queues is updated for a blk-mq device.
- NVMe fix from Keith, ensuring that we don't repeatedly mark and IO
aborted, if we didn't end up aborting it.
- SG gap merging fix from Ming Lei for block.
- Loop fix also from Ming, fixing a race and crash between setting loop
status and IO.
- Two block race fixes from Tahsin, fixing request list iteration and
fixing a race between device registration and udev device add
notifiations.
- Double free fix from cgroup writeback, from Tejun.
- Another double free fix in blkcg, from Hou Tao.
- Partition overflow fix for EFI from Alden Tondettar.
* tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
nvme: Check for Security send/recv support before issuing commands.
block/sed-opal: allocate struct opal_dev dynamically
block/sed-opal: tone down not supported warnings
block: don't defer flushes on blk-mq + scheduling
blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
blk-mq: don't special case flush inserts for blk-mq-sched
blk-mq-sched: don't add flushes to the head of requeue queue
blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
block: do not allow updates through sysfs until registration completes
lightnvm: set default lun range when no luns are specified
lightnvm: fix off-by-one error on target initialization
Maintainers: Modify SED list from nvme to block
Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
uapi: sed-opal fix IOW for activate lsp to use correct struct
cdrom: Make device operations read-only
elevator: fix loading wrong elevator type for blk-mq devices
cciss: switch to pci_irq_alloc_vectors
block/loop: fix race between I/O and set_status
blk-mq-sched: don't hold queue_lock when calling exit_icq
block: set make_request_fn manually in blk_mq_update_nr_hw_queues
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Robert Peterson:
"We've got eight GFS2 patches for this merge window:
- Andy Price submitted a patch to make gfs2_write_full_page a static
function.
- Dan Carpenter submitted a patch to fix a ERR_PTR thinko.
Three patches fix bugs related to deleting very large files, which
cause GFS2 to run out of journal space:
- The first one prevents GFS2 delete operation from requesting too
much journal space.
- The second one fixes a problem whereby GFS2 can hang because it
wasn't taking journal space demand into its calculations.
- The third one wakes up IO waiters when a flush is done to restart
processes stuck waiting for journal space to become available.
The final three patches are a performance improvement related to
spin_lock contention between multiple writers:
- The "tr_touched" variable was switched to a flag to be more atomic
and eliminate the possibility of some races.
- Function meta_lo_add was moved inline with its only caller to make
the code more readable and efficient.
- Contention on the gfs2_log_lock spinlock was greatly reduced by
avoiding the lock altogether in cases where we don't really need
it: buffers that already appear in the appropriate metadata list
for the journal. Many thanks to Steve Whitehouse for the ideas and
principles behind these patches"
* tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Make gfs2_write_full_page static
GFS2: Reduce contention on gfs2_log_lock
GFS2: Inline function meta_lo_add
GFS2: Switch tr_touched to flag in transaction
GFS2: Wake up io waiters whenever a flush is done
GFS2: Made logd daemon take into account log demand
GFS2: Limit number of transaction blocks requested for truncates
GFS2: Fix reference to ERR_PTR in gfs2_glock_iter_next
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fixes and cleanups from Jan Kara:
"Several small UDF fixes and cleanups and a small cleanup of fanotify
code"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: simplify the code of fanotify_merge
udf: simplify udf_ioctl()
udf: fix ioctl errors
udf: allow implicit blocksize specification during mount
udf: check partition reference in udf_read_inode()
udf: atomically read inode size
udf: merge module informations in super.c
udf: remove next_epos from udf_update_extent_cache()
udf: Factor out trimming of crtime
udf: remove empty condition
udf: remove unneeded line break
udf: merge bh free
udf: use pointer for kernel_long_ad argument
udf: use __packed instead of __attribute__ ((packed))
udf: Make stat on symlink report symlink length as st_size
fs/udf: make #ifdef UDF_PREALLOCATE unconditional
fs: udf: Replace CURRENT_TIME with current_time()
|
|
Pull CIFS/SMB3 updates from Steve French:
"Includes support for a critical SMB3 security feature: per-share
encryption from Pavel, and a cleanup from Jean Delvare.
Will have another cifs/smb3 merge next week"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Allow to switch on encryption with seal mount option
CIFS: Add capability to decrypt big read responses
CIFS: Decrypt and process small encrypted packets
CIFS: Add copy into pages callback for a read operation
CIFS: Add mid handle callback
CIFS: Add transform header handling callbacks
CIFS: Encrypt SMB3 requests before sending
CIFS: Enable encryption during session setup phase
CIFS: Add capability to transform requests before sending
CIFS: Separate RFC1001 length processing for SMB2 read
CIFS: Separate SMB2 sync header processing
CIFS: Send RFC1001 length in a separate iov
CIFS: Make send_cancel take rqst as argument
CIFS: Make SendReceive2() takes resp iov
CIFS: Separate SMB2 header structure
CIFS: Fix splice read for non-cached files
cifs: Add soft dependencies
cifs: Only select the required crypto modules
cifs: Simplify SMB2 and SMB311 dependencies
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"For this cycle we add support for the shutdown ioctl, which is
primarily used for testing, but which can be useful on production
systems when a scratch volume is being destroyed and the data on it
doesn't need to be saved.
This found (and we fixed) a number of bugs with ext4's recovery to
corrupted file system --- the bugs increased the amount of data that
could be potentially lost, and in the case of the inline data feature,
could cause the kernel to BUG.
Also included are a number of other bug fixes, including in ext4's
fscrypt, DAX, inline data support"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN
ext4: fix fencepost in s_first_meta_bg validation
ext4: don't BUG when truncating encrypted inodes on the orphan list
ext4: do not use stripe_width if it is not set
ext4: fix stripe-unaligned allocations
dax: assert that i_rwsem is held exclusive for writes
ext4: fix DAX write locking
ext4: add EXT4_IOC_GOINGDOWN ioctl
ext4: add shutdown bit and check for it
ext4: rename s_resize_flags to s_ext4_flags
ext4: return EROFS if device is r/o and journal replay is needed
ext4: preserve the needs_recovery flag when the journal is aborted
jbd2: don't leak modified metadata buffers on an aborted journal
ext4: fix inline data error paths
ext4: move halfmd4 into hash.c directly
ext4: fix use-after-iput when fscrypt contexts are inconsistent
jbd2: fix use after free in kjournald2()
ext4: fix data corruption in data=journal mode
ext4: trim allocation requests to group size
ext4: replace BUG_ON with WARN_ON in mb_find_extent()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt updates from Ted Ts'o:
"Various cleanups for the file system encryption feature"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: constify struct fscrypt_operations
fscrypt: properly declare on-stack completion
fscrypt: split supp and notsupp declarations into their own headers
fscrypt: remove redundant assignment of res
fscrypt: make fscrypt_operations.key_prefix a string
fscrypt: remove unused 'mode' member of fscrypt_ctx
ext4: don't allow encrypted operations without keys
fscrypt: make test_dummy_encryption require a keyring key
fscrypt: factor out bio specific functions
fscrypt: pass up error codes from ->get_context()
fscrypt: remove user-triggerable warning messages
fscrypt: use EEXIST when file already uses different policy
fscrypt: use ENOTDIR when setting encryption policy on nondirectory
fscrypt: use ENOKEY when file cannot be created w/o key
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement wraparound-safe refcount_t and kref_t types based on
generic atomic primitives (Peter Zijlstra)
- Improve and fix the ww_mutex code (Nicolai Hähnle)
- Add self-tests to the ww_mutex code (Chris Wilson)
- Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
Bueso)
- Micro-optimize the current-task logic all around the core kernel
(Davidlohr Bueso)
- Tidy up after recent optimizations: remove stale code and APIs,
clean up the code (Waiman Long)
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
fork: Fix task_struct alignment
locking/spinlock/debug: Remove spinlock lockup detection code
lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
lkdtm: Convert to refcount_t testing
kref: Implement 'struct kref' using refcount_t
refcount_t: Introduce a special purpose refcount type
sched/wake_q: Clarify queue reinit comment
sched/wait, rcuwait: Fix typo in comment
locking/mutex: Fix lockdep_assert_held() fail
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
locking/rwsem: Reinit wake_q after use
locking/rwsem: Remove unnecessary atomic_long_t casts
jump_labels: Move header guard #endif down where it belongs
locking/atomic, kref: Implement kref_put_lock()
locking/ww_mutex: Turn off __must_check for now
locking/atomic, kref: Avoid more abuse
locking/atomic, kref: Use kref_get_unless_zero() more
locking/atomic, kref: Kill kref_sub()
locking/atomic, kref: Add kref_read()
locking/atomic, kref: Add KREF_INIT()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this (fairly busy) cycle were:
- There was a class of scheduler bugs related to forgetting to update
the rq-clock timestamp which can cause weird and hard to debug
problems, so there's a new debug facility for this: which uncovered
a whole lot of bugs which convinced us that we want to keep the
debug facility.
(Peter Zijlstra, Matt Fleming)
- Various cputime related updates: eliminate cputime and use u64
nanoseconds directly, simplify and improve the arch interfaces,
implement delayed accounting more widely, etc. - (Frederic
Weisbecker)
- Move code around for better structure plus cleanups (Ingo Molnar)
- Move IO schedule accounting deeper into the scheduler plus related
changes to improve the situation (Tejun Heo)
- ... plus a round of sched/rt and sched/deadline fixes, plus other
fixes, updats and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
sched/core: Remove unlikely() annotation from sched_move_task()
sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
sched/topology: Split out scheduler topology code from core.c into topology.c
sched/core: Remove unnecessary #include headers
sched/rq_clock: Consolidate the ordering of the rq_clock methods
delayacct: Include <uapi/linux/taskstats.h>
sched/core: Clean up comments
sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
sched/clock: Add dummy clear_sched_clock_stable() stub function
sched/cputime: Remove generic asm headers
sched/cputime: Remove unused nsec_to_cputime()
s390, sched/cputime: Remove unused cputime definitions
powerpc, sched/cputime: Remove unused cputime definitions
s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
ia64, sched/cputime: Remove unused cputime definitions
ia64: Convert vtime to use nsec units directly
ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
sched/cputime: Remove jiffies based cputime
sched/cputime, vtime: Return nsecs instead of cputime_t to account
sched/cputime: Complete nsec conversion of tick based accounting
...
|
|
It's very likely the file system independent ioctl name will be
FS_IOC_SHUTDOWN, so let's use the same name for the ext4 ioctl name.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Nothing exciting, just the usual pile of fixes, updates and cleanups:
- A bunch of clocksource driver updates
- Removal of CONFIG_TIMER_STATS and the related /proc file
- More posix timer slim down work
- A scalability enhancement in the tick broadcast code
- Math cleanups"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
hrtimer: Catch invalid clockids again
math64, tile: Fix build failure
clocksource/drivers/arm_arch_timer:: Mark cyclecounter __ro_after_init
timerfd: Protect the might cancel mechanism proper
timer_list: Remove useless cast when printing
time: Remove CONFIG_TIMER_STATS
clocksource/drivers/arm_arch_timer: Work around Hisilicon erratum 161010101
clocksource/drivers/arm_arch_timer: Introduce generic errata handling infrastructure
clocksource/drivers/arm_arch_timer: Remove fsl-a008585 parameter
clocksource/drivers/arm_arch_timer: Add dt binding for hisilicon-161010101 erratum
clocksource/drivers/ostm: Add renesas-ostm timer driver
clocksource/drivers/ostm: Document renesas-ostm timer DT bindings
clocksource/drivers/tcb_clksrc: Use 32 bit tcb as sched_clock
clocksource/drivers/gemini: Add driver for the Cortina Gemini
clocksource: add DT bindings for Cortina Gemini
clockevents: Add a clkevt-of mechanism like clksrc-of
tick/broadcast: Reduce lock cacheline contention
timers: Omit POSIX timer stuff from task_struct when disabled
x86/timer: Make delay() work during early bootup
delay: Add explanation of udelay() inaccuracy
...
|
|
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Flags (PIPE_BUF_FLAG_PACKET, PIPE_BUF_FLAG_GIFT) could remain on the
unused part of the pipe ring buffer. Previously splice_to_pipe() left
the flags value alone, which could result in incorrect behavior.
Uninitialized flags appears to have been there from the introduction of
the splice syscall.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 2.6.17+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: d82718e348fe ("fuse_dev_splice_read(): switch to add_to_pipe()")
Cc: <stable@vger.kernel.org> # 4.9+
|
|
There is a potential race between fuse_dev_do_write()
and request_wait_answer() contexts as shown below:
TASK 1:
__fuse_request_send():
|--spin_lock(&fiq->waitq.lock);
|--queue_request();
|--spin_unlock(&fiq->waitq.lock);
|--request_wait_answer():
|--if (test_bit(FR_SENT, &req->flags))
<gets pre-empted after it is validated true>
TASK 2:
fuse_dev_do_write():
|--clears bit FR_SENT,
|--request_end():
|--sets bit FR_FINISHED
|--spin_lock(&fiq->waitq.lock);
|--list_del_init(&req->intr_entry);
|--spin_unlock(&fiq->waitq.lock);
|--fuse_put_request();
|--queue_interrupt();
<request gets queued to interrupts list>
|--wake_up_locked(&fiq->waitq);
|--wait_event_freezable();
<as FR_FINISHED is set, it returns and then
the caller frees this request>
Now, the next fuse_dev_do_read(), see interrupts list is not empty
and then calls fuse_read_interrupt() which tries to access the request
which is already free'd and gets the below crash:
[11432.401266] Unable to handle kernel paging request at virtual address
6b6b6b6b6b6b6b6b
...
[11432.418518] Kernel BUG at ffffff80083720e0
[11432.456168] PC is at __list_del_entry+0x6c/0xc4
[11432.463573] LR is at fuse_dev_do_read+0x1ac/0x474
...
[11432.679999] [<ffffff80083720e0>] __list_del_entry+0x6c/0xc4
[11432.687794] [<ffffff80082c65e0>] fuse_dev_do_read+0x1ac/0x474
[11432.693180] [<ffffff80082c6b14>] fuse_dev_read+0x6c/0x78
[11432.699082] [<ffffff80081d5638>] __vfs_read+0xc0/0xe8
[11432.704459] [<ffffff80081d5efc>] vfs_read+0x90/0x108
[11432.709406] [<ffffff80081d67f0>] SyS_read+0x58/0x94
As FR_FINISHED bit is set before deleting the intr_entry with input
queue lock in request completion path, do the testing of this flag and
queueing atomically with the same lock in queue_interrupt().
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: fd22d62ed0c3 ("fuse: no fc->lock for iqueue parts")
Cc: <stable@vger.kernel.org> # 4.2+
|
|
It is OK for s_first_meta_bg to be equal to the number of block group
descriptor blocks. (It rarely happens, but it shouldn't cause any
problems.)
https://bugzilla.kernel.org/show_bug.cgi?id=194567
Fixes: 3a4b77cd47bb837b8557595ec7425f281f2ca1fe
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
Fix a BUG when the kernel tries to mount a file system constructed as
follows:
echo foo > foo.txt
mke2fs -Fq -t ext4 -O encrypt foo.img 100
debugfs -w foo.img << EOF
write foo.txt a
set_inode_field a i_flags 0x80800
set_super_value s_last_orphan 12
quit
EOF
root@kvm-xfstests:~# mount -o loop foo.img /mnt
[ 160.238770] ------------[ cut here ]------------
[ 160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874!
[ 160.240106] invalid opcode: 0000 [#1] SMP
[ 160.240106] Modules linked in:
[ 160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G W 4.10.0-rc3-00034-gcdd33b941b67 #227
[ 160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[ 160.240106] task: f4518000 task.stack: f47b6000
[ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4
[ 160.240106] EFLAGS: 00010246 CPU: 0
[ 160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007
[ 160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac
[ 160.240106] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0
[ 160.240106] Call Trace:
[ 160.240106] ext4_truncate+0x1e9/0x3e5
[ 160.240106] ext4_fill_super+0x286f/0x2b1e
[ 160.240106] ? set_blocksize+0x2e/0x7e
[ 160.240106] mount_bdev+0x114/0x15f
[ 160.240106] ext4_mount+0x15/0x17
[ 160.240106] ? ext4_calculate_overhead+0x39d/0x39d
[ 160.240106] mount_fs+0x58/0x115
[ 160.240106] vfs_kern_mount+0x4b/0xae
[ 160.240106] do_mount+0x671/0x8c3
[ 160.240106] ? _copy_from_user+0x70/0x83
[ 160.240106] ? strndup_user+0x31/0x46
[ 160.240106] SyS_mount+0x57/0x7b
[ 160.240106] do_int80_syscall_32+0x4f/0x61
[ 160.240106] entry_INT80_32+0x2f/0x2f
[ 160.240106] EIP: 0xb76b919e
[ 160.240106] EFLAGS: 00000246 CPU: 0
[ 160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8
[ 160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660
[ 160.240106] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9
[ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac
[ 160.317241] ---[ end trace d6a773a375c810a5 ]---
The problem is that when the kernel tries to truncate an inode in
ext4_truncate(), it tries to clear any on-disk data beyond i_size.
Without the encryption key, it can't do that, and so it triggers a
BUG.
E2fsck does *not* provide this service, and in practice most file
systems have their orphan list processed by e2fsck, so to avoid
crashing, this patch skips this step if we don't have access to the
encryption key (which is the case when processing the orphan list; in
all other cases, we will have the encryption key, or the kernel
wouldn't have allowed the file to be opened).
An open question is whether the fact that e2fsck isn't clearing the
bytes beyond i_size causing problems --- and if we've lived with it
not doing it for so long, can we drop this from the kernel replay of
the orphan list in all cases (not just when we don't have the key for
encrypted inodes).
Addresses-Google-Bug: #35209576
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"This has two last minute fixes. The highest priority here is a
regression fix for the decompression code, but we also fixed up a
problem with the 32-bit compat ioctls.
The decompression bug could hand back the wrong data on big reads when
zlib was used. I have a larger cleanup to make the math here less
error prone, but at this stage in the release Omar's patch is the best
choice"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix btrfs_decompress_buf2page()
btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls
|
|
If btrfs_decompress_buf2page() is handed a bio with its page in the
middle of the working buffer, then we adjust the offset into the working
buffer. After we copy into the bio, we advance the iterator by the
number of bytes we copied. Then, we have some logic to handle the case
of discontiguous pages and adjust the offset into the working buffer
again. However, if we didn't advance the bio to a new page, we may enter
this case in error, essentially repeating the adjustment that we already
made when we entered the function. The end result is bogus data in the
bio.
Previously, we only checked for this case when we advanced to a new
page, but the conversion to bio iterators changed that. This restores
the old, correct behavior.
A case I saw when testing with zlib was:
buf_start = 42769
total_out = 46865
working_bytes = total_out - buf_start = 4096
start_byte = 45056
The condition (total_out > start_byte && buf_start < start_byte) is
true, so we adjust the offset:
buf_offset = start_byte - buf_start = 2287
working_bytes -= buf_offset = 1809
current_buf_start = buf_start = 42769
Then, we copy
bytes = min(bvec.bv_len, PAGE_SIZE - buf_offset, working_bytes) = 1809
buf_offset += bytes = 4096
working_bytes -= bytes = 0
current_buf_start += bytes = 44578
After bio_advance(), we are still in the same page, so start_byte is the
same. Then, we check (total_out > start_byte && current_buf_start < start_byte),
which is true! So, we adjust the values again:
buf_offset = start_byte - buf_start = 2287
working_bytes = total_out - start_byte = 1809
current_buf_start = buf_start + buf_offset = 45056
But note that working_bytes was already zero before this, so we should
have stopped copying.
Fixes: 974b1adc3b10 ("btrfs: use bio iterators for the decompression handlers")
Reported-by: Pat Erley <pat-lkml@erley.org>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
|
|
Pull nfsd revert from Bruce Fields:
"This patch turned out to have a couple problems. The problems are
fixable, but at least one of the fixes is a little ugly. The original
bug has always been there, so we can wait another week or two to get
this right"
* tag 'nfsd-4.10-3' of git://linux-nfs.org/~bfields/linux:
nfsd: Revert "nfsd: special case truncates some more"
|
|
The handling of the might_cancel queueing is not properly protected, so
parallel operations on the file descriptor can race with each other and
lead to list corruptions or use after free.
Protect the context for these operations with a seperate lock.
The wait queue lock cannot be reused for this because that would create a
lock inversion scenario vs. the cancel lock. Replacing might_cancel with an
atomic (atomic_t or atomic bit) does not help either because it still can
race vs. the actual list operation.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "linux-fsdevel@vger.kernel.org"
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311521430.3457@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Avoid using stripe_width for sbi->s_stripe value if it is not actually
set. It prevents using the stride for sbi->s_stripe.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When a filesystem is created using:
mkfs.ext4 -b 4096 -E stride=512 <dev>
and we try to allocate 64MB extent, we will end up directly in
ext4_mb_complex_scan_group(). This is because the request is detected
as power-of-two allocation (so we start in ext4_mb_regular_allocator()
with ac_criteria == 0) however the check before
ext4_mb_simple_scan_group() refuses the direct buddy scan because the
allocation request is too large. Since cr == 0, the check whether we
should use ext4_mb_scan_aligned() fails as well and we fall back to
ext4_mb_complex_scan_group().
Fix the problem by checking for upper limit on power-of-two requests
directly when detecting them.
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch incorrectly attempted nested mnt_want_write, and incorrectly
disabled nfsd's owner override for truncate. We'll fix those problems
and make another attempt soon, for the moment I think the safest is to
revert.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
We'll OOPS in ramoops_get_next_prz() if the platform didn't ask for any
ftrace zones (i.e., cxt->fprzs will be NULL). Let's just skip this
entire FTRACE section if there's no 'fprzs'.
Regression seen on a coreboot/depthcharge-based Chromebook.
Fixes: 2fbea82bbb89 ("pstore: Merge per-CPU ftrace records into one")
Cc: Joel Fernandes <joelaf@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Make sure all callers follow the same locking protocol, given that DAX
transparantly replaced the normal buffered I/O path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
Unlike O_DIRECT DAX is not an optional opt-in feature selected by the
application, so we'll have to provide the traditional synchronіzation
of overlapping writes as we do for buffered writes.
This was broken historically for DAX, but got fixed for ext2 and XFS
as part of the iomap conversion. Fix up ext4 as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
Commit 4c63c2454ef incorrectly assumed that returning -ENOIOCTLCMD would
cause the native ioctl to be called. The ->compat_ioctl callback is
expected to handle all ioctls, not just compat variants. As a result,
when using 32-bit userspace on 64-bit kernels, everything except those
three ioctls would return -ENOTTY.
Fixes: 4c63c2454ef ("btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl")
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Richard Weinberger <richard@nod.at>
|
|
Commit 6326fec1122c ("mm: Use owner_priv bit for PageSwapCache, valid
when PageSwapBacked") aliased PG_swapcache to PG_owner_priv_1 (and
depending on PageSwapBacked being true).
As a result, the KPF_SWAPCACHE bit in '/proc/kpageflags' should now be
synthesized, instead of being shown on unrelated pages which just happen
to have PG_owner_priv_1 set.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When a completion is declared on-stack we have to use
COMPLETION_INITIALIZER_ONSTACK().
Fixes: 0b81d07790726 ("fs crypto: move per-file encryption from f2fs
tree to fs/crypto")
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Previously, each filesystem configured without encryption support would
define all the public fscrypt functions to their notsupp_* stubs. This
list of #defines had to be updated in every filesystem whenever a change
was made to the public fscrypt functions. To make things more
maintainable now that we have three filesystems using fscrypt, split the
old header fscrypto.h into several new headers. fscrypt_supp.h contains
the real declarations and is included by filesystems when configured
with encryption support, whereas fscrypt_notsupp.h contains the inline
stubs and is included by filesystems when configured without encryption
support. fscrypt_common.h contains common declarations needed by both.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
res is assigned to sizeof(ctx), however, this is unused and res
is updated later on without that assigned value to res ever being
used. Remove this redundant assignment.
Fixes CoverityScan CID#1395546 "Unused value"
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This ioctl is modeled after the xfs's XFS_IOC_GOINGDOWN ioctl. (In
fact, it uses the same code points.)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Add a shutdown bit that will cause ext4 processing to fail immediately
with EIO.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We are currently using one bit in s_resize_flags; rename it in order
to allow more of the bits in that unsigned long for other purposes.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
If the file system requires journal recovery, and the device is
read-ony, return EROFS to the mount system call. This allows xfstests
generic/050 to pass.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
If the journal is aborted, the needs_recovery feature flag should not
be removed. Otherwise, it's the journal might not get replayed and
this could lead to more data getting lost.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
If the journal has been aborted, we shouldn't mark the underlying
buffer head as dirty, since that will cause the metadata block to get
modified. And if the journal has been aborted, we shouldn't allow
this since it will almost certainly lead to a corrupted file system.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
The write_end() function must always unlock the page and drop its ref
count, even on an error.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path
__alloc_pages_nodemask+0x436/0x4d0
alloc_pages_current+0x97/0x1b0
__page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728
pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331
grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773
iomap_write_begin+0x50/0xd0 fs/iomap.c:118
iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190
? iomap_write_end+0x80/0x80 fs/iomap.c:150
iomap_apply+0xb3/0x130 fs/iomap.c:79
iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
? iomap_write_end+0x80/0x80
xfs_file_buffered_aio_write+0x132/0x390 [xfs]
? remove_wait_queue+0x59/0x60
xfs_file_write_iter+0x90/0x130 [xfs]
__vfs_write+0xe5/0x140
vfs_write+0xc7/0x1f0
? syscall_trace_enter+0x1d0/0x380
SyS_write+0x58/0xc0
do_syscall_64+0x6c/0x200
entry_SYSCALL64_slow_path+0x25/0x25
the oom victim has access to all memory reserves to make a forward
progress to exit easier. But iomap_file_buffered_write and other
callers of iomap_apply loop to complete the full request. We need to
check for fatal signals and back off with a short write instead.
As the iomap_apply delegates all the work down to the actor we have to
hook into those. All callers that work with the page cache are calling
iomap_write_begin so we will check for signals there. dax_iomap_actor
has to handle the situation explicitly because it copies data to the
userspace directly. Other callers like iomap_page_mkwrite work on a
single page or iomap_fiemap_actor do not allocate memory based on the
given len.
Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
"out" label was only returning error code.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Currently, lsattr for instance in udf directory gives
"udf: Invalid argument While reading flags on ..."
This patch returns -ENOIOCTLCMD
when command is unknown to have more accurate message like this:
"Inappropriate ioctl for device While reading flags on ..."
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
It only gets called from aops.c and doesn't appear in any headers.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Pull nfsd fixes from Bruce Fields:
"Three more miscellaneous nfsd bugfixes"
* tag 'nfsd-4.10-2' of git://linux-nfs.org/~bfields/linux:
svcrpc: fix oops in absence of krb5 module
nfsd: special case truncates some more
NFSD: Fix a null reference case in find_or_create_lock_stateid()
|
|
We don't always have easy access to the dentry of a file or directory we
created in debugfs. Add a helper which allows us to get a dentry we
previously created.
The motivation for this change is a problem with blktrace and the blk-mq
debugfs entries introduced in 07e4fead45e6 ("blk-mq: create debugfs
directory tree"). Namely, in some cases, the directory that blktrace
needs to create may already exist, but in other cases, it may not. We
_could_ rely on a bunch of implied knowledge to decide whether to create
the directory or not, but it's much cleaner on our end to just look it
up.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The "half md4" transform should not be used by any new code. And
fortunately, it's only used now by ext4. Since ext4 supports several
hashing methods, at some point it might be desirable to move to
something like SipHash. As an intermediate step, remove half md4 from
cryptohash.h and lib, and make it just a local function in ext4's
hash.c. There's precedent for doing this; the other function ext can use
for its hashes -- TEA -- is also implemented in the same place. Also, by
being a local function, this might allow gcc to perform some additional
optimizations.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
blk_get_backing_dev_info() is now a simple dereference. Remove that
function and simplify some code around that.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Currenly blk_get_backing_dev_info() is not safe to be called when the
block device is not open as bdev->bd_disk is NULL in that case. However
inode_to_bdi() uses this function and may be call called from flusher
worker or other writeback related functions without bdev being open
which leads to crashes such as:
[113031.075540] Unable to handle kernel paging request for data at address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
Signed-off-by: Jens Axboe <axboe@fb.com>
|