summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2016-04-30ext4: clean up error handling when orphan list is corruptedTheodore Ts'o1-27/+22
Instead of just printing warning messages, if the orphan list is corrupted, declare the file system is corrupted. If there are any reserved inodes in the orphaned inode list, declare the file system corrupted and stop right away to avoid doing more potential damage to the file system. Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-30ext4: fix hang when processing corrupted orphaned inode listTheodore Ts'o1-4/+6
If the orphaned inode list contains inode #5, ext4_iget() returns a bad inode (since the bootloader inode should never be referenced directly). Because of the bad inode, we end up processing the inode repeatedly and this hangs the machine. This can be reproduced via: mke2fs -t ext4 /tmp/foo.img 100 debugfs -w -R "ssv last_orphan 5" /tmp/foo.img mount -o loop /tmp/foo.img /mnt (But don't do this if you are using an unpatched kernel if you care about the system staying functional. :-) This bug was found by the port of American Fuzzy Lop into the kernel to find file system problems[1]. (Since it *only* happens if inode #5 shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not surprising that AFL needed two hours before it found it.) [1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf Cc: stable@vger.kernel.org Reported by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-27ext4: remove trailing \n from ext4_warning/ext4_error callsJakub Wilk8-10/+10
Messages passed to ext4_warning() or ext4_error() don't need trailing newlines, because these function add the newlines themselves. Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
2016-04-26ext4: fix races between changing inode journal mode and ext4_writepagesDaeho Jeong3-3/+20
In ext4, there is a race condition between changing inode journal mode and ext4_writepages(). While ext4_writepages() is executed on a non-journalled mode inode, the inode's journal mode could be enabled by ioctl() and then, some pages dirtied after switching the journal mode will be still exposed to ext4_writepages() in non-journaled mode. To resolve this problem, we use fs-wide per-cpu rw semaphore by Jan Kara's suggestion because we don't want to waste ext4_inode_info's space for this extra rare case. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2016-04-26ext4: handle unwritten or delalloc buffers before enabling data journalingDaeho Jeong1-11/+20
We already allocate delalloc blocks before changing the inode mode into "per-file data journal" mode to prevent delalloc blocks from remaining not allocated, but another issue concerned with "BH_Unwritten" status still exists. For example, by fallocate(), several buffers' status change into "BH_Unwritten", but these buffers cannot be processed by ext4_alloc_da_blocks(). So, they still remain in unwritten status after per-file data journaling is enabled and they cannot be changed into written status any more and, if they are journaled and eventually checkpointed, these unwritten buffer will cause a kernel panic by the below BUG_ON() function of submit_bh_wbc() when they are submitted during checkpointing. static int submit_bh_wbc(int rw, struct buffer_head *bh,... { ... BUG_ON(buffer_unwritten(bh)); Moreover, when "dioread_nolock" option is enabled, the status of a buffer is changed into "BH_Unwritten" after write_begin() completes and the "BH_Unwritten" status will be cleared after I/O is done. Therefore, if a buffer's status is changed into unwrutten but the buffer's I/O is not submitted and completed, it can cause the same problem after enabling per-file data journaling. You can easily generate this bug by executing the following command. ./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269 To resolve these problems and define a boundary between the previous mode and per-file data journaling mode, we need to flush and wait all the I/O of buffers of a file before enabling per-file data journaling of the file. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2016-04-26ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()Theodore Ts'o1-2/+7
The function jbd2_journal_extend() takes as its argument the number of new credits to be added to the handle. We weren't taking into account the currently unused handle credits; worse, we would try to extend the handle by N credits when it had N credits available. In the case where jbd2_journal_extend() fails because the transaction is too large, when jbd2_journal_restart() gets called, the N credits owned by the handle gets returned to the transaction, and the transaction commit is asynchronously requested, and then start_this_handle() will be able to successfully attach the handle to the current transaction since the required credits are now available. This is mostly harmless, but since ext4_ext_truncate_extend_restart() returns EAGAIN, the truncate machinery will once again try to call ext4_ext_truncate_extend_restart(), which will do the above sequence over and over again until the transaction has committed. This was found while I was debugging a lockup in caused by running xfstests generic/074 in the data=journal case. I'm still not sure why we ended up looping forever, which suggests there may still be another bug hiding in the transaction accounting machinery, but this commit prevents us from looping in the first place. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24ext4: do not ask jbd2 to write data for delalloc buffersJan Kara4-5/+22
Currently we ask jbd2 to write all dirty allocated buffers before committing a transaction when doing writeback of delay allocated blocks. However this is unnecessary since we move all pages to writeback state before dropping a transaction handle and then submit all the necessary IO. We still need the transaction commit to wait for all the outstanding writeback before flushing disk caches during transaction commit to avoid data exposure issues though. Use the new jbd2 capability and ask it to only wait for outstanding writeback during transaction commit when writing back data in ext4_writepages(). Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24jbd2: add support for avoiding data writes during transaction commitsJan Kara5-7/+27
Currently when filesystem needs to make sure data is on permanent storage before committing a transaction it adds inode to transaction's inode list. During transaction commit, jbd2 writes back all dirty buffers that have allocated underlying blocks and waits for the IO to finish. However when doing writeback for delayed allocated data, we allocate blocks and immediately submit the data. Thus asking jbd2 to write dirty pages just unnecessarily adds more work to jbd2 possibly writing back other redirtied blocks. Add support to jbd2 to allow filesystem to ask jbd2 to only wait for outstanding data writes before committing a transaction and thus avoid unnecessary writes. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24ext4: remove EXT4_STATE_ORDERED_MODEJan Kara2-5/+1
This flag is just duplicating what ext4_should_order_data() tells you and is used in a single place. Furthermore it doesn't reflect changes to inode data journalling flag so it may be possibly misleading. Just remove it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24ext4: fix data exposure after a crashJan Kara1-9/+15
Huang has reported that in his powerfail testing he is seeing stale block contents in some of recently allocated blocks although he mounts ext4 in data=ordered mode. After some investigation I have found out that indeed when delayed allocation is used, we don't add inode to transaction's list of inodes needing flushing before commit. Originally we were doing that but commit f3b59291a69d removed the logic with a flawed argument that it is not needed. The problem is that although for delayed allocated blocks we write their contents immediately after allocating them, there is no guarantee that the IO scheduler or device doesn't reorder things and thus transaction allocating blocks and attaching them to inode can reach stable storage before actual block contents. Actually whenever we attach freshly allocated blocks to inode using a written extent, we should add inode to transaction's ordered inode list to make sure we properly wait for block contents to be written before committing the transaction. So that is what we do in this patch. This also handles other cases where stale data exposure was possible - like filling hole via mmap in data=ordered,nodelalloc mode. The only exception to the above rule are extending direct IO writes where blkdev_direct_IO() waits for IO to complete before increasing i_size and thus stale data exposure is not possible. For now we don't complicate the code with optimizing this special case since the overhead is pretty low. In case this is observed to be a performance problem we can always handle it using a special flag to ext4_map_blocks(). CC: stable@vger.kernel.org Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com> Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24ext4: allow readdir()'s of large empty directories to be interruptedTheodore Ts'o4-0/+18
If a directory has a large number of empty blocks, iterating over all of them can take a long time, leading to scheduler warnings and users getting irritated when they can't kill a process in the middle of one of these long-running readdir operations. Fix this by adding checks to ext4_readdir() and ext4_htree_fill_tree(). This was reverted earlier due to a typo in the original commit where I experimented with using signal_pending() instead of fatal_signal_pending(). The test was in the wrong place if we were going to return signal_pending() since we would end up returning duplicant entries. See 9f2394c9be47 for a more detailed explanation. Added fix as suggested by Linus to check for signal_pending() in in the filldir() functions. Reported-by: Benjamin LaHaise <bcrl@kvack.org> Google-Bug-Id: 27880676 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-17Merge tag 'driver-core-4.6-rc4' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull misc fixes from Greg KH: "Here are three small fixes for 4.6-rc4. Two fix up some lz4 issues with big endian systems, and the remaining one resolves a minor debugfs issue that was reported. All have been in linux-next with no reported issues" * tag 'driver-core-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: lib: lz4: cleanup unaligned access efficiency detection lib: lz4: fixed zram with lz4 on big endian machines debugfs: Make automount point inodes permanently empty
2016-04-15Merge tag 'for-linus-4.6-rc4' of ↵Linus Torvalds4-25/+56
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs/fscrypto fixes from Jaegeuk Kim: "In addition to f2fs/fscrypto fixes, I've added one patch which prevents RCU mode lookup in d_revalidate, as Al mentioned. These patches fix f2fs and fscrypto based on -rc3 bug fixes in ext4 crypto, which have not yet been fully propagated as follows. - use of dget_parent and file_dentry to avoid crashes - disallow RCU-mode lookup in d_invalidate - disallow -ENOMEM in the core data encryption path" * tag 'for-linus-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: ext4/fscrypto: avoid RCU lookup in d_revalidate fscrypto: don't let data integrity writebacks fail with ENOMEM f2fs: use dget_parent and file_dentry in f2fs_file_open fscrypto: use dget_parent() in fscrypt_d_revalidate()
2016-04-14Make file credentials available to the seqfile interfacesLinus Torvalds1-3/+4
A lot of seqfile users seem to be using things like %pK that uses the credentials of the current process, but that is actually completely wrong for filesystem interfaces. The unix semantics for permission checking files is to check permissions at _open_ time, not at read or write time, and that is not just a small detail: passing off stdin/stdout/stderr to a suid application and making the actual IO happen in privileged context is a classic exploit technique. So if we want to be able to look at permissions at read time, we need to use the file open credentials, not the current ones. Normal file accesses can just use "f_cred" (or any of the helper functions that do that, like file_ns_capable()), but the seqfile interfaces do not have any such options. It turns out that seq_file _does_ save away the user_ns information of the file, though. Since user_ns is just part of the full credential information, replace that special case with saving off the cred pointer instead, and suddenly seq_file has all the permission information it needs. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-13ext4/fscrypto: avoid RCU lookup in d_revalidateJaegeuk Kim2-0/+8
As Al pointed, d_revalidate should return RCU lookup before using d_inode. This was originally introduced by: commit 34286d666230 ("fs: rcu-walk aware d_revalidate method"). Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: stable <stable@vger.kernel.org>
2016-04-13debugfs: Make automount point inodes permanently emptySeth Forshee1-1/+1
Starting with 4.1 the tracing subsystem has its own filesystem which is automounted in the tracing subdirectory of debugfs. Prior to this debugfs could be bind mounted in a cloned mount namespace, but if tracefs has been mounted under debugfs this now fails because there is a locked child mount. This creates a regression for container software which bind mounts debugfs to satisfy the assumption of some userspace software. In other pseudo filesystems such as proc and sysfs we're already creating mountpoints like this in such a way that no dirents can be created in the directories, allowing them to be exceptions to some MNT_LOCKED tests. In fact we're already do this for the tracefs mountpoint in sysfs. Do the same in debugfs_create_automount(), since the intention here is clearly to create a mountpoint. This fixes the regression, as locked child mounts on permanently empty directories do not cause a bind mount to fail. Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12fscrypto: don't let data integrity writebacks fail with ENOMEMJaegeuk Kim2-19/+33
This patch fixes the issue introduced by the ext4 crypto fix in a same manner. For F2FS, however, we flush the pending IOs and wait for a while to acquire free memory. Fixes: c9af28fdd4492 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM") Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-12f2fs: use dget_parent and file_dentry in f2fs_file_openJaegeuk Kim1-3/+7
This patch synced with the below two ext4 crypto fixes together. In 4.6-rc1, f2fs newly introduced accessing f_path.dentry which crashes overlayfs. To fix, now we need to use file_dentry() to access that field. Fixes: c0a37d487884 ("ext4: use file_dentry()") Fixes: 9dd78d8c9a7b ("ext4: use dget_parent() in ext4_file_open()") Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-12fscrypto: use dget_parent() in fscrypt_d_revalidate()Jaegeuk Kim1-3/+8
This patch updates fscrypto along with the below ext4 crypto change. Fixes: 3d43bcfef5f0 ("ext4 crypto: use dget_parent() in ext4_d_revalidate()") Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-11Revert "ext4: allow readdir()'s of large empty directories to be interrupted"Linus Torvalds2-10/+0
This reverts commit 1028b55bafb7611dda1d8fed2aeca16a436b7dff. It's broken: it makes ext4 return an error at an invalid point, causing the readdir wrappers to write the the position of the last successful directory entry into the position field, which means that the next readdir will now return that last successful entry _again_. You can only return fatal errors (that terminate the readdir directory walk) from within the filesystem readdir functions, the "normal" errors (that happen when the readdir buffer fills up, for example) happen in the iterorator where we know the position of the actual failing entry. I do have a very different patch that does the "signal_pending()" handling inside the iterator function where it is allowable, but while that one passes all the sanity checks, I screwed up something like four times while emailing it out, so I'm not going to commit it today. So my track record is not good enough, and the stars will have to align better before that one gets committed. And it would be good to get some review too, of course, since celestial alignments are always an iffy debugging model. IOW, let's just revert the commit that caused the problem for now. Reported-by: Greg Thelen <gthelen@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-09Merge branch 'for-linus-4.6' of ↵Linus Torvalds8-32/+215
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are bug fixes, including a really old fsync bug, and a few trace points to help us track down problems in the quota code" * 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix file/data loss caused by fsync after rename and new inode btrfs: Reset IO error counters before start of device replacing btrfs: Add qgroup tracing Btrfs: don't use src fd for printk btrfs: fallback to vmalloc in btrfs_compare_tree btrfs: handle non-fatal errors in btrfs_qgroup_inherit() btrfs: Output more info for enospc_debug mount option Btrfs: fix invalid reference in replace_path Btrfs: Improve FL_KEEP_SIZE handling in fallocate
2016-04-09Merge tag 'for-linus-4.6-ofs1' of ↵Linus Torvalds6-53/+28
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs fixes from Mike Marshall: "Orangefs cleanups and a strncpy vulnerability fix. Cleanups: - remove an unused variable from orangefs_readdir. - clean up printk wrapper used for ofs "gossip" debugging. - clean up truncate ctime and mtime setting in inode.c - remove a useless null check found by coccinelle. - optimize some memcpy/memset boilerplate code. - remove some useless sanity checks from xattr.c Fix: - fix a potential strncpy vulnerability" * tag 'for-linus-4.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: remove unused variable orangefs: Add KERN_<LEVEL> to gossip_<level> macros orangefs: strncpy -> strscpy orangefs: clean up truncate ctime and mtime setting Orangefs: fix ifnullfree.cocci warnings Orangefs: optimize boilerplate code. Orangefs: xattr.c cleanup
2016-04-08orangefs: remove unused variableMartin Brandenburg1-3/+1
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08orangefs: Add KERN_<LEVEL> to gossip_<level> macrosJoe Perches1-14/+17
Emit the logging messages at the appropriate levels. Miscellanea: o Change format to fmt o Use the more common ##__VA_ARGS__ Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08orangefs: strncpy -> strscpyMartin Brandenburg1-1/+5
It would have been possible for a rogue client-core to send in a symlink target which is not NUL terminated. This returns EIO if the client-core gives us corrupt data. Leave debugfs and superblock code as is for now. Other dcache.c and namei.c strncpy instances are safe because ORANGEFS_NAME_MAX = NAME_MAX + 1; there is always enough space for a name plus a NUL byte. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08orangefs: clean up truncate ctime and mtime settingMartin Brandenburg1-15/+1
The ctime and mtime are always updated on a successful ftruncate and only updated on a successful truncate where the size changed. We handle the ``if the size changed'' bit. This matches FUSE's behavior. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08Orangefs: fix ifnullfree.cocci warningskbuild test robot1-2/+1
fs/orangefs/orangefs-debugfs.c:130:2-26: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values. NULL check before some freeing functions is not needed. Based on checkpatch warning "kfree(NULL) is safe this check is probably not required" and kfreeaddr.cocci by Julia Lawall. Generated by: scripts/coccinelle/free/ifnullfree.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08Orangefs: optimize boilerplate code.Mike Marshall2-2/+2
Suggested by David Binderman <dcb314@hotmail.com> The former can potentially be a performance win over the latter. memcpy(d, s, len); memset(d+len, c, size-len); memset(d, c, size); memcpy(d, s, len); Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08Orangefs: xattr.c cleanupMike Marshall1-16/+1
1. It is nonsense to test for negative size_t, suggested by David Binderman <dcb314@hotmail.com> 2. By the time Orangefs gets called, the vfs has ensured that name != NULL, and that buffer and size are sane. Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds17-86/+244
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "These changes contains a fix for overlayfs interacting with some (badly behaved) dentry code in various file systems. These have been reviewed by Al and the respective file system mtinainers and are going through the ext4 tree for convenience. This also has a few ext4 encryption bug fixes that were discovered in Android testing (yes, we will need to get these sync'ed up with the fs/crypto code; I'll take care of that). It also has some bug fixes and a change to ignore the legacy quota options to allow for xfstests regression testing of ext4's internal quota feature and to be more consistent with how xfs handles this case" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: ignore quota mount options if the quota feature is enabled ext4 crypto: fix some error handling ext4: avoid calling dquot_get_next_id() if quota is not enabled ext4: retry block allocation for failed DIO and DAX writes ext4: add lockdep annotations for i_data_sem ext4: allow readdir()'s of large empty directories to be interrupted btrfs: fix crash/invalid memory access on fsync when using overlayfs ext4 crypto: use dget_parent() in ext4_d_revalidate() ext4: use file_dentry() ext4: use dget_parent() in ext4_file_open() nfs: use file_dentry() fs: add file_dentry() ext4 crypto: don't let data integrity writebacks fail with ENOMEM ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
2016-04-07Btrfs: fix file/data loss caused by fsync after rename and new inodeFilipe Manana1-0/+137
If we rename an inode A (be it a file or a directory), create a new inode B with the old name of inode A and under the same parent directory, fsync inode B and then power fail, at log tree replay time we end up removing inode A completely. If inode A is a directory then all its files are gone too. Example scenarios where this happens: This is reproducible with the following steps, taken from a couple of test cases written for fstests which are going to be submitted upstream soon: # Scenario 1 mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir -p /mnt/a/x echo "hello" > /mnt/a/x/foo echo "world" > /mnt/a/x/bar sync mv /mnt/a/x /mnt/a/y mkdir /mnt/a/x xfs_io -c fsync /mnt/a/x <power failure happens> The next time the fs is mounted, log tree replay happens and the directory "y" does not exist nor do the files "foo" and "bar" exist anywhere (neither in "y" nor in "x", nor the root nor anywhere). # Scenario 2 mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir /mnt/a echo "hello" > /mnt/a/foo sync mv /mnt/a/foo /mnt/a/bar echo "world" > /mnt/a/foo xfs_io -c fsync /mnt/a/foo <power failure happens> The next time the fs is mounted, log tree replay happens and the file "bar" does not exists anymore. A file with the name "foo" exists and it matches the second file we created. Another related problem that does not involve file/data loss is when a new inode is created with the name of a deleted snapshot and we fsync it: mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir /mnt/testdir btrfs subvolume snapshot /mnt /mnt/testdir/snap btrfs subvolume delete /mnt/testdir/snap rmdir /mnt/testdir mkdir /mnt/testdir xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir <power failure> The next time the fs is mounted the log replay procedure fails because it attempts to delete the snapshot entry (which has dir item key type of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry, resulting in the following error that causes mount to fail: [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257 [52174.512570] ------------[ cut here ]------------ [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]() [52174.514681] BTRFS: Transaction aborted (error -2) [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G W 4.5.0-rc6-btrfs-next-27+ #1 [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [52174.524053] 0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758 [52174.524053] 0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd [52174.524053] 00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88 [52174.524053] Call Trace: [52174.524053] [<ffffffff81264e93>] dump_stack+0x67/0x90 [52174.524053] [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2 [52174.524053] [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs] [52174.524053] [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50 [52174.524053] [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs] [52174.524053] [<ffffffff8118f5e9>] ? iput+0xb0/0x284 [52174.524053] [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs] [52174.524053] [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs] [52174.524053] [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs] [52174.524053] [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs] [52174.524053] [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs] [52174.524053] [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs] [52174.524053] [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs] [52174.524053] [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs] [52174.524053] [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs] [52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf [52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131 [52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde [52174.524053] [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs] [52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf [52174.524053] [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3 [52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131 [52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde [52174.524053] [<ffffffff8119590f>] do_mount+0x8a6/0x9e8 [52174.524053] [<ffffffff811358dd>] ? strndup_user+0x3f/0x59 [52174.524053] [<ffffffff81195c65>] SyS_mount+0x77/0x9f [52174.524053] [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]--- Fix this by forcing a transaction commit when such cases happen. This means we check in the commit root of the subvolume tree if there was any other inode with the same reference when the inode we are fsync'ing is a new inode (created in the current transaction). Test cases for fstests, covering all the scenarios given above, were submitted upstream for fstests: * fstests: generic test for fsync after renaming directory https://patchwork.kernel.org/patch/8694281/ * fstests: generic test for fsync after renaming file https://patchwork.kernel.org/patch/8694301/ * fstests: add btrfs test for fsync after snapshot deletion https://patchwork.kernel.org/patch/8670671/ Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-04-04Merge branch 'for_linus' of ↵Linus Torvalds2-4/+20
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota fixes from Jan Kara: "Fixes for oopses when the new quotactl gets used with quotas disabled" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ocfs2: Fix Q_GETNEXTQUOTA for filesystem without quotas quota: Handle Q_GETNEXTQUOTA when quota is disabled
2016-04-04Merge tag 'f2fs-for-linus' of ↵Linus Torvalds2-44/+72
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim. * tag 'f2fs-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: retrieve IO write stat from the right place f2fs crypto: fix corrupted symlink in encrypted case f2fs: cover large section in sanity check of super
2016-04-04Merge branch 'PAGE_CACHE_SIZE-removal'Linus Torvalds247-2183/+2173
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov: "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. Let's stop pretending that pages in page cache are special. They are not. The first patch with most changes has been done with coccinelle. The second is manual fixups on top. The third patch removes macros definition" [ I was planning to apply this just before rc2, but then I spaced out, so here it is right _after_ rc2 instead. As Kirill suggested as a possibility, I could have decided to only merge the first two patches, and leave the old interfaces for compatibility, but I'd rather get it all done and any out-of-tree modules and patches can trivially do the converstion while still also working with older kernels, so there is little reason to try to maintain the redundant legacy model. - Linus ] * PAGE_CACHE_SIZE-removal: mm: drop PAGE_CACHE_* and page_cache_{get,release} definition mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
2016-04-04mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usageKirill A. Shutemov34-89/+78
Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov243-2096/+2097
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04btrfs: Reset IO error counters before start of device replacingYauhen Kharuzhy1-0/+2
If device replace entry was found on disk at mounting and its num_write_errors stats counter has non-NULL value, then replace operation will never be finished and -EIO error will be reported by btrfs_scrub_dev() because this counter is never reset. # mount -o degraded /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/ # btrfs replace status /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/ Started on 25.Mar 07:28:00, canceled on 25.Mar 07:28:01 at 0.0%, 40 write errs, 0 uncorr. read errs # btrfs replace start -B 4 /dev/sdg /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/ ERROR: ioctl(DEV_REPLACE_START) failed on "/media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/": Input/output error, no error Reset num_write_errors and num_uncorrectable_read_errors counters in the dev_replace structure before start of replacing. Signed-off-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04btrfs: Add qgroup tracingMark Fasheh1-0/+9
This patch adds tracepoints to the qgroup code on both the reporting side (insert_dirty_extents) and the accounting side. Taken together it allows us to see what qgroup operations have happened, and what their result was. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04Btrfs: don't use src fd for printkJosef Bacik1-1/+1
The fd we pass in may not be on a btrfs file system, so don't try to do BTRFS_I() on it. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04btrfs: fallback to vmalloc in btrfs_compare_treeDavid Sterba1-4/+8
The allocation of node could fail if the memory is too fragmented for a given node size, practically observed with 64k. http://article.gmane.org/gmane.comp.file-systems.btrfs/54689 Reported-and-tested-by: Jean-Denis Girard <jd.girard@sysnux.pf> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04btrfs: handle non-fatal errors in btrfs_qgroup_inherit()Mark Fasheh1-22/+32
create_pending_snapshot() will go readonly on _any_ error return from btrfs_qgroup_inherit(). If qgroups are enabled, a user can crash their fs by just making a snapshot and asking it to inherit from an invalid qgroup. For example: $ btrfs sub snap -i 1/10 /btrfs/ /btrfs/foo Will cause a transaction abort. Fix this by only throwing errors in btrfs_qgroup_inherit() when we know going readonly is acceptable. The following xfstests test case reproduces this bug: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" here=`pwd` tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # remove previous $seqres.full before test rm -f $seqres.full # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch rm -f $seqres.full _scratch_mkfs _scratch_mount _run_btrfs_util_prog quota enable $SCRATCH_MNT # The qgroup '1/10' does not exist and should be silently ignored _run_btrfs_util_prog subvolume snapshot -i 1/10 $SCRATCH_MNT $SCRATCH_MNT/snap1 _scratch_unmount echo "Silence is golden" status=0 exit Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04btrfs: Output more info for enospc_debug mount optionQu Wenruo1-2/+19
As one user in mail list report reproducible balance ENOSPC error, it's better to add more debug info for enospc_debug mount option. Reported-by: Marc Haber <mh+linux-btrfs@zugschlus.de> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04Btrfs: fix invalid reference in replace_pathLiu Bo1-0/+1
Dan Carpenter's static checker has found this error, it's introduced by commit 64c043de466d ("Btrfs: fix up read_tree_block to return proper error") It's really supposed to 'break' the loop on error like others. Cc: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04Btrfs: Improve FL_KEEP_SIZE handling in fallocateDavide Italiano1-3/+6
- We call inode_size_ok() only if FL_KEEP_SIZE isn't specified. - As an optimisation we can skip the call if (off + len) isn't greater than the current size of the file. This operation is called under the lock so the less work we do, the better. - If we call inode_size_ok() pass to it the correct value rather than a more conservative estimation. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04ext4: ignore quota mount options if the quota feature is enabledTheodore Ts'o1-11/+11
Previously, ext4 would fail the mount if the file system had the quota feature enabled and quota mount options (used for the older quota setups) were present. This broke xfstests, since xfs silently ignores the usrquote and grpquota mount options if they are specified. This commit changes things so that we are consistent with xfs; having the mount options specified is harmless, so no sense break users by forbidding them. Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-03ext4 crypto: fix some error handlingDan Carpenter1-1/+1
We should be testing for -ENOMEM but the minus sign is missing. Fixes: c9af28fdd449 ('ext4 crypto: don't let data integrity writebacks fail with ENOMEM') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-02Merge branch 'for-linus-4.6' of ↵Linus Torvalds1-20/+25
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This has a few fixes Dave Sterba had queued up. These are all pretty small, but since they were tested I decided against waiting for more" * 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: transaction_kthread() is not freezable btrfs: cleaner_kthread() doesn't need explicit freeze btrfs: do not write corrupted metadata blocks to disk btrfs: csum_tree_block: return proper errno value
2016-04-02Merge tag 'for-linus' of git://github.com/martinbrandenburg/linuxLinus Torvalds2-6/+4
Pull OrangeFS fixes from Martin Brandenburg: "Two bugfixes for OrangeFS. One is a reference counting bug and the other is a typo in client minimum version" * tag 'for-linus' of git://github.com/martinbrandenburg/linux: orangefs: minimum userspace version is 2.9.3 orangefs: don't put readdir slot twice
2016-04-01ext4: avoid calling dquot_get_next_id() if quota is not enabledTheodore Ts'o1-1/+13
This should be fixed in the quota layer so we can test with the quota mutex held, but for now, we need this to avoid tests from crashing the kernel aborting the regression test suite. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-01ext4: retry block allocation for failed DIO and DAX writesJan Kara1-30/+28
Currently if block allocation for DIO or DAX write fails due to ENOSPC, we just returned it to userspace. However these ENOSPC errors can be transient because the transaction freeing blocks has not yet committed. This demonstrates as failures of generic/102 test when the filesystem is mounted with 'dax' mount option. Fix the problem by properly retrying the allocation in case of ENOSPC error in get blocks functions used for direct IO. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>