path: root/fs
Age | Commit message | Author | Files | Lines
2018-10-24Merge tag 'for-linus-4.20-ofs1' of ↵Linus Torvalds4-7/+15
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs updates from Mike Marshall: "Fixes and a cleanup. Fixes: - fix superfluous service_operation return code check in orangefs_lookup - fix some error code paths that missed kmem_cache_free - don't let orangefs_iget return NULL - don't let orangefs_new_inode return NULL - cache NULL when both default_acl and acl are NULL Cleanup: - rate limit the client not running info message" * tag 'for-linus-4.20-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: no need to check for service_operation returns > 0 orangefs: some error code paths missed kmem_cache_free orangefs: don't let orangefs_iget return NULL. orangefs: don't let orangefs_new_inode return NULL orangefs: rate limit the client not running info message orangefs: cache NULL when both default_acl and acl are NULL
2018-10-24Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2-3/+11
Pull vfs fixes from Al Viro. * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: gfs2_meta: ->mount() can get NULL dev_name ecryptfs_rename(): verify that lower dentries are still OK after lock_rename() cachefiles: fix the race between cachefiles_bury_object() and rmdir(2)
2018-10-24Merge tag 'jfs-for-4.20' of git://github.com/kleikamp/linux-shaggyLinus Torvalds3-2/+5
Pull jfs updates from David Kleikamp: "Just a few small fixes" * tag 'jfs-for-4.20' of git://github.com/kleikamp/linux-shaggy: jfs: remove redundant dquot_initialize() in jfs_evict_inode() jfs: remove quota option from ignore list jfs: cache NULL when both default_acl and acl are NULL
2018-10-24Merge tag 'for-4.20-part1-tag' of ↵Linus Torvalds39-738/+1239
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This is the first batch with fixes and some nice performance improvements. Preliminary results show eg. more files/sec in fsmark, better perf on multi-threaded workloads (filebench, dbench), fewer context switches and overall better memory allocation characteristics (multiple benchmarks). Apart from general performance, there's an improvement for qgroups + balance workload that's been troubling our users. Note for stable: there are 20+ patches tagged for stable, out of 90. Not all of them apply cleanly on all stable versions but the conflicts are mostly due to simple cleanups and resolving should be obvious. The fixes are otherwise independent. Performance improvements: - transition between blocking and spinning modes of path is gone, which originally resulted to more unnecessary wakeups and updates to the path locks, the effects are measurable and improve latency and scalability - qgroups: first batch of changes that should speedup balancing with qgroups on, skip quota accounting on unchanged subtrees, overall gain is about 30+% in runtime - use rb-tree with cached first node for several structures, small improvement to avoid pointer chasing Fixes: - trim - fix: some blockgroups could have been missed if their logical address was past the total filesystem size (ie. after a lot of balancing) - better error reporting, after processing blockgroups and whole device - fix: continue trimming block groups after an error is encountered - check for trim support of the device earlier and avoid some unnecessary work - less interaction with transaction commit that improves latency on slower storage (eg. 
image files over NFS) - fsync - fix warning when replaying log after fsync of an O_TMPFILE - fix wrong dentries after fsync of file that got its parent replaced - qgroups: fix rescan that might miss some dirty groups - don't clean dirty pages during buffered writes, this could lead to lost updates in some corner cases - some block groups could have been delayed in creation, if the allocation triggered another one - error handling improvements Cleanups: - removed unused struct members and variables - function return type cleanups - delayed refs code refactoring - protect against deadlock that could be caused by crafted image that tries to allocate from a tree that's locked already" * tag 'for-4.20-part1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (93 commits) btrfs: switch return_bigger to bool in find_ref_head btrfs: remove fs_info from btrfs_should_throttle_delayed_refs btrfs: remove fs_info from btrfs_check_space_for_delayed_refs btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lock btrfs: delayed-ref: pass delayed_refs directly to btrfs_select_ref_head btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branch btrfs: relocation: Remove redundant tree level check btrfs: relocation: Cleanup while loop using rbtree_postorder_for_each_entry_safe btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled Btrfs: fix wrong dentries after fsync of file that got its parent replaced Btrfs: fix warning when replaying log after fsync of a tmpfile btrfs: drop min_size from evict_refill_and_join btrfs: assert on non-empty delayed iputs btrfs: make sure we create all new block groups btrfs: reset max_extent_size on clear in a bitmap btrfs: protect space cache inode alloc with GFP_NOFS btrfs: release metadata before running delayed refs Btrfs: kill btrfs_clear_path_blocking btrfs: dev-replace: remove pointless assert in write unlock btrfs: dev-replace: move replace members out of fs_info ...
2018-10-24Merge branch 'work.tty-ioctl' of ↵Linus Torvalds1-169/+0
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull tty ioctl updates from Al Viro: "This is the compat_ioctl work related to tty ioctls. Quite a bit of dead code taken out, all tty-related stuff gone from fs/compat_ioctl.c. A bunch of compat bugs fixed - some still remain, but all more or less generic tty-related ioctls should be covered (remaining issues are in things like driver-private ioctls in a pcmcia serial card driver not getting properly handled in 32bit processes on 64bit host, etc)" * 'work.tty-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (53 commits) kill TIOCSERGSTRUCT change semantics of ldisc ->compat_ioctl() kill TIOCSER[SG]WILD synclink_gt(): fix compat_ioctl() pty: fix compat ioctls compat_ioctl - kill keyboard ioctl handling gigaset: add ->compat_ioctl() vt_compat_ioctl(): clean up, use compat_ptr() properly gigaset: don't try to printk userland buffer contents dgnc: don't bother with (empty) stub for TCXONC dgnc: leave TIOC[GS]SOFTCAR to ldisc remove fallback to drivers for TIOCGICOUNT dgnc: break-related ioctls won't reach ->ioctl() kill the rest of tty COMPAT_IOCTL() entries dgnc: TIOCM... won't reach ->ioctl() isdn_tty: TCSBRK{,P} won't reach ->ioctl() kill capinc_tty_ioctl() take compat TIOC[SG]SERIAL treatment into tty_compat_ioctl() synclink: reduce pointless checks in ->ioctl() complete ->[sg]et_serial() switchover ...
2018-10-24Merge tag 'pstore-v4.20-rc1' of ↵Linus Torvalds5-32/+88
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "pstore improvements: - refactor init to happen as early as possible again (Joel Fernandes) - improve resource reservation names" * tag 'pstore-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Clarify resource reservation labels pstore: Refactor compression initialization pstore: Allocate compression during late_initcall() pstore: Centralize init/exit routines
2018-10-24Merge branch 'siginfo-linus' of ↵Linus Torvalds5-9/+9
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo updates from Eric Biederman: "I have been slowly sorting out siginfo and this is the culmination of that work. The primary result is in several ways the signal infrastructure has been made less error prone. The code has been updated so that manually specifying SEND_SIG_FORCED is never necessary. The conversion to the new siginfo sending functions is now complete, which makes it difficult to send a signal without filling in the proper siginfo fields. At the tail end of the patchset comes the optimization of decreasing the size of struct siginfo in the kernel from 128 bytes to about 48 bytes on 64bit. The fundamental observation that enables this is by definition none of the known ways to use struct siginfo uses the extra bytes. This comes at the cost of a small user space observable difference. For the rare case of siginfo being injected into the kernel only what can be copied into kernel_siginfo is delivered to the destination, the rest of the bytes are set to 0. For cases where the signal and the si_code are known this is safe, because we know those bytes are not used. For cases where the signal and si_code combination is unknown the bits that won't fit into struct kernel_siginfo are tested to verify they are zero, and the send fails if they are not. I made an extensive search through userspace code and I could not find anything that would break because of the above change. If it turns out I did break something it will take just the revert of a single change to restore kernel_siginfo to the same size as userspace siginfo. Testing did reveal dependencies on preferring the signo passed to sigqueueinfo over si->signo, so bit the bullet and added the complexity necessary to handle that case. Testing also revealed bad things can happen if a negative signal number is passed into the system calls. Something no sane application will do but something a malicious program or a fuzzer might do. 
So I have fixed the code that performs the bounds checks to ensure negative signal numbers are handled" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits) signal: Guard against negative signal numbers in copy_siginfo_from_user32 signal: Guard against negative signal numbers in copy_siginfo_from_user signal: In sigqueueinfo prefer sig not si_signo signal: Use a smaller struct siginfo in the kernel signal: Distinguish between kernel_siginfo and siginfo signal: Introduce copy_siginfo_from_user and use it's return value signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE signal: Fail sigqueueinfo if si_signo != sig signal/sparc: Move EMT_TAGOVF into the generic siginfo.h signal/unicore32: Use force_sig_fault where appropriate signal/unicore32: Generate siginfo in ucs32_notify_die signal/unicore32: Use send_sig_fault where appropriate signal/arc: Use force_sig_fault where appropriate signal/arc: Push siginfo generation into unhandled_exception signal/ia64: Use force_sig_fault where appropriate signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn signal/ia64: Use the generic force_sigsegv in setup_frame signal/arm/kvm: Use send_sig_mceerr signal/arm: Use send_sig_fault where appropriate signal/arm: Use force_sig_fault where appropriate ...
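The truncation rule described in the siginfo message above can be modelled in a few lines. This is only an illustrative sketch, not kernel code: the function name `copy_siginfo_sketch` is hypothetical, and the kernel-side size of 48 bytes is taken from the message's "about 48 bytes on 64bit" and treated here as an assumption.

```python
USER_SIGINFO_SIZE = 128    # userspace struct siginfo size, per the message above
KERNEL_SIGINFO_SIZE = 48   # "about 48 bytes on 64bit"; exact value is an assumption


def copy_siginfo_sketch(user_bytes: bytes, layout_known: bool) -> bytes:
    """Return the bytes that would be kept in the smaller kernel-side struct.

    Hypothetical model: `layout_known` stands in for "the signal and si_code
    combination is known", in which case the extra bytes are simply dropped.
    """
    if len(user_bytes) != USER_SIGINFO_SIZE:
        raise ValueError("expected a full userspace siginfo")
    kept = user_bytes[:KERNEL_SIGINFO_SIZE]
    extra = user_bytes[KERNEL_SIGINFO_SIZE:]
    if not layout_known and any(extra):
        # Unknown signal/si_code pair: fail rather than silently lose data.
        raise ValueError("non-zero bytes beyond the kernel siginfo")
    return kept
```

The point of the design is that data is discarded silently only when it is known to be unused; otherwise the send fails visibly.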
2018-10-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2-54/+55
Pull networking updates from David Miller: 1) Add VF IPSEC offload support in ixgbe, from Shannon Nelson. 2) Add zero-copy AF_XDP support to i40e, from Björn Töpel. 3) All in-tree drivers are converted to {g,s}et_link_ksettings() so we can get rid of the {g,s}et_settings ethtool callbacks, from Michal Kubecek. 4) Add software timestamping to veth driver, from Michael Walle. 5) More work to make packet classifiers and actions lockless, from Vlad Buslov. 6) Support sticky FDB entries in bridge, from Nikolay Aleksandrov. 7) Add ipv6 version of IP_MULTICAST_ALL sockopt, from Andre Naujoks. 8) Support batching of XDP buffers in vhost_net, from Jason Wang. 9) Add flow dissector BPF hook, from Petar Penkov. 10) i40e vf --> generic iavf conversion, from Jesse Brandeburg. 11) Add NLA_REJECT netlink attribute policy type, to signal when users provide attributes in situations which don't make sense. From Johannes Berg. 12) Switch TCP and fair-queue scheduler over to earliest departure time model. From Eric Dumazet. 13) Improve guest receive performance by doing rx busy polling in tx path of vhost networking driver, from Tonghao Zhang. 14) Add per-cgroup local storage to bpf. 15) Add reference tracking to BPF, from Joe Stringer. The verifier can now make sure that references taken to objects are properly released by the program. 16) Support in-place encryption in TLS, from Vakul Garg. 17) Add new taprio packet scheduler, from Vinicius Costa Gomes. 18) Lots of selftests additions, too numerous to mention one by one here but all of which are very much appreciated. 19) Support offloading of eBPF programs containing BPF to BPF calls in nfp driver, from Quentin Monnet. 20) Move dpaa2_ptp driver out of staging, from Yangbo Lu. 21) Lots of u32 classifier cleanups and simplifications, from Al Viro. 22) Add new strict versions of netlink message parsers, and enable them for some situations. From David Ahern. 23) Evict neighbour entries on carrier down, also from David Ahern. 
24) Support BPF sk_msg verdict programs with kTLS, from Daniel Borkmann and John Fastabend. 25) Add support for filtering route dumps, from David Ahern. 26) New igc Intel driver for 2.5G parts, from Sasha Neftin et al. 27) Allow vxlan enslavement to bridges in mlxsw driver, from Ido Schimmel. 28) Add queue and stack map types to eBPF, from Mauricio Vasquez B. 29) Add back byte-queue-limit support to r8169, with all the bug fixes in other areas of the driver it works now! From Florian Westphal and Heiner Kallweit. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2147 commits) tcp: add tcp_reset_xmit_timer() helper qed: Fix static checker warning Revert "be2net: remove desc field from be_eq_obj" Revert "net: simplify sock_poll_wait" net: socionext: Reset tx queue in ndo_stop net: socionext: Add dummy PHY register read in phy_write() net: socionext: Stop PHY before resetting netsec net: stmmac: Set OWN bit for jumbo frames arm64: dts: stratix10: Support Ethernet Jumbo frame tls: Add maintainers net: ethernet: ti: cpsw: unsync mcast entries while switch promisc mode octeontx2-af: Support for NIXLF's UCAST/PROMISC/ALLMULTI modes octeontx2-af: Support for setting MAC address octeontx2-af: Support for changing RSS algorithm octeontx2-af: NIX Rx flowkey configuration for RSS octeontx2-af: Install ucast and bcast pkt forwarding rules octeontx2-af: Add LMAC channel info to NIXLF_ALLOC response octeontx2-af: NPC MCAM and LDATA extract minimal configuration octeontx2-af: Enable packet length and csum validation octeontx2-af: Support for VTAG strip and capture ...
2018-10-23Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds1-6/+28
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "Lots of changes in this cycle: - Lots of CPA (change page attribute) optimizations and related cleanups (Thomas Gleixner, Peter Zijlstra) - Make lazy TLB mode even lazier (Rik van Riel) - Fault handler cleanups and improvements (Dave Hansen) - kdump, vmcore: Enable kdumping encrypted memory with AMD SME enabled (Lianbo Jiang) - Clean up VM layout documentation (Baoquan He, Ingo Molnar) - ... plus misc other fixes and enhancements" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry() x86/mm: Kill stray kernel fault handling comment x86/mm: Do not warn about PCI BIOS W+X mappings resource: Clean it up a bit resource: Fix find_next_iomem_res() iteration issue resource: Include resource end in walk_*() interfaces x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error x86/mm: Remove spurious fault pkey check x86/mm/vsyscall: Consider vsyscall page part of user address space x86/mm: Add vsyscall address helper x86/mm: Fix exception table comments x86/mm: Add clarifying comments for user addr space x86/mm: Break out user address space handling x86/mm: Break out kernel address space handling x86/mm: Clarify hardware vs. software "error_code" x86/mm/tlb: Make lazy TLB mode lazier x86/mm/tlb: Add freed_tables element to flush_tlb_info x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range smp,cpumask: introduce on_each_cpu_cond_mask smp: use __cpumask_set_cpu in on_each_cpu_cond ...
2018-10-23Merge branch 'locking-core-for-linus' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and misc x86 updates from Ingo Molnar: "Lots of changes in this cycle - in part because locking/core attracted a number of related x86 low level work which was easier to handle in a single tree: - Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E. McKenney, Andrea Parri) - lockdep scalability improvements and micro-optimizations (Waiman Long) - rwsem improvements (Waiman Long) - spinlock micro-optimization (Matthew Wilcox) - qspinlocks: Provide a liveness guarantee (more fairness) on x86. (Peter Zijlstra) - Add support for relative references in jump tables on arm64, x86 and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens) - Be a lot less permissive on weird (kernel address) uaccess faults on x86: BUG() when uaccess helpers fault on kernel addresses (Jann Horn) - macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav Amit) - ... and a handful of other smaller changes as well" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits) locking/lockdep: Make global debug_locks* variables read-mostly locking/lockdep: Fix debug_locks off performance problem locking/pvqspinlock: Extend node size when pvqspinlock is configured locking/qspinlock_stat: Count instances of nested lock slowpaths locking/qspinlock, x86: Provide liveness guarantee x86/asm: 'Simplify' GEN_*_RMWcc() macros locking/qspinlock: Rework some comments locking/qspinlock: Re-order code locking/lockdep: Remove duplicated 'lock_class_ops' percpu array x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y futex: Replace spin_is_locked() with lockdep locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs x86/extable: Macrofy inline assembly code to work around GCC inlining bugs 
x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs x86/refcount: Work around GCC inlining bug x86/objtool: Use asm macros to work around GCC inlining bugs ...
2018-10-23Merge tag 'mtd/for-4.20' of git://git.infradead.org/linux-mtdLinus Torvalds1-3/+1
Pull mtd updates from Boris Brezillon: "SPI NOR core changes: - Support non-uniform erase size - Support controllers with limited TX fifo size Driver changes: - m25p80: Re-issue a WREN command after each write access - cadence: Pass a proper dir value to dma_[un]map_single() - fsl-qspi: Check fsl_qspi_get_seqid() return val make sure 4B addressing opcodes are properly handled - intel-spi: Add a new PCI entry for Ice Lake Raw NAND core changes: - Two batches of cleanups of the NAND API, including: * Deprecating a lot of interfaces (now replaced by ->exec_op()). * Moving code in separate drivers (JEDEC, ONFI), in private files (internals), in platform drivers, etc. * Functions/structures reordering. * Exclusive use of the nand_chip structure instead of the MTD one all across the subsystem. - Addition of the nand_wait_readrdy/rdy_op() helpers. Raw NAND controllers drivers changes: - Various coccinelle patches. - Marvell: * Use regmap_update_bits() for syscon access. * More documentation. * BCH failure path rework. * More layouts to be supported. * IRQ handler complete() condition fixed. - Fsl_ifc: * SRAM initialization fixed for newer controller versions. - Denali: * Fix licenses mismatch and use a SPDX tag. * Set SPARE_AREA_SKIP_BYTES register to 8 if unset. - Qualcomm: * Do not include dma-direct.h. - Docg4: * Removed. - Ams-delta: * Use of a GPIO lookup table * Internal machinery changes. Raw NAND chip drivers changes: - Toshiba: * Add support for Toshiba memory BENAND * Pass a single nand_chip object to the status helper. - ESMT: * New driver to retrieve the ECC requirements from the 5th ID byte. 
MTD changes: - physmap cleanups/fixes - gpio-addr-flash cleanups/fixes" * tag 'mtd/for-4.20' of git://git.infradead.org/linux-mtd: (93 commits) jffs2: free jffs2_sb_info through jffs2_kill_sb() mtd: spi-nor: fsl-quadspi: fix read error for flash size larger than 16MB mtd: spi-nor: intel-spi: Add support for Intel Ice Lake SPI serial flash mtd: maps: gpio-addr-flash: Convert to gpiod mtd: maps: gpio-addr-flash: Replace array with an integer mtd: maps: gpio-addr-flash: Use order instead of size mtd: spi-nor: fsl-quadspi: Don't let -EINVAL on the bus mtd: devices: m25p80: Make sure WRITE_EN is issued before each write mtd: spi-nor: Support controllers with limited TX FIFO size mtd: spi-nor: cadence-quadspi: Use proper enum for dma_[un]map_single mtd: spi-nor: parse SFDP Sector Map Parameter Table mtd: spi-nor: add support to non-uniform SFDP SPI NOR flash memories mtd: rawnand: marvell: fix the IRQ handler complete() condition mtd: rawnand: denali: set SPARE_AREA_SKIP_BYTES register to 8 if unset mtd: rawnand: r852: fix spelling mistake "card_registred" -> "card_registered" mtd: rawnand: toshiba: Pass a single nand_chip object to the status helper mtd: maps: gpio-addr-flash: Use devm_* functions mtd: maps: gpio-addr-flash: Fix ioremapped size mtd: maps: gpio-addr-flash: Replace custom printk mtd: physmap_of: Release resources on error ...
2018-10-22Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-blockLinus Torvalds2-6/+6
Pull block layer updates from Jens Axboe: "This is the main pull request for block changes for 4.20. This contains: - Series enabling runtime PM for blk-mq (Bart). - Two pull requests from Christoph for NVMe, with items such as; - Better AEN tracking - Multipath improvements - RDMA fixes - Rework of FC for target removal - Fixes for issues identified by static checkers - Fabric cleanups, as prep for TCP transport - Various cleanups and bug fixes - Block merging cleanups (Christoph) - Conversion of drivers to generic DMA mapping API (Christoph) - Series fixing ref count issues with blkcg (Dennis) - Series improving BFQ heuristics (Paolo, et al) - Series improving heuristics for the Kyber IO scheduler (Omar) - Removal of dangerous bio_rewind_iter() API (Ming) - Apply single queue IPI redirection logic to blk-mq (Ming) - Set of fixes and improvements for bcache (Coly et al) - Series closing a hotplug race with sysfs group attributes (Hannes) - Set of patches for lightnvm: - pblk trace support (Hans) - SPDX license header update (Javier) - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0 specs behind a common core interface. (Javier, Matias) - Enable pblk to use a common interface to retrieve chunk metadata (Matias) - Bug fixes (Various) - Set of fixes and updates to the blk IO latency target (Josef) - blk-mq queue number updates fixes (Jianchao) - Convert a bunch of drivers from the old legacy IO interface to blk-mq. This will conclude with the removal of the legacy IO interface itself in 4.21, with the rest of the drivers (me, Omar) - Removal of the DAC960 driver. 
The SCSI tree will introduce two replacement drivers for this (Hannes)" * tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits) block: setup bounce bio_sets properly blkcg: reassociate bios when make_request() is called recursively blkcg: fix edge case for blk_get_rl() under memory pressure nvme-fabrics: move controller options matching to fabrics nvme-rdma: always have a valid trsvcid mtip32xx: fully switch to the generic DMA API rsxx: switch to the generic DMA API umem: switch to the generic DMA API sx8: switch to the generic DMA API sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code skd: switch to the generic DMA API ubd: remove use of blk_rq_map_sg nvme-pci: remove duplicate check drivers/block: Remove DAC960 driver nvme-pci: fix hot removal during error handling nvmet-fcloop: suppress a compiler warning nvme-core: make implicit seed truncation explicit nvmet-fc: fix kernel-doc headers nvme-fc: rework the request initialization code nvme-fc: introduce struct nvme_fcp_op_w_sgl ...
2018-10-22pstore/ram: Clarify resource reservation labelsKees Cook2-7/+20
When ramoops reserved a memory region in the kernel, it had an unhelpful label of "persistent_memory". When reading /proc/iomem, it would be repeated many times, did not hint that it was ramoops in particular, and didn't clarify very much about what each was used for:

  400000000-407ffffff : Persistent Memory (legacy)
    400000000-400000fff : persistent_memory
    400001000-400001fff : persistent_memory
    ...
    4000ff000-4000fffff : persistent_memory

Instead, this adds meaningful labels for how the various regions are being used:

  400000000-407ffffff : Persistent Memory (legacy)
    400000000-400000fff : ramoops:dump(0/252)
    400001000-400001fff : ramoops:dump(1/252)
    ...
    4000fc000-4000fcfff : ramoops:dump(252/252)
    4000fd000-4000fdfff : ramoops:console
    4000fe000-4000fe3ff : ramoops:ftrace(0/3)
    4000fe400-4000fe7ff : ramoops:ftrace(1/3)
    4000fe800-4000febff : ramoops:ftrace(2/3)
    4000fec00-4000fefff : ramoops:ftrace(3/3)
    4000ff000-4000fffff : ramoops:pmsg

Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> Tested-by: Guenter Roeck <groeck@chromium.org>
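With the new labels, a script can tell the ramoops regions apart. The following is a minimal sketch that parses text in the `/proc/iomem` layout quoted in the message above and reports each ramoops region with its size. The helper names and the `SAMPLE` text are illustrative; the sample is copied from the commit message, not read from a live system.

```python
import re

# One "/proc/iomem"-style line: "<start>-<end> : <label>", addresses in hex.
LINE_RE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+) : (.+?)\s*$")


def parse_iomem_line(line):
    """Return (start, end, label), or None if the line is not a region."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16), m.group(3)


def ramoops_regions(text):
    """List (label, size_in_bytes) for every region labelled 'ramoops:...'."""
    regions = []
    for line in text.splitlines():
        parsed = parse_iomem_line(line)
        if parsed and parsed[2].startswith("ramoops:"):
            start, end, label = parsed
            regions.append((label, end - start + 1))  # end address is inclusive
    return regions


SAMPLE = """\
400000000-407ffffff : Persistent Memory (legacy)
  400000000-400000fff : ramoops:dump(0/252)
  4000fd000-4000fdfff : ramoops:console
  4000fe000-4000fe3ff : ramoops:ftrace(0/3)
  4000ff000-4000fffff : ramoops:pmsg
"""

for label, size in ramoops_regions(SAMPLE):
    print(f"{label}: {size} bytes")
```

On a real system the text would come from reading `/proc/iomem`, where unprivileged readers may see zeroed addresses.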
2018-10-22pstore: Refactor compression initializationKees Cook1-15/+33
This refactors compression initialization slightly to better handle getting potentially called twice (via early pstore_register() calls and later pstore_init()) and improves the comments and reporting to be more verbose. Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Guenter Roeck <groeck@chromium.org>
2018-10-22pstore: Allocate compression during late_initcall()Joel Fernandes (Google)2-2/+10
ramoops's call of pstore_register() was recently moved to run during late_initcall() because the crypto backend may not have been ready during postcore_initcall(). This meant early-boot crash dumps were not getting caught by pstore any more. Instead, let's allow calls to pstore_register() earlier, and once crypto is ready we can initialize the compression. Reported-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> Fixes: cb3bee0369bc ("pstore: Use crypto compress API") [kees: trivial rebase] Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Guenter Roeck <groeck@chromium.org>
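The ordering change can be pictured as a deferred-initialization pattern: registration no longer depends on compression, and compression is attached later once it is available. The sketch below is a loose model only. The class `PstoreSketch` and its methods are hypothetical, `zlib` stands in for the crypto compression backend, and storing early records uncompressed is an assumption of this model rather than something the message states.

```python
import zlib


class PstoreSketch:
    """Toy model: backend registration is decoupled from compression setup."""

    def __init__(self):
        self.backend = None
        self.compress = None  # stays None until late_init() runs

    def register(self, backend):
        # May be called early: nothing here needs the compressor.
        self.backend = backend

    def late_init(self):
        # Runs once the compressor is available; safe to call repeatedly.
        if self.compress is None:
            self.compress = zlib.compress

    def write(self, record: bytes):
        # Assumption of this sketch: records written before late_init()
        # are stored uncompressed instead of being lost.
        payload = self.compress(record) if self.compress else record
        self.backend.append(payload)


store = []
p = PstoreSketch()
p.register(store)        # early registration
p.write(b"early crash")  # captured even though compression is not ready
p.late_init()
p.write(b"x" * 100)      # compressed from here on
```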
2018-10-22pstore: Centralize init/exit routinesKees Cook3-11/+28
In preparation for having additional actions during init/exit, this moves the init/exit into platform.c, centralizing the logic to make call outs to the fs init/exit. Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Guenter Roeck <groeck@chromium.org>
2018-10-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller11-39/+29
net/sched/cls_api.c has overlapping changes to a call to nlmsg_parse(): one (from 'net') passed rtm_tca_policy instead of NULL as the 5th argument, and another (from 'net-next') passed cb->extack instead of NULL as the 6th argument. net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to code which moved (to mr_table_dump) in 'net-next'. Thanks to David Ahern for the heads up. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-19Merge tag 'nand/for-4.20' of git://git.infradead.org/linux-mtd into mtd/nextBoris Brezillon5-84/+37
NAND core changes: - Two batches of cleanups of the NAND API, including: * Deprecating a lot of interfaces (now replaced by ->exec_op()). * Moving code in separate drivers (JEDEC, ONFI), in private files (internals), in platform drivers, etc. * Functions/structures reordering. * Exclusive use of the nand_chip structure instead of the MTD one all across the subsystem. - Addition of the nand_wait_readrdy/rdy_op() helpers. Raw NAND controllers drivers changes: - Various coccinelle patches. - Marvell: * Use regmap_update_bits() for syscon access. * More documentation. * BCH failure path rework. * More layouts to be supported. * IRQ handler complete() condition fixed. - Fsl_ifc: * SRAM initialization fixed for newer controller versions. - Denali: * Fix licenses mismatch and use a SPDX tag. * Set SPARE_AREA_SKIP_BYTES register to 8 if unset. - Qualcomm: * Do not include dma-direct.h. - Docg4: * Removed. - Ams-delta: * Use of a GPIO lookup table * Internal machinery changes. Raw NAND chip drivers changes: - Toshiba: * Add support for Toshiba memory BENAND * Pass a single nand_chip object to the status helper. - ESMT: * New driver to retrieve the ECC requirements from the 5th ID byte.
2018-10-18orangefs: no need to check for service_operation returns > 0Mike Marshall1-1/+1
A service_operation return value > 0 is undefined, so there is no need to check for it. Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-18orangefs: some error code paths missed kmem_cache_freeMike Marshall1-3/+3
If a slab cache object is allocated, it needs to be freed eventually, certainly before anyone unloads the module that allocated it. Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-18orangefs: don't let orangefs_iget return NULL.Mike Marshall1-1/+5
Suggested by Dan Carpenter. Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-18orangefs: don't let orangefs_new_inode return NULLMike Marshall1-1/+1
Suggested by Dan Carpenter Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-18fscache: Fix out of bound read in long cookie keysEric Sandeen1-3/+7
fscache_set_key() can incur an out-of-bounds read, reported by KASAN:

  BUG: KASAN: slab-out-of-bounds in fscache_alloc_cookie+0x5b3/0x680 [fscache]
  Read of size 4 at addr ffff88084ff056d4 by task mount.nfs/32615

and also reported by syzbot at https://lkml.org/lkml/2018/7/8/236

  BUG: KASAN: slab-out-of-bounds in fscache_set_key fs/fscache/cookie.c:120 [inline]
  BUG: KASAN: slab-out-of-bounds in fscache_alloc_cookie+0x7a9/0x880 fs/fscache/cookie.c:171
  Read of size 4 at addr ffff8801d3cc8bb4 by task syz-executor907/4466

This happens for any index_key_len which is not divisible by 4 and is larger than the size of the inline key, because the code allocates exactly index_key_len for the key buffer, but the hashing loop is stepping through it 4 bytes (u32) at a time in the buf[] array. Fix this by calculating how many u32 buffers we'll need by using DIV_ROUND_UP, and then using kcalloc() to allocate a precleared allocation buffer to hold the index_key, then using that same count as the hashing index limit. Fixes: ec0328e46d6e ("fscache: Maintain a catalogue of allocated cookies") Reported-by: syzbot+a95b989b2dde8e806af8@syzkaller.appspotmail.com Signed-off-by: Eric Sandeen <sandeen@redhat.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
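The arithmetic behind the fix can be checked with a small model. This is a sketch rather than the kernel code: `div_round_up` mirrors what the kernel's DIV_ROUND_UP macro computes, the zero padding plays the role of the precleared kcalloc() buffer, and the little-endian byte order is an assumption made so the output is deterministic.

```python
import struct


def div_round_up(n: int, d: int) -> int:
    """Integer division rounding up, as DIV_ROUND_UP does in the kernel."""
    return (n + d - 1) // d


def key_words(index_key: bytes) -> list:
    """Split a key into the u32 words a 4-bytes-at-a-time hash would read."""
    bufs = div_round_up(len(index_key), 4)
    # Zero-pad to a whole number of u32 words, like a precleared allocation.
    padded = index_key + bytes(bufs * 4 - len(index_key))
    return list(struct.unpack(f"<{bufs}I", padded))


key = bytes(range(1, 19))  # 18 bytes: above the 16-byte inline key, not a multiple of 4
print(div_round_up(len(key), 4))  # number of u32 words the loop visits
print(key_words(key))

# With an exactly-sized buffer, the final 4-byte read does not fit:
try:
    struct.unpack("<5I", key)
except struct.error as exc:
    print("unpadded read fails:", exc)
```

In Python the short read raises an exception; in C the same read silently runs past the end of the allocation, which is what KASAN flagged.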
2018-10-18fscache: Fix incomplete initialisation of inline key spaceDavid Howells3-23/+5
The inline key in struct fscache_cookie is insufficiently initialized, zeroing only 3 of the 4 slots; therefore an index_key_len between 13 and 15 bytes will end up hashing uninitialized memory because the memcpy only partially fills the last buf[] element. Fix this by clearing fscache_cookie objects on allocation rather than using the slab constructor to initialise them. We're going to pretty much fill in the entire struct anyway, so bringing it into our dcache writably shouldn't incur much overhead. This removes the need to do clearance in fscache_set_key() (where we aren't doing it correctly anyway). Also, we don't need to set cookie->key_len in fscache_set_key() as we already did it in the only caller, so remove that. Fixes: ec0328e46d6e ("fscache: Maintain a catalogue of allocated cookies") Reported-by: syzbot+a95b989b2dde8e806af8@syzkaller.appspotmail.com Reported-by: Eric Sandeen <sandeen@redhat.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
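Why only key lengths of 13 to 15 bytes are affected can be seen in a toy model of the 16-byte inline key. The function `fill_inline_key` and the `stale` parameter (the byte value standing in for whatever was previously in memory) are hypothetical and exist only for this illustration.

```python
INLINE_KEY_SIZE = 16  # 4 slots of 4 bytes each, per the message above


def fill_inline_key(key: bytes, stale: int, zeroed_slots: int) -> bytes:
    """Model clearing `zeroed_slots` slots, then copying the key in.

    Returns the bytes a u32-at-a-time hash would read for this key length.
    """
    if len(key) > INLINE_KEY_SIZE:
        raise ValueError("key does not fit in the inline buffer")
    buf = bytearray([stale] * INLINE_KEY_SIZE)          # leftover memory contents
    buf[: zeroed_slots * 4] = bytes(zeroed_slots * 4)   # partial or full clearing
    buf[: len(key)] = key                               # the memcpy of the key
    used = (len(key) + 3) // 4 * 4                      # rounded up to whole slots
    return bytes(buf[:used])


key = b"k" * 14
# Only 3 slots cleared: bytes 14 and 15 still hold whatever was there before.
print(fill_inline_key(key, 0xAA, 3) == fill_inline_key(key, 0x55, 3))
# All 4 slots cleared: the hashed bytes no longer depend on stale memory.
print(fill_inline_key(key, 0xAA, 4) == fill_inline_key(key, 0x55, 4))
```

Keys of 12 bytes or fewer never touch the fourth slot, and a 16-byte key overwrites it completely, which is why only lengths 13 to 15 end up hashing stale bytes.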
2018-10-18cachefiles: fix the race between cachefiles_bury_object() and rmdir(2)Al Viro1-1/+1
The victim might've been rmdir'ed just before the lock_rename(); unlike the normal callers, we do not look the source up after the parents are locked - we know it beforehand and just recheck that it's still the child of what used to be its parent. Unfortunately, the check is too weak - we don't spot a dead directory since its ->d_parent is unchanged, dentry is positive, etc. So we sail all the way to ->rename(), with hosting filesystems _not_ expecting to be asked to rename an rmdir'ed subdirectory. The fix is easy, fortunately - the lock on parent is sufficient for making IS_DEADDIR() on child safe. Cc: stable@vger.kernel.org Fixes: 9ae326a69004 (CacheFiles: A cache that backs onto a mounted filesystem) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-16jffs2: free jffs2_sb_info through jffs2_kill_sb()Hou Tao1-3/+1
When an invalid mount option is passed to jffs2, jffs2_parse_options() will fail and jffs2_sb_info will be freed, but then jffs2_sb_info will be used (use-after-free) and freed (double-free) in jffs2_kill_sb(). Fix it by removing the buggy invocation of kfree() when getting invalid mount options. Fixes: 92abc475d8de ("jffs2: implement mount option parsing and compression overriding") Cc: stable@kernel.org Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Richard Weinberger <richard@nod.at> Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
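The ownership rule behind this fix can be sketched in userspace C: the setup path never frees the private info on error, and the teardown path is the only place that does. All `demo_*` names are hypothetical and the option parser is a toy that only rejects the literal string "bad".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct demo_sb_info {
	int compr;
};

struct demo_sb {
	struct demo_sb_info *info;	/* owned by the superblock */
	int frees;			/* counts how often info was freed */
};

/* Hypothetical parser: rejects the literal option "bad". */
static int demo_parse_options(const char *opts)
{
	assert(opts != NULL);
	return strcmp(opts, "bad") == 0 ? -1 : 0;
}

static int demo_fill_super(struct demo_sb *sb, const char *opts)
{
	sb->info = calloc(1, sizeof(*sb->info));
	if (!sb->info)
		return -1;
	/* On failure, do NOT free sb->info here: demo_kill_sb() owns that. */
	return demo_parse_options(opts);
}

/* Single place where the private info is released. */
static void demo_kill_sb(struct demo_sb *sb)
{
	if (sb->info) {
		free(sb->info);
		sb->info = NULL;
		sb->frees++;
	}
}
```

After a failed `demo_fill_super()`, the info pointer is still valid when `demo_kill_sb()` runs, and it is freed exactly once.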
2018-10-15btrfs: switch return_bigger to bool in find_ref_headLu Fengqi1-5/+6
Using bool is more suitable than int here. Also add a comment about return_bigger. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: remove fs_info from btrfs_should_throttle_delayed_refsLu Fengqi4-9/+6
The avg_delayed_ref_runtime can be referenced from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: remove fs_info from btrfs_check_space_for_delayed_refsLu Fengqi4-7/+6
It can be referenced from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lockLu Fengqi3-6/+3
Since trans is only used for referring to delayed_refs, there is no need to pass it instead of delayed_refs to btrfs_delayed_ref_lock(). No functional change. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: delayed-ref: pass delayed_refs directly to btrfs_select_ref_headLu Fengqi3-8/+5
Since trans is only used for referring to delayed_refs, there is no need to pass it instead of delayed_refs to btrfs_select_ref_head(). No functional change. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branchLu Fengqi1-6/+7
There is no reason to put this check in (!qgroup)'s else branch because if qgroup is null, it will goto out directly. So move it out to reduce indentation level. No functional change. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: relocation: Remove redundant tree level checkQu Wenruo1-1/+0
Commit 581c1760415c ("btrfs: Validate child tree block's level and first key") has made tree block level check mandatory. So if tree block level doesn't match, we won't get a valid extent buffer. The extra WARN_ON() check can be removed completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: relocation: Cleanup while loop using rbtree_postorder_for_each_entry_safeQu Wenruo1-15/+8
And add one line comment explaining what we're doing for each loop. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabledQu Wenruo2-0/+6
Some qgroup trace events like btrfs_qgroup_release_data() and btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is not enabled. This is caused by the lack of qgroup status check before calling some qgroup functions. Thankfully the functions can handle quota disabled case well and just do nothing for qgroup disabled case. This patch will do earlier check before triggering related trace events. And for enabled <-> disabled race case: 1) For enabled->disabled case Disable will wipe out all qgroups data including reservation and excl/rfer. Even if we leak some reservation or numbers, it will still be cleared, so nothing will go wrong. 2) For disabled -> enabled case Current btrfs_qgroup_release_data() will use extent_io tree to ensure we won't underflow reservation. And for delayed_ref we use head->qgroup_reserved to record the reserved space, so in that case head->qgroup_reserved should be 0 and we won't underflow. CC: stable@vger.kernel.org # 4.14+ Reported-by: Chris Murphy <lists@colorremedies.com> Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15Btrfs: fix wrong dentries after fsync of file that got its parent replacedFilipe Manana1-3/+27
In a scenario like the following: mkdir /mnt/A # inode 258 mkdir /mnt/B # inode 259 touch /mnt/B/bar # inode 260 sync mv /mnt/B/bar /mnt/A/bar mv -T /mnt/A /mnt/B fsync /mnt/B/bar <power fail> After replaying the log we end up with file bar having 2 hard links, both with the name 'bar' and one in the directory with inode number 258 and the other in the directory with inode number 259. Also, we end up with the directory inode 259 still existing and with the directory inode 258 still named as 'A', instead of 'B'. In this scenario, file 'bar' should only have one hard link, located at directory inode 258, the directory inode 259 should not exist anymore and the name for directory inode 258 should be 'B'. This incorrect behaviour happens because when attempting to log the old parents of an inode, we skip any parents that no longer exist. Fix this by forcing a full commit if an old parent no longer exists. A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15Btrfs: fix warning when replaying log after fsync of a tmpfileFilipe Manana1-10/+32
When replaying a log which contains a tmpfile (which necessarily has a link count of 0) we end up calling inc_nlink(), at fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like the following: [195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40 [195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1 [195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [195191.943726] RIP: 0010:inc_nlink+0x33/0x40 [195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246 [195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006 [195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0 [195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000 [195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60 [195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8 [195191.943735] FS: 00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000 [195191.943736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0 [195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [195191.943741] Call Trace: [195191.943778] replay_one_buffer+0x797/0x7d0 [btrfs] [195191.943802] walk_up_log_tree+0x1c1/0x250 [btrfs] [195191.943809] ? rcu_read_lock_sched_held+0x3f/0x70 [195191.943825] walk_log_tree+0xae/0x1d0 [btrfs] [195191.943840] btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs] [195191.943856] ? replay_dir_deletes+0x280/0x280 [btrfs] [195191.943870] open_ctree+0x1c3b/0x22a0 [btrfs] [195191.943887] btrfs_mount_root+0x6b4/0x800 [btrfs] [195191.943894] ? rcu_read_lock_sched_held+0x3f/0x70 [195191.943899] ? pcpu_alloc+0x55b/0x7c0 [195191.943906] ? 
mount_fs+0x3b/0x140 [195191.943908] mount_fs+0x3b/0x140 [195191.943912] ? __init_waitqueue_head+0x36/0x50 [195191.943916] vfs_kern_mount+0x62/0x160 [195191.943927] btrfs_mount+0x134/0x890 [btrfs] [195191.943936] ? rcu_read_lock_sched_held+0x3f/0x70 [195191.943938] ? pcpu_alloc+0x55b/0x7c0 [195191.943943] ? mount_fs+0x3b/0x140 [195191.943952] ? btrfs_remount+0x570/0x570 [btrfs] [195191.943954] mount_fs+0x3b/0x140 [195191.943956] ? __init_waitqueue_head+0x36/0x50 [195191.943960] vfs_kern_mount+0x62/0x160 [195191.943963] do_mount+0x1f9/0xd40 [195191.943967] ? memdup_user+0x4b/0x70 [195191.943971] ksys_mount+0x7e/0xd0 [195191.943974] __x64_sys_mount+0x21/0x30 [195191.943977] do_syscall_64+0x60/0x1b0 [195191.943980] entry_SYSCALL_64_after_hwframe+0x49/0xbe [195191.943983] RIP: 0033:0x7f570e4e524a [195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 [195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a [195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260 [195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020 [195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260 [195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff [195191.944002] irq event stamp: 8688 [195191.944010] hardirqs last enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640 [195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c [195191.944018] softirqs last enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150 [195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60 [195191.944022] ---[ end trace 5d6e873a9a0b811a ]--- This happens because the inode does not have the flag I_LINKABLE set, which is a runtime only flag, not meant to be persisted, set when the inode is created through open(2) if the flag O_EXCL is 
not passed to it. Except for the warning, there are no other consequences (like corruptions or metadata inconsistencies). Since it's pointless to replay a tmpfile as it would be deleted in a later phase of the log replay procedure (it has a link count of 0), fix this by not logging tmpfiles and if a tmpfile is found in a log (created by a kernel without this change), skip the replay of the inode. A test case for fstests follows soon. Fixes: 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay") CC: stable@vger.kernel.org # 4.18+ Reported-by: Martin Steigerwald <martin@lichtvoll.de> Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: drop min_size from evict_refill_and_joinJosef Bacik1-10/+6
We don't need it, rsv->size is set once and never changes throughout its lifetime, so just use that for the reserve size. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: assert on non-empty delayed iputsJosef Bacik1-0/+1
I ran into an issue where there was some reference being held on an inode that I couldn't track. This assert wasn't triggered, but it at least rules out we're doing something stupid. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: make sure we create all new block groupsJosef Bacik1-2/+5
Allocating new chunks modifies both the extent and chunk tree, which can trigger new chunk allocations. So instead of doing list_for_each_safe, just do while (!list_empty()) so we make sure we don't exit with other pending bg's still on our list. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
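The loop shape described above can be sketched with a toy queue. Everything here is hypothetical (`demo_bg`, `demo_list`, the `spawns` field): processing an entry may append new entries, which is why the loop tests for emptiness on every pass instead of iterating over the list as it was when the loop started.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pending block group; "spawns" models follow-up allocations. */
struct demo_bg {
	int id;
	int spawns;
	struct demo_bg *next;
};

struct demo_list {
	struct demo_bg *head, *tail;
};

static int demo_list_empty(const struct demo_list *l)
{
	return l->head == NULL;
}

static int demo_list_add_tail(struct demo_list *l, int id, int spawns)
{
	struct demo_bg *bg = calloc(1, sizeof(*bg));

	if (!bg)
		return -1;
	bg->id = id;
	bg->spawns = spawns;
	if (l->tail)
		l->tail->next = bg;
	else
		l->head = bg;
	l->tail = bg;
	return 0;
}

/* Create every pending entry, including ones queued while we work. */
static int demo_create_pending(struct demo_list *pending)
{
	int created = 0;

	while (!demo_list_empty(pending)) {
		struct demo_bg *bg = pending->head;
		int i, err = 0;

		pending->head = bg->next;
		if (!pending->head)
			pending->tail = NULL;

		/* "Creating" this entry may queue further entries. */
		for (i = 0; i < bg->spawns && !err; i++)
			err = demo_list_add_tail(pending, bg->id * 10 + i, 0);
		free(bg);
		if (err)
			return -1;
		created++;
	}
	return created;
}
```

Starting with two entries where the first spawns two more, the loop creates four entries and leaves the list empty.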
2018-10-15btrfs: reset max_extent_size on clear in a bitmapJosef Bacik1-0/+2
We need to clear the max_extent_size when we clear bits from a bitmap since it could have been from the range that contains the max_extent_size. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
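A simplified model of the invalidation, assuming a 64-bit bitmap where a set bit means "free" and a cached `max_extent_size` of 0 means "unknown, recompute". The names and the caching policy are illustrative and do not mirror the btrfs free-space cache structures.

```c
#include <assert.h>
#include <stdint.h>

struct demo_bitmap {
	uint64_t bits;			/* 1 = free unit */
	unsigned int max_extent_size;	/* cached longest free run, 0 = unknown */
};

/* Length of the longest run of set bits. */
static unsigned int demo_longest_run(uint64_t bits)
{
	unsigned int i, cur = 0, best = 0;

	for (i = 0; i < 64; i++) {
		if ((bits >> i) & 1) {
			if (++cur > best)
				best = cur;
		} else {
			cur = 0;
		}
	}
	return best;
}

/* Return the cached value, recomputing it if it was invalidated. */
static unsigned int demo_max_extent(struct demo_bitmap *b)
{
	if (!b->max_extent_size)
		b->max_extent_size = demo_longest_run(b->bits);
	return b->max_extent_size;
}

/* Clear [start, start + count) and drop the cached maximum. */
static void demo_clear_bits(struct demo_bitmap *b, unsigned int start,
			    unsigned int count)
{
	uint64_t mask;

	assert(start < 64 && count > 0 && count <= 64 - start);
	mask = (count == 64) ? ~0ULL : ((1ULL << count) - 1) << start;
	b->bits &= ~mask;
	/* The cleared range may have contained the cached maximum. */
	b->max_extent_size = 0;
}
```

Without the reset in `demo_clear_bits()`, a caller would keep seeing the stale run length of 8 after the run had been split.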
2018-10-15btrfs: protect space cache inode alloc with GFP_NOFSJosef Bacik1-0/+8
If we're allocating a new space cache inode it's likely going to be under a transaction handle, so we need to use memalloc_nofs_save() in order to avoid deadlocks, and more importantly lockdep messages that make xfstests fail. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: release metadata before running delayed refsJosef Bacik1-3/+3
We want to release the unused reservation we have since it refills the delayed refs reserve, which will make everything go smoother when running the delayed refs if we're short on our reservation. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15Btrfs: kill btrfs_clear_path_blockingLiu Bo3-58/+4
Btrfs's btree locking has two modes, spinning mode and blocking mode. While searching the btree, locks are always acquired in spinning mode and then converted to blocking mode if necessary, and in some hot paths we may switch the locking back to spinning mode by btrfs_clear_path_blocking(). When acquiring locks, both readers and writers need to wait for blocking readers and writers to complete before doing read_lock()/write_lock(). The problem is that btrfs_clear_path_blocking() needs to switch nodes in the path to blocking mode at first (by btrfs_set_path_blocking) to make lockdep happy before doing its actual clearing blocking job. Switching to blocking mode from spinning mode consists of step 1) bumping up the blocking readers counter and step 2) read_unlock()/write_unlock(). This causes a serious ping-pong effect if there is a large number of concurrent readers/writers, as waiters will be woken up and go to sleep immediately. 1) Killing this kind of ping-pong results in a big improvement in my 1600k files creation script, MNT=/mnt/btrfs mkfs.btrfs -f /dev/sdf mount /dev/sdf $MNT time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \ -d $MNT/0 -d $MNT/1 \ -d $MNT/2 -d $MNT/3 \ -d $MNT/4 -d $MNT/5 \ -d $MNT/6 -d $MNT/7 \ -d $MNT/8 -d $MNT/9 \ -d $MNT/10 -d $MNT/11 \ -d $MNT/12 -d $MNT/13 \ -d $MNT/14 -d $MNT/15 w/o patch: real 2m27.307s user 0m12.839s sys 13m42.831s w/ patch: real 1m2.273s user 0m15.802s sys 8m16.495s 1.1) latency histogram from funclatency[1] Overall with the patch, there are ~50% fewer write lock acquisitions and the 95% max latency that the write lock takes also reduces to ~100ms from >500ms.
-------------------------------------------- w/o patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 2385222 |****************************************| 2 -> 3 : 37147 | | 4 -> 7 : 20452 | | 8 -> 15 : 13131 | | 16 -> 31 : 3877 | | 32 -> 63 : 3900 | | 64 -> 127 : 2612 | | 128 -> 255 : 974 | | 256 -> 511 : 165 | | 512 -> 1023 : 13 | | Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6743860 |****************************************| 2 -> 3 : 2146 | | 4 -> 7 : 190 | | 8 -> 15 : 38 | | 16 -> 31 : 4 | | -------------------------------------------- w/ patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 1318454 |****************************************| 2 -> 3 : 6800 | | 4 -> 7 : 3664 | | 8 -> 15 : 2145 | | 16 -> 31 : 809 | | 32 -> 63 : 219 | | 64 -> 127 : 10 | | Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6854317 |****************************************| 2 -> 3 : 2383 | | 4 -> 7 : 601 | | 8 -> 15 : 92 | | 2) dbench also proves the improvement, dbench -t 120 -D /mnt/btrfs 16 w/o patch: Throughput 158.363 MB/sec w/ patch: Throughput 449.52 MB/sec 3) xfstests didn't show any additional failures. One thing to note is that callers may set path->leave_spinning to have all nodes in the path stay in spinning mode, which means callers are ready to not sleep before releasing the path, but it won't cause problems if they don't want to sleep in blocking mode. [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: dev-replace: remove pointless assert in write unlockDavid Sterba1-1/+0
The value of blocking_readers is increased only when the lock is taken for read, no way we can fail the condition with the write lock. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: dev-replace: move replace members out of fs_infoDavid Sterba3-15/+16
The replace_wait and bio_counter were mistakenly added to fs_info in commit c404e0dc2c843b154f ("Btrfs: fix use-after-free in the finishing procedure of the device replace"), but they logically belong to fs_info::dev_replace. Besides, bio_counter is a very generic name and is confusing in bare fs_info context. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: dev-replace: avoid useless lock on error handling pathDavid Sterba1-1/+6
The exit sequence in btrfs_dev_replace_start does not allow simply adding a label in the right place for the error handling after a failed transaction start to jump to. Currently there's a lock that pairs with the unlock in the section, which is unnecessary and only raises questions. Add a variable to track the locking status and avoid the extra locking. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
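The general pattern of tracking lock state for a shared exit label can be sketched as follows. This is a generic illustration with hypothetical names (`demo_lock`, `demo_replace_start()`, the `fail_step` parameter) and does not reproduce the control flow of btrfs_dev_replace_start itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock that asserts on double acquire or release of an unheld lock. */
struct demo_lock {
	bool held;
};

static void demo_lock_acquire(struct demo_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void demo_lock_release(struct demo_lock *l)
{
	assert(l->held);
	l->held = false;
}

/*
 * fail_step selects where the hypothetical operation fails:
 * 0 = success, 1 = while the lock is held, 2 = after it was dropped.
 */
static int demo_replace_start(struct demo_lock *l, int fail_step)
{
	bool need_unlock;
	int ret = 0;

	demo_lock_acquire(l);
	need_unlock = true;
	if (fail_step == 1) {
		ret = -1;
		goto leave;	/* still locked, the exit path unlocks */
	}
	demo_lock_release(l);
	need_unlock = false;

	if (fail_step == 2) {
		ret = -2;
		goto leave;	/* already unlocked, no need to retake the lock */
	}

leave:
	if (need_unlock)
		demo_lock_release(l);
	return ret;
}
```

Both error paths share one label, and the flag decides whether an unlock is still owed, so no path has to retake the lock just to satisfy the exit sequence.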
2018-10-15btrfs: open code btrfs_after_dev_replace_commitDavid Sterba3-10/+4
Too trivial, the purpose can be simply documented in a comment. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: open code btrfs_dev_replace_stats_incDavid Sterba2-12/+4
The wrapper is too trivial, open coding does not make it less readable. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: open code btrfs_dev_replace_clear_lock_blockingDavid Sterba3-15/+5
There's a single caller and the function name does not say it's actually taking the lock, so open coding makes it more explicit. For now, btrfs_dev_replace_read_lock is used instead of read_lock so it's paired with the unlocking wrapper in the same block. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>