Age | Commit message (Collapse) | Author | Files | Lines |
|
As noted by Minchan, a benefit of reading idle flag from /proc/kpageflags
is that one can easily filter dirty and/or unevictable pages while
estimating the size of unused memory.
Note that idle flag read from /proc/kpageflags may be stale in case the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up-to-date one has to read
/sys/kernel/mm/page_idle/bitmap first.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:
- it does not count unmapped file pages
- it affects the reclaimer logic
To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g. by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.
The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.
Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup each
page is charged to, indexed by PFN. Having this information is useful for
estimating a cgroup working set size.
The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After commmit e44163e17796 ("btrfs: explictly delete unused block groups
in close_ctree and ro-remount"), added in the 4.3 merge window, we have
calls to btrfs_delete_unused_bgs() while holding the cleaner_mutex.
This can cause a deadlock with a concurrent block group relocation (when
a filesystem balance or shrink operation is in progress for example)
because btrfs_delete_unused_bgs() locks delete_unused_bgs_mutex and the
relocation path locks first delete_unused_bgs_mutex and then it locks
cleaner_mutex, resulting in a classic ABBA deadlock:
CPU 0 CPU 1
lock fs_info->cleaner_mutex
__btrfs_balance() || btrfs_shrink_device()
lock fs_info->delete_unused_bgs_mutex
btrfs_relocate_chunk()
btrfs_relocate_block_group()
lock fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
lock fs_info->delete_unused_bgs_mutex
Fix this by not taking the cleaner_mutex before calling
btrfs_delete_unused_bgs() because it's no longer needed after
commit 67c5e7d464bc ("Btrfs: fix race between balance and unused block
group deletion"). The mutex fs_info->delete_unused_bgs_mutex, the
spinlock fs_info->unused_bgs_lock and a block group's spinlock are
enough to get correct serialization between tasks running relocation
and unused block group deletion (as well as between multiple tasks
concurrently calling btrfs_delete_unused_bgs()).
This issue was discussed (in the mailing list) during the review of
the patch titled "btrfs: explictly delete unused block groups in
close_ctree and ro-remount" and it was agreed that acquiring the
cleaner mutex had to be dropped after the patch titled
"Btrfs: fix race between balance and unused block group deletion"
got merged (both patches were submitted at about the same time, but
one landed in kernel 4.2 and the other in the 4.3 merge window).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
|
Followup to the UFS series - with the way we clear the new blocks (via
buffer cache, possibly on more than a page worth of file) we really
should not insert a reference to new block into inode block tree until
after we'd cleared it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull cifs updates from Steve French:
"Small cifs fix and a patch for improved debugging"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix use-after-free on mid_q_entry
Update cifs version number
Add way to query server fs info for smb3
|
|
As part of the v4.3 merge window the DAX code was updated by Matthew and
Kirill to handle PMD pages. Also as part of the v4.3 merge window we
updated the DAX code to do proper PMEM flushing (commit 2765cfbb342c:
"dax: update I/O path to do proper PMEM flushing").
The additional code added by the DAX PMD patches also needs to be
updated to properly use the PMEM API. This ensures that after a PMD
fault is handled the zeros written to the newly allocated pages are
durable on the DIMMs.
linux/dax.h is included to get rid of a bunch of sparse warnings.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>,
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When readahead encounters file holes, osd reply returns error -ENOENT,
finish_read() skips adding pages to the the page cache. So readahead
does not work for file holes. The fix is adding zero pages to the
page cache when -ENOENT is returned.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
Merge second patch-bomb from Andrew Morton:
"Almost all of the rest of MM. There was an unusually large amount of
MM material this time"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
zpool: remove no-op module init/exit
mm: zbud: constify the zbud_ops
mm: zpool: constify the zpool_ops
mm: swap: zswap: maybe_preload & refactoring
zram: unify error reporting
zsmalloc: remove null check from destroy_handle_cache()
zsmalloc: do not take class lock in zs_shrinker_count()
zsmalloc: use class->pages_per_zspage
zsmalloc: consider ZS_ALMOST_FULL as migrate source
zsmalloc: partial page ordering within a fullness_list
zsmalloc: use shrinker to trigger auto-compaction
zsmalloc: account the number of compacted pages
zsmalloc/zram: introduce zs_pool_stats api
zsmalloc: cosmetic compaction code adjustments
zsmalloc: introduce zs_can_compact() function
zsmalloc: always keep per-class stats
zsmalloc: drop unused variable `nr_to_migrate'
mm/memblock.c: fix comment in __next_mem_range()
mm/page_alloc.c: fix type information of memoryless node
memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap updates from Mark Brown:
"This has been a busy release for regmap.
By far the biggest set of changes here are those from Markus Pargmann
which implement support for block transfers in smbus devices. This
required quite a bit of refactoring but leaves us better able to
handle odd restrictions that controllers may have and with better
performance on smbus.
Other new features include:
- Fix interactions with lockdep for nested regmaps (eg, when a device
using regmap is connected to a bus where the bus controller has a
separate regmap). Lockdep's default class identification is too
crude to work without help.
- Support for must write bitfield operations, useful for operations
which require writing a bit to trigger them from Kuniori Morimoto.
- Support for delaying during register patch application from Nariman
Poushin.
- Support for overriding cache state via the debugfs implementation
from Richard Fitzgerald"
* tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (25 commits)
regmap: fix a NULL pointer dereference in __regmap_init
regmap: Support bulk reads for devices without raw formatting
regmap-i2c: Add smbus i2c block support
regmap: Add raw_write/read checks for max_raw_write/read sizes
regmap: regmap max_raw_read/write getter functions
regmap: Introduce max_raw_read/write for regmap_bulk_read/write
regmap: Add missing comments about struct regmap_bus
regmap: No multi_write support if bus->write does not exist
regmap: Split use_single_rw internally into use_single_read/write
regmap: Fix regmap_bulk_write for bus writes
regmap: regmap_raw_read return error on !bus->read
regulator: core: Print at debug level on debugfs creation failure
regmap: Fix regmap_can_raw_write check
regmap: fix typos in regmap.c
regmap: Fix integertypes for register address and value
regmap: Move documentation to regmap.h
regmap: Use different lockdep class for each regmap init call
thermal: sti: Add parentheses around bridge->ops->regmap_init call
mfd: vexpress: Add parentheses around bridge->ops->regmap_init call
regmap: debugfs: Fix misuse of IS_ENABLED
...
|
|
This is based on the shmem version, but it has diverged quite a bit. We
have no swap to worry about, nor the new file sealing. Add
synchronication via the fault mutex table to coordinate page faults,
fallocate allocation and fallocate hole punch.
What this allows us to do is move physical memory in and out of a
hugetlbfs file without having it mapped. This also gives us the ability
to support MADV_REMOVE since it is currently implemented using
fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of
a hugetlbfs file, which wasn't possible before.
hugetlbfs fallocate only operates on whole huge pages.
Based on code by Dave Hansen.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Modify truncate_hugepages() to take a range of pages (start, end)
instead of simply start. If an end value of LLONG_MAX is passed, the
current "truncate" functionality is maintained. Existing callers are
modified to pass LLONG_MAX as end of range. By keying off end ==
LLONG_MAX, the routine behaves differently for truncate and hole punch.
Page removal is now synchronized with page allocation via faults by
using the fault mutex table. The hole punch case can experience the
rare region_del error and must handle accordingly.
Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
the case where region_del returns an error.
Since the routine handles more than just the truncate case, it is
renamed to remove_inode_hugepages(). To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().
Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from region_del and
pass it back to the caller.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
fallocate hole punch will want to unmap a specific range of pages.
Modify the existing hugetlb_vmtruncate_list() routine to take a
start/end range. If end is 0, this indicates all pages after start
should be unmapped. This is the same as the existing truncate
functionality. Modify existing callers to add 0 as end of range.
Since the routine will be used in hole punch as well as truncate
operations, it is more appropriately renamed to hugetlb_vmdelete_list().
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We want to know per-process workingset size for smart memory management
on userland and we use swap(ex, zram) heavily to maximize memory
efficiency so workingset includes swap as well as RSS.
On such system, if there are lots of shared anonymous pages, it's really
hard to figure out exactly how many each process consumes memory(ie, rss
+ wap) if the system has lots of shared anonymous memory(e.g, android).
This patch introduces SwapPss field on /proc/<pid>/smaps so we can get
more exact workingset size per process.
Bongkyu tested it. Result is below.
1. 50M used swap
SwapTotal: 461976 kB
SwapFree: 411192 kB
$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
48236
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
141184
2. 240M used swap
SwapTotal: 461976 kB
SwapFree: 216808 kB
$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
230315
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
1387744
[akpm@linux-foundation.org: simplify kunmap_atomic() call]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Bongkyu Kim <bongkyu.kim@lge.com>
Tested-by: Bongkyu Kim <bongkyu.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch sets bit 56 in pagemap if this page is mapped only once. It
allows to detect exclusively used pages without exposing PFN:
present file exclusive state
0 0 0 non-present
1 1 0 file page mapped somewhere else
1 1 1 file page mapped only here
1 0 0 anon non-CoWed page (shared with parent/child)
1 0 1 anon CoWed page (or never forked)
CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.
MMap-exclusive bit doesn't reflect potential page-sharing via swapcache:
page could be mapped once but has several swap-ptes which point to it.
Application could detect that by swap bit in pagemap entry and touch that
pte via /proc/pid/mem to get real information.
See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com
Requested by Mark Williamson.
[akpm@linux-foundation.org: fix spello]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Mark Williamson <mwilliamson@undo-software.com>
Tested-by: Mark Williamson <mwilliamson@undo-software.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch makes pagemap readable for normal users and hides physical
addresses from them. For some use-cases PFN isn't required at all.
See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name
Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Mark Williamson <mwilliamson@undo-software.com>
Tested-by: Mark Williamson <mwilliamson@undo-software.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch moves pmd dissection out of reporting loop: huge pages are
reported as bunch of normal pages with contiguous PFNs.
Add missing "FILE" bit in hugetlb vmas.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Mark Williamson <mwilliamson@undo-software.com>
Tested-by: Mark Williamson <mwilliamson@undo-software.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch removes page-shift bits (scheduled to remove since 3.11) and
completes migration to the new bit layout. Also it cleans messy macro.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mark Williamson <mwilliamson@undo-software.com>
Tested-by: Mark Williamson <mwilliamson@undo-software.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patchset makes pagemap useable again in the safe way (after row
hammer bug it was made CAP_SYS_ADMIN-only). This patchset restores access
for non-privileged users but hides PFNs from them.
Also it adds bit 'map-exclusive' which is set if page is mapped only here:
it helps in estimation of working set without exposing pfns and allows to
distinguish CoWed and non-CoWed private anonymous pages.
Second patch removes page-shift bits and completes migration to the new
pagemap format: flags soft-dirty and mmap-exclusive are available only in
the new format.
This patch (of 5):
This patch moves permission checks from pagemap_read() into pagemap_open().
Pointer to mm is saved in file->private_data. This reference pins only
mm_struct itself. /proc/*/mem, maps, smaps already work in the same way.
See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Mark Williamson <mwilliamson@undo-software.com>
Tested-by: Mark Williamson <mwilliamson@undo-software.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap.
__dax_pmd_fault() uses unmap_mapping_range() shoot out zero page from
all mappings. We need to drop i_mmap_lock there to avoid lock deadlock.
Re-aquiring the lock should be fine since we check i_size after the
point.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I was basically open-coding it (thanks to copying code from do_fault()
which probably also needs to be fixed).
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If the first access to a huge page was a store, there would be no existing
zero pmd in this process's page tables. There could be a zero pmd in
another process's page tables, if it had done a load. We can detect this
case by noticing that the buffer_head returned from the filesystem is New,
and ensure that other processes mapping this huge page have their page
tables flushed.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is another place where DAX assumed that pgtable_t was a pointer.
Open code the important parts of set_huge_zero_page() in DAX and make
set_huge_zero_page() static again.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If two threads write-fault on the same hole at the same time, the winner
of the race will return to userspace and complete their store, only to
have the loser overwrite their store with zeroes. Fix this for now by
taking the i_mmap_sem for write instead of read, and do so outside the
call to get_block(). Now the loser of the race will see the block has
already been zeroed, and will not zero it again.
This severely limits our scalability. I have ideas for improving it, but
those can wait for a later patch.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Jan Kara pointed out that in the case where we are writing to a hole, we
can end up with a lock inversion between the page lock and the journal
lock. We can avoid this by starting the transaction in ext4 before
calling into DAX. The journal lock nests inside the superblock
pagefault lock, so we have to duplicate that code from dax_fault, like
XFS does.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
DAX wants different semantics from any currently-existing ext4 get_block
callback. Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able to
return unwritten extents. So introduce a new ext4_get_block_dax() which
has those semantics.
We could also change ext4_get_block_write() to honour the 'create' flag,
but that might have consequences on other users that I do not currently
understand.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Jan Kara pointed out I should be more explicit here about the perils of
racing against truncate. The comment is mostly the same as for the PTE
case.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
DAX relies on the get_block function either zeroing newly allocated
blocks before they're findable by subsequent calls to get_block, or
marking newly allocated blocks as unwritten. ext4_get_block() cannot
create unwritten extents, but ext4_get_block_write() can.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reported-by: Andy Rudoff <andy.rudoff@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix typo s/CONFIG_TRANSPARENT_HUGEPAGES/CONFIG_TRANSPARENT_HUGEPAGE/ in
#endif comment introduced by commit 2b26a9206d6a ("dax: add huge page
fault support").
Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use DAX to provide support for huge pages.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use DAX to provide support for huge pages.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use DAX to provide support for huge pages.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the support code for DAX-enabled filesystems to allow them to
provide huge pages in response to faults.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h. Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>. We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().
Summary:
- Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map.
This facility is used by the pmem driver to enable pfn_to_page()
operations on the page frames returned by DAX ('direct_access' in
'struct block_device_operations').
For now, the 'memmap' allocation for these "device" pages comes
from "System RAM". Support for allocating the memmap from device
memory will arrive in a later kernel.
- Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3.
Completion of the conversion is targeted for v4.4.
- Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
- Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
- Miscellaneous updates and fixes to libnvdimm including support for
issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes"
* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
|
|
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
parent inode is needed in creating new inode case. For ceph_open,
the target inode already exists.
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
err != 0 is already handled. So skip this.
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
During MDS failovers, MClientSnap message may cause kclient to move
some inodes from root directory's snaprealm to mdsdir's snaprealm
and queue snapshots for these inodes. For a FS has never created any
snapshot, both root directory's snaprealm and mdsdir's snaprealm
share the same snapshot contexts (both are ceph_empty_snapc). This
confuses ceph_put_wrbuffer_cap_refs(), make it unable to distinguish
snapshot buffers from head buffers.
The fix is do not use ceph_empty_snapc as snaprealm's cached context.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
After forced umount, ceph_writepages_start() skips flushing dirty
pages. To make sure inode's reference count get dropped to zero,
we need to invalidate dirty pages.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
This patch makes try_get_cap_refs() and __do_request() check
if the file system was forced umount, and return -EIO if it was.
This patch also adds a helper function to drops dirty caps and
wakes up blocking operation.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull ecryptfs fixes from Tyler Hicks:
"Invalidate stale eCryptfs dcache entries caused by unlinked lower
inodes"
* tag 'ecryptfs-4.3-rc1-stale-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
eCryptfs: Delete a check before the function call "key_put"
eCryptfs: Invalidate dcache entries when lower i_nlink is zero
|
|
Commit 2e6e518335f8 ("Btrfs: fix block group ->space_info null pointer
dereference") accidently marked a space info as full when initializing
it with a value of 0 total bytes. This introduces an ENOSPC problem when
writing file data if we mount a filesystem that has no data block groups
allocated, because the data space info is initialized with 0 total bytes,
marked as full, and it never gets its total bytes incremented by a
(positive) value to unmark it as full (because there are no data block
groups loaded when the fs is mounted).
For metadata and system spaces this issue can never happen since we always
have at least one metadata block group and one system block group (even
for an empty filesystem).
So fix this by just not initializing a space info as full, reverting the
offending part of the commit mentioned above.
The following test case for fstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
# Mount our filesystem without space caches enabled so that we do not
# get any space used from the initial data block group that mkfs creates
# (space caches used space from data block groups).
_scratch_mount "-o nospace_cache"
# Need an fs with at least 2Gb to make sure mkfs.btrfs does not create
# an fs using mixed block groups (used both for data and metadata). We
# really need to have dedicated block groups for data to reproduce the
# issue and mkfs.btrfs defaults to mixed block groups only for small
# filesystems (up to 1Gb).
_require_fs_space $SCRATCH_MNT $((2 * 1024 * 1024))
# Run balance with the purpose of deleting the unused data block group
# that mkfs created. We could also wait for the background kthread to
# automatically delete the unused block group, but we do not have a way
# to make it run and wait for it to complete, so just do a balance
# instead of some unreliable sleep
_run_btrfs_util_prog balance start -dusage=0 $SCRATCH_MNT
# Now unmount the filesystem, mount it again (either with or with space
# caches enabled, it does not matter to trigger the problem) and attempt
# to create a file with some data - this used to fail with ENOSPC
# because there were no data block groups when the filesystem was
# mounted and the data space info object was marked as full when
# initialized (because it had 0 total bytes), which prevented the file
# write path from attempting to allocate a data block group and fail
# immediately with ENOSPC.
_scratch_remount
echo "hello world" > $SCRATCH_MNT/foobar
echo "Silence is golden"
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
|
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable patches:
- Fix atomicity of pNFS commit list updates
- Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY)
- nfs_set_pgio_error sometimes misses errors
- Fix a thinko in xs_connect()
- Fix borkage in _same_data_server_addrs_locked()
- Fix a NULL pointer dereference of migration recovery ops for v4.2
client
- Don't let the ctime override attribute barriers.
- Revert "NFSv4: Remove incorrect check in can_open_delegated()"
- Ensure flexfiles pNFS driver updates the inode after write finishes
- flexfiles must not pollute the attribute cache with attrbutes from
the DS
- Fix a protocol error in layoutreturn
- Fix a protocol issue with NFSv4.1 CLOSE stateids
Bugfixes + cleanups
- pNFS blocks bugfixes from Christoph
- Various cleanups from Anna
- More fixes for delegation corner cases
- Don't fsync twice for O_SYNC/IS_SYNC files
- Fix pNFS and flexfiles layoutstats bugs
- pnfs/flexfiles: avoid duplicate tracking of mirror data
- pnfs: Fix layoutget/layoutreturn/return-on-close serialisation
issues
- pnfs/flexfiles: error handling retries a layoutget before fallback
to MDS
Features:
- Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from
Kinglong
- More RDMA client transport improvements from Chuck
- Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr()
verbs from the SUNRPC, Lustre and core infiniband tree.
- Optimise away the close-to-open getattr if there is no cached data"
* tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (108 commits)
NFSv4: Respect the server imposed limit on how many changes we may cache
NFSv4: Express delegation limit in units of pages
Revert "NFS: Make close(2) asynchronous when closing NFS O_DIRECT files"
NFS: Optimise away the close-to-open getattr if there is no cached data
NFSv4.1/flexfiles: Clean up ff_layout_write_done_cb/ff_layout_commit_done_cb
NFSv4.1/flexfiles: Mark the layout for return in ff_layout_io_track_ds_error()
nfs: Remove unneeded checking of the return value from scnprintf
nfs: Fix truncated client owner id without proto type
NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid
NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid
NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid()
NFSv4.1/flexfiles: Fix freeing of mirrors
NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file
NFSv4.1/pnfs: Handle LAYOUTGET return values correctly
NFSv4.1/pnfs: Don't ask for a read layout for an empty file.
NFSv4.1: Fix a protocol issue with CLOSE stateids
NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors
SUNRPC: Prevent SYN+SYNACK+RST storms
SUNRPC: xs_reset_transport must mark the connection as disconnected
NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"There isn't a whole lot to this update - it's mostly bug fixes and
they are spread pretty much all over XFS. There are some corruption
fixes, some fixes for log recovery, some fixes that prevent unount
from hanging, a lockdep annotation rework for inode locking to prevent
false positives and the usual random bunch of cleanups and minor
improvements.
Deatils:
- large rework of EFI/EFD lifecycle handling to fix log recovery
corruption issues, crashes and unmount hangs
- separate metadata UUID on disk to enable changing boot label UUID
for v5 filesystems
- fixes for gcc miscompilation on certain platforms and optimisation
levels
- remote attribute allocation and recovery corruption fixes
- inode lockdep annotation rework to fix bugs with too many
subclasses
- directory inode locking changes to prevent lockdep false positives
- a handful of minor corruption fixes
- various other small cleanups and bug fixes"
* tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (42 commits)
xfs: fix error gotos in xfs_setattr_nonsize
xfs: add mssing inode cache attempts counter increment
xfs: return errors from partial I/O failures to files
libxfs: bad magic number should set da block buffer error
xfs: fix non-debug build warnings
xfs: collapse allocsize and biosize mount option handling
xfs: Fix file type directory corruption for btree directories
xfs: lockdep annotations throw warnings on non-debug builds
xfs: Fix uninitialized return value in xfs_alloc_fix_freelist()
xfs: inode lockdep annotations broke non-lockdep build
xfs: flush entire file on dio read/write to cached file
xfs: Fix xfs_attr_leafblock definition
libxfs: readahead of dir3 data blocks should use the read verifier
xfs: stop holding ILOCK over filldir callbacks
xfs: clean up inode lockdep annotations
xfs: swap leaf buffer into path struct atomically during path shift
xfs: relocate sparse inode mount warning
xfs: dquots should be stamped with sb_meta_uuid
xfs: log recovery needs to validate against sb_meta_uuid
xfs: growfs not aware of sb_meta_uuid
...
|
|
The NFSv4 delegation spec allows the server to tell a client to limit how
much data it cache after the file is closed. In return, the server
guarantees enough free space to avoid ENOSPC situations, etc.
Prior to this patch, we assumed we could always cache aggressively after
close. Unfortunately, this causes problems with servers that set the
limit to 0 and therefore do not offer any ENOSPC guarantees.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Since we're tracking modifications to the page cache on a per-page
basis, it makes sense to express the limit to how much we may cache
in units of pages.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
"In this one:
- d_move fixes (Eric Biederman)
- UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
handling ought to be fixed)
- switch of sb_writers to percpu rwsem (Oleg Nesterov)
- superblock scalability (Josef Bacik and Dave Chinner)
- swapon(2) race fix (Hugh Dickins)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
vfs: Test for and handle paths that are unreachable from their mnt_root
dcache: Reduce the scope of i_lock in d_splice_alias
dcache: Handle escaped paths in prepend_path
mm: fix potential data race in SyS_swapon
inode: don't softlockup when evicting inodes
inode: rename i_wb_list to i_io_list
sync: serialise per-superblock sync operations
inode: convert inode_sb_list_lock to per-sb
inode: add hlist_fake to avoid the inode hash lock in evict
writeback: plug writeback at a high level
change sb_writers to use percpu_rw_semaphore
shift percpu_counter_destroy() into destroy_super_work()
percpu-rwsem: kill CONFIG_PERCPU_RWSEM
percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
percpu-rwsem: introduce percpu_down_read_trylock()
document rwsem_release() in sb_wait_write()
fix the broken lockdep logic in __sb_start_write()
introduce __sb_writers_{acquired,release}() helpers
ufs_inode_get{frag,block}(): get rid of 'phys' argument
ufs_getfrag_block(): tidy up a bit
...
|