summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2017-09-09mm/memory_hotplug: introduce add_pagesMichal Hocko1-0/+11
There are new users of memory hotplug emerging. Some of them require different subset of arch_add_memory. There are some which only require allocation of struct pages without mapping those pages to the kernel address space. We currently have __add_pages for that purpose. But this is rather lowlevel and not very suitable for the code outside of the memory hotplug. E.g. x86_64 wants to update max_pfn which should be done by the caller. Introduce add_pages() which should care about those details if they are needed. Each architecture should define its implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use the currently existing __add_pages. Link: http://lkml.kernel.org/r/20170817000548.32038-7-jglisse@redhat.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09mm/hmm/mirror: device page fault handlerJérôme Glisse1-0/+27
This handles page fault on behalf of device driver, unlike handle_mm_fault() it does not trigger migration back to system memory for device memory. Link: http://lkml.kernel.org/r/20170817000548.32038-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09mm/hmm/mirror: helper to snapshot CPU page tableJérôme Glisse1-2/+53
This does not use existing page table walker because we want to share same code for our page fault handler. Link: http://lkml.kernel.org/r/20170817000548.32038-5-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09mm/hmm/mirror: mirror process address space on device with HMM helpersJérôme Glisse1-0/+110
This is a heterogeneous memory management (HMM) process address space mirroring. In a nutshell this provide an API to mirror process address space on a device. This boils down to keeping CPU and device page table synchronize (we assume that both device and CPU are cache coherent like PCIe device can be). This patch provide a simple API for device driver to achieve address space mirroring thus avoiding each device driver to grow its own CPU page table walker and its own CPU page table synchronization mechanism. This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more hardware in the future. [jglisse@redhat.com: fix hmm for "mmu_notifier kill invalidate_page callback"] Link: http://lkml.kernel.org/r/20170830231955.GD9445@redhat.com Link: http://lkml.kernel.org/r/20170817000548.32038-4-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09mm/hmm: heterogeneous memory management (HMM for short)Jérôme Glisse2-0/+158
HMM provides 3 separate types of functionality: - Mirroring: synchronize CPU page table and device page table - Device memory: allocating struct page for device memory - Migration: migrating regular memory to device memory This patch introduces some common helpers and definitions to all of those 3 functionality. Link: http://lkml.kernel.org/r/20170817000548.32038-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09mm: memory_hotplug: memory hotremove supports thp migrationNaoya Horiguchi1-1/+14
This patch enables thp migration for memory hotremove. Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09mm: soft-dirty: keep soft-dirty bits over thp migrationNaoya Horiguchi1-0/+2
Soft dirty bit is designed to keep tracked over page migration. This patch makes it work in the same manner for thp migration too. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09mm: thp: check pmd migration entry in common pathZi Yan1-2/+12
When THP migration is being used, memory management code needs to handle pmd migration entries properly. This patch uses !pmd_present() or is_swap_pmd() (depending on whether pmd_none() needs separate code or not) to check pmd migration entries at the places where a pmd entry is present. Since pmd-related code uses split_huge_page(), split_huge_pmd(), pmd_trans_huge(), pmd_trans_unstable(), or pmd_none_or_trans_huge_or_clear_bad(), this patch: 1. adds pmd migration entry split code in split_huge_pmd(), 2. takes care of pmd migration entries whenever pmd_trans_huge() is present, 3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware. Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable() is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change them. Until this commit, a pmd entry should be: 1. pointing to a pte page, 2. is_swap_pmd(), 3. pmd_trans_huge(), 4. pmd_devmap(), or 5. pmd_none(). Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09mm: thp: enable thp migration in generic pathZi Yan1-3/+70
Add thp migration's core code, including conversions between a PMD entry and a swap entry, setting PMD migration entry, removing PMD migration entry, and waiting on PMD migration entries. This patch makes it possible to support thp migration. If you fail to allocate a destination page as a thp, you just split the source thp as we do now, and then enter the normal page migration. If you succeed to allocate destination thp, you enter thp migration. Subsequent patches actually enable thp migration for each caller of page migration by allowing its get_new_page() callback to allocate thps. [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning] Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu [akpm@linux-foundation.org: fix x86_64 allnoconfig warning] Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATIONNaoya Horiguchi1-0/+10
Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration functionality to x86_64, which should be safer at the first step. Link: http://lkml.kernel.org/r/20170717193955.20207-5-zi.yan@sent.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu> Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-09mm: thp: introduce separate TTU flag for thp freezingNaoya Horiguchi1-1/+2
TTU_MIGRATION is used to convert pte into migration entry until thp split completes. This behavior conflicts with thp migration added later patches, so let's introduce a new TTU flag specifically for freezing. try_to_unmap() is used both for thp split (via freeze_page()) and page migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for head page is like below (assuming anonymous thp): (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \ TTU_MIGRATION | TTU_SPLIT_HUGE_PMD) and ttu_flag given for tail pages is: (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \ TTU_MIGRATION) __unmap_and_move() calls try_to_unmap() with ttu_flag: (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS) Now I'm trying to insert a branch for thp migration at the top of try_to_unmap_one() like below static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, unsigned long address, void *arg) { ... /* PMD-mapped THP migration entry */ if (!pvmw.pte && (flags & TTU_MIGRATION)) { if (!PageAnon(page)) continue; set_pmd_migration_entry(&pvmw, page); continue; } ... } so try_to_unmap() for tail pages called by thp split can go into thp migration code path (which converts *pmd* into migration entry), while the expectation is to freeze thp (which converts *pte* into migration entry.) I detected this failure as a "bad page state" error in a testcase where split_huge_page() is called from queue_pages_pte_range(). Link: http://lkml.kernel.org/r/20170717193955.20207-4-zi.yan@sent.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds2-2/+19
Pull SCSI updates from James Bottomley: "This is mostly updates of the usual suspects: lpfc, qla2xxx, hisi_sas, megaraid_sas, zfcp and a host of minor updates. The major driver change here is the elimination of the block based cciss driver in favour of the SCSI based hpsa driver (which now drives all the legacy cases cciss used to be required for). Plus a reset handler clean up and the redo of the SAS SMP handler to use bsg lib" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (279 commits) scsi: scsi-mq: Always unprepare before requeuing a request scsi: Show .retries and .jiffies_at_alloc in debugfs scsi: Improve requeuing behavior scsi: Call scsi_initialize_rq() for filesystem requests scsi: qla2xxx: Reset the logo flag, after target re-login. scsi: qla2xxx: Fix slow mem alloc behind lock scsi: qla2xxx: Clear fc4f_nvme flag scsi: qla2xxx: add missing includes for qla_isr scsi: qla2xxx: Fix an integer overflow in sysfs code scsi: aacraid: report -ENOMEM to upper layer from aac_convert_sgraw2() scsi: aacraid: get rid of one level of indentation scsi: aacraid: fix indentation errors scsi: storvsc: fix memory leak on ring buffer busy scsi: scsi_transport_sas: switch to bsg-lib for SMP passthrough scsi: smartpqi: remove the smp_handler stub scsi: hpsa: remove the smp_handler stub scsi: bsg-lib: pass the release callback through bsg_setup_queue scsi: Rework handling of scsi_device.vpd_pg8[03] scsi: Rework the code for caching Vital Product Data (VPD) scsi: rcu: Introduce rcu_swap_protected() ...
2017-09-08Merge tag 'secureexec-v4.14-rc1' of ↵Linus Torvalds3-21/+24
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull secureexec update from Kees Cook: "This series has the ultimate goal of providing a sane stack rlimit when running set*id processes. To do this, the bprm_secureexec LSM hook is collapsed into the bprm_set_creds hook so the secureexec-ness of an exec can be determined early enough to make decisions about rlimits and the resulting memory layouts. Other logic acting on the secureexec-ness of an exec is similarly consolidated. Capabilities needed some special handling, but the refactoring removed other special handling, so that was a wash" * tag 'secureexec-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Consolidate pdeath_signal clearing exec: Use sane stack rlimit under secureexec exec: Consolidate dumpability logic smack: Remove redundant pdeath_signal clearing exec: Use secureexec for clearing pdeath_signal exec: Use secureexec for setting dumpability LSM: drop bprm_secureexec hook commoncap: Move cap_elevated calculation into bprm_set_creds commoncap: Refactor to remove bprm_secureexec hook smack: Refactor to remove bprm_secureexec hook selinux: Refactor to remove bprm_secureexec hook apparmor: Refactor to remove bprm_secureexec hook binfmt: Introduce secureexec flag exec: Correct comments about "point of no return" exec: Rename bprm->cred_prepared to called_set_creds
2017-09-08Merge tag 'pstore-v4.14-rc1' of ↵Linus Torvalds1-9/+0
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore update from Kees Cook: "Make pstore permissions more versatile by removing CAP_SYSLOG requirement and defining more restrictive root directory DAC permissions default (0750, which can be adjust after boot unlike the CAP_SYSLOG check). Suggested by Nick Kralevich" * tag 'pstore-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps" pstore: Make default pstorefs root dir perms 0750
2017-09-08Merge branch 'quota_scaling' of ↵Linus Torvalds3-19/+21
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota scaling updates from Jan Kara: "This contains changes to make the quota subsystem more scalable. Reportedly it improves number of files created per second on ext4 filesystem on fast storage by about a factor of 2x" * 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits) quota: Add lock annotations to struct members quota: Reduce contention on dq_data_lock fs: Provide __inode_get_bytes() quota: Inline dquot_[re]claim_reserved_space() into callsite quota: Inline inode_{incr,decr}_space() into callsites quota: Inline functions into their callsites ext4: Disable dirty list tracking of dquots when journalling quotas quota: Allow disabling tracking of dirty dquots in a list quota: Remove dq_wait_unused from dquot quota: Move locking into clear_dquot_dirty() quota: Do not dirty bad dquots quota: Fix possible corruption of dqi_flags quota: Propagate ->quota_read errors from v2_read_file_info() quota: Fix error codes in v2_read_file_info() quota: Push dqio_sem down to ->read_file_info() quota: Push dqio_sem down to ->write_file_info() quota: Push dqio_sem down to ->get_next_id() quota: Push dqio_sem down to ->release_dqblk() quota: Remove locking for writing to the old quota format quota: Do not acquire dqio_sem for dquot overwrites in v2 format ...
2017-09-08Merge tag 'devicetree-for-4.14' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull DeviceTree updates from Rob Herring: "There's a few orphans in the conversion to %pOF printf specifiers included here that no one else picked up. Summary: - Convert more DT code to use of_property_read_* API. - Improve DT overlay support when adding multiple overlays - Convert printk's to %pOF format specifiers. Most went via subsystem trees, but picked up the remaining orphans - Correct unittests to use preferred "okay" for "status" property value - Add a KASLR seed property - Vendor prefixes for Mellanox, Theobroma System, Adaptrum, Moxa - Fix modalias buffer handling - Clean-up of include paths for building dtbs - Add bindings for amc6821, isl1208, tsl2x7x, srf02, and srf10 devices - Add nvmem bindings for MediaTek MT7623 and MT7622 SoC - Add compatible string for Allwinner H5 Mali-450 GPU - Fix links to old OpenFirmware docs with new mirror on devicetree.org - Remove status property from binding doc examples" * tag 'devicetree-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (45 commits) devicetree: Adjust status "ok" -> "okay" under drivers/of/ dt-bindings: Remove "status" from examples dt-bindings: pinctrl: sh-pfc: Use generic node name dt-bindings: Add vendor Mellanox dt-binding: net/phy: fix interrupts description virt: Convert to using %pOF instead of full_name macintosh: Convert to using %pOF instead of full_name ide: pmac: Convert to using %pOF instead of full_name microblaze: Convert to using %pOF instead of full_name dt-bindings: usb: musb: Grammar s/the/to/, s/is/are/ of: Use PLATFORM_DEVID_NONE definition of/device: Fix of_device_get_modalias() buffer handling of/device: Prevent buffer overflow in of_device_modalias() dt-bindings: add amc6821, isl1208 trivial bindings dt-bindings: add vendor prefix for Theobroma Systems of: search scripts/dtc/include-prefixes path for both CPP and DTC of: remove arch/$(SRCARCH)/boot/dts from include search path for CPP of: remove drivers/of/testcase-data from include search path for CPP of: return of_get_cpu_node from of_cpu_device_node_get if CPUs are not registered iio: srf08: add device tree binding for srf02 and srf10 ...
2017-09-08Merge tag 'leds_for_4.14' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds Pull LED updates from Jacek Anaszewski: "LED class drivers improvements: leds-pca955x: - add Device Tree support and bindings - use devm_led_classdev_register() - add GPIO support - prevent crippled LED class device name - check for I2C errors leds-gpio: - add optional retain-state-shutdown DT property - allow LED to retain state at shutdown leds-tlc591xx: - merge conditional tests - add missing of_node_put leds-powernv: - delete an error message for a failed memory allocation in powernv_led_create() leds-is31fl32xx.c - convert to using custom %pOF printf format specifier Constify attribute_group structures in: - leds-blinkm - leds-lm3533 Make several arrays static const in: - leds-aat1290 - leds-lp5521 - leds-lp5562 - leds-lp8501" * tag 'leds_for_4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds: leds: pca955x: check for I2C errors leds: gpio: Allow LED to retain state at shutdown dt-bindings: leds: gpio: Add optional retain-state-shutdown property leds: powernv: Delete an error message for a failed memory allocation in powernv_led_create() leds: lp8501: make several arrays static const leds: lp5562: make several arrays static const leds: lp5521: make several arrays static const leds: aat1290: make array max_mm_current_percent static const leds: pca955x: Prevent crippled LED device name leds: lm3533: constify attribute_group structure dt-bindings: leds: add pca955x leds: pca955x: add GPIO support leds: pca955x: use devm_led_classdev_register leds: pca955x: add device tree support leds: Convert to using %pOF instead of full_name leds: blinkm: constify attribute_group structures. leds: tlc591xx: add missing of_node_put leds: tlc591xx: merge conditional tests
2017-09-08Merge tag 'dmaengine-4.14-rc1' of git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds2-19/+83
Pull dmaengine updates from Vinod Koul: "This one features the usual updates to the drivers and one good part of removing DA_SG from core as it has no users. Summary: - Remove DMA_SG support as we have no users for this feature - New driver for Altera / Intel mSGDMA IP core - Support for memset in dmatest and qcom_hidma driver - Update for non cyclic mode in k3dma, bunch of update in bam_dma, bcm sba-raid - Constify device ids across drivers" * tag 'dmaengine-4.14-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (52 commits) dmaengine: sun6i: support V3s SoC variant dmaengine: sun6i: make gate bit in sun8i's DMA engines a common quirk dmaengine: rcar-dmac: document R8A77970 bindings dmaengine: xilinx_dma: Fix error code format specifier dmaengine: altera: Use macros instead of structs to describe the registers dmaengine: ti-dma-crossbar: Fix dra7 reserve function dmaengine: pl330: constify amba_id dmaengine: pl08x: constify amba_id dmaengine: bcm-sba-raid: Remove redundant SBA_REQUEST_STATE_COMPLETED dmaengine: bcm-sba-raid: Explicitly ACK mailbox message after sending dmaengine: bcm-sba-raid: Add debugfs support dmaengine: bcm-sba-raid: Remove redundant SBA_REQUEST_STATE_RECEIVED dmaengine: bcm-sba-raid: Re-factor sba_process_deferred_requests() dmaengine: bcm-sba-raid: Pre-ack async tx descriptor dmaengine: bcm-sba-raid: Peek mbox when we have no free requests dmaengine: bcm-sba-raid: Alloc resources before registering DMA device dmaengine: bcm-sba-raid: Improve sba_issue_pending() run duration dmaengine: bcm-sba-raid: Increase number of free sba_request dmaengine: bcm-sba-raid: Allow arbitrary number free sba_request dmaengine: bcm-sba-raid: Remove reqs_free_count from sba_device ...
2017-09-07Merge tag 'backlight-next-4.14' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight Pull backlight updates from Lee Jones: "Fix-ups: - Constification; pwm_bl - Use new GPIO API; gpio_backlight - Remove unused functionality; gpio_backlight Bug Fixes: - Fix artificial MAXREG limit; lm3630a_bl" * tag 'backlight-next-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: backlight: gpio_backlight: Delete pdata inversion backlight: gpio_backlight: Convert to use GPIO descriptor backlight: pwm_bl: Make of_device_ids const backlight: lm3630a: Bump REG_MAX value to 0x50 instead of 0x1F
2017-09-07Merge tag 'mfd-next-4.14' of ↵Linus Torvalds11-10/+480
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull MFD updates from Lee Jones: "New Drivers - RK805 Power Management IC (PMIC) - ROHM BD9571MWV-M MFD Power Management IC (PMIC) - Texas Instruments TPS68470 Power Management IC (PMIC) & LEDs New Device Support: - Add support for HiSilicon Hi6421v530 to hi6421-pmic-core - Add support for X-Powers AXP806 to axp20x - Add support for X-Powers AXP813 to axp20x - Add support for Intel Sunrise Point LPSS to intel-lpss-pci New Functionality: - Amend API to provide register layout; atmel-smc Fix-ups: - DT re-work; omap, nokia - Header file location change {I2C => MFD}; dm355evm_msp, tps65010 - Fix chip ID formatting issue(s); rk808 - Optionally register touchscreen devices; da9052-core - Documentation improvements; twl-core - Constification; rtsx_pcr, ab8500-core, da9055-i2c, da9052-spi - Drop unnecessary static declaration; max8925-i2c - Kconfig changes (missing deps and remove module support) - Slim down oversized licence statement; hi6421-pmic-core - Use managed resources (devm_*); lp87565 - Supply proper error checking/handling; t7l66xb Bug Fixes: - Fix counter duplication issue; da9052-core - Fix potential NULL deference issue; max8998 - Leave SPI-NOR write-protection bit alone; lpc_ich - Ensure device is put into reset during suspend; intel-lpss - Correct register offset variable size; omap-usb-tll" * tag 'mfd-next-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (61 commits) mfd: intel_soc_pmic: Differentiate between Bay and Cherry Trail CRC variants mfd: intel_soc_pmic: Export separate mfd-cell configs for BYT and CHT dt-bindings: mfd: Add bindings for ZII RAVE devices mfd: omap-usb-tll: Fix register offsets mfd: da9052: Constify spi_device_id mfd: intel-lpss: Put I2C and SPI controllers into reset state on suspend mfd: da9055: Constify i2c_device_id mfd: intel-lpss: Add missing PCI ID for Intel Sunrise Point LPSS devices mfd: t7l66xb: Handle return value of clk_prepare_enable mfd: Add ROHM BD9571MWV-M PMIC DT bindings mfd: intel_soc_pmic_chtwc: Turn Kconfig option into a bool mfd: lp87565: Convert to use devm_mfd_add_devices() mfd: Add support for TPS68470 device mfd: lpc_ich: Do not touch SPI-NOR write protection bit on Haswell/Broadwell mfd: syscon: atmel-smc: Add helper to retrieve register layout mfd: axp20x: Use correct platform device ID for many PEK dt-bindings: mfd: axp20x: Introduce bindings for AXP813 mfd: axp20x: Add support for AXP813 PMIC dt-bindings: mfd: axp20x: Add AXP806 to supported list of chips mfd: Add ROHM BD9571MWV-M MFD PMIC driver ...
2017-09-07Merge tag 'mmc-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmcLinus Torvalds5-15/+112
Pull MMC updates from Ulf Hansson: "MMC core: - Continue to refactor the mmc block code to prepare for blkmq - Move mmc block debugfs into block module - Next step for eMMC CMDQ by adding a new mmc host interface for it - Move Kconfig option MMC_DEBUG from core to host - Some additional minor improvements MMC host: - Declare structs as const when applicable - Explicitly request exclusive reset control when applicable - Improve some error paths and other various cleanups - sdhci: Preparations to support SDHCI OMAP - sdhci: Improve some PM related code - sdhci: Re-factoring and modernizations - sdhci-xenon: Add runtime PM and system sleep support - sdhci-xenon: Add support for eMMC HS400 Enhanced Strobe - sdhci-cadence: Add system sleep support - sdhci-of-at91: Improve system sleep support - dw_mmc: Add support for Hisilicon hi3660 - sunxi: Add support for A83T eMMC - sunxi: Add support for DDR52 mode - meson-gx: Add support for UHS-I SD-cards - meson-gx: Cleanups and improvements - tmio: Fix CMD12 (STOP) handling - tmio: Cleanups and improvements - renesas_sdhi: Add r8a7743/5 support - renesas-sdhi: Add support for R-Car Gen3 SDHI DMAC - renesas_sdhi: Cleanups and improvements" * tag 'mmc-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (145 commits) mmc: renesas_sdhi: Add r8a7743/5 support mmc: meson-gx: fix __ffsdi2 undefined on arm32 mmc: sdhci-xenon: add runtime pm support and reimplement standby mmc: core: Move mmc_start_areq() declaration mmc: mmci: stop building qcom dml as module mmc: sunxi: Reset the device at probe time clk: sunxi-ng: Provide a default reset hook mmc: meson-gx: rework tuning function mmc: meson-gx: change default tx phase mmc: meson-gx: implement voltage switch callback mmc: meson-gx: use CCF to handle the clock phases mmc: meson-gx: implement card_busy callback mmc: meson-gx: simplify interrupt handler mmc: meson-gx: work around clk-stop issue mmc: meson-gx: fix dual data rate mode frequencies mmc: meson-gx: rework clock init function mmc: meson-gx: rework clk_set function mmc: meson-gx: rework set_ios function mmc: meson-gx: cfg init overwrite values mmc: meson-gx: initialize sane clk default before clock register ...
2017-09-07Merge branch 'fixes' into miscJames Bottomley1-2/+0
2017-09-07Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds13-68/+127
Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
2017-09-07Merge tag 'powerpc-4.14-1' of ↵Linus Torvalds2-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Nothing really major this release, despite quite a lot of activity. Just lots of things all over the place. Some things of note include: - Access via perf to a new type of PMU (IMC) on Power9, which can count both core events as well as nest unit events (Memory controller etc). - Optimisations to the radix MMU TLB flushing, mostly to avoid unnecessary Page Walk Cache (PWC) flushes when the structure of the tree is not changing. - Reworks/cleanups of do_page_fault() to modernise it and bring it closer to other architectures where possible. - Rework of our page table walking so that THP updates only need to send IPIs to CPUs where the affected mm has run, rather than all CPUs. - The size of our vmalloc area is increased to 56T on 64-bit hash MMU systems. This avoids problems with the percpu allocator on systems with very sparse NUMA layouts. - STRICT_KERNEL_RWX support on PPC32. - A new sched domain topology for Power9, to capture the fact that pairs of cores may share an L2 cache. - Power9 support for VAS, which is a new mechanism for accessing coprocessors, and initial support for using it with the NX compression accelerator. - Major work on the instruction emulation support, adding support for many new instructions, and reworking it so it can be used to implement the emulation needed to fixup alignment faults. - Support for guests under PowerVM to use the Power9 XIVE interrupt controller. And probably that many things again that are almost as interesting, but I had to keep the list short. Plus the usual fixes and cleanups as always. Thanks to: Alexey Kardashevskiy, Alistair Popple, Andreas Schwab, Aneesh Kumar K.V, Anju T Sudhakar, Arvind Yadav, Balbir Singh, Benjamin Herrenschmidt, Bhumika Goyal, Breno Leitao, Bryant G. Ly, Christophe Leroy, Cédric Le Goater, Dan Carpenter, Dou Liyang, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Hannes Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall, LABBE Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring, Masahiro Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Rashmica Gupta, Rob Herring, Rui Teng, Sam Bobroff, Santosh Sivaraj, Scott Wood, Shilpasri G Bhat, Sukadev Bhattiprolu, Suraj Jitindar Singh, Tobin C. Harding, Victor Aoqui" * tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (321 commits) powerpc/xive: Fix section __init warning powerpc: Fix kernel crash in emulation of vector loads and stores powerpc/xive: improve debugging macros powerpc/xive: add XIVE Exploitation Mode to CAS powerpc/xive: introduce H_INT_ESB hcall powerpc/xive: add the HW IRQ number under xive_irq_data powerpc/xive: introduce xive_esb_write() powerpc/xive: rename xive_poke_esb() in xive_esb_read() powerpc/xive: guest exploitation of the XIVE interrupt controller powerpc/xive: introduce a common routine xive_queue_page_alloc() powerpc/sstep: Avoid used uninitialized error axonram: Return directly after a failed kzalloc() in axon_ram_probe() axonram: Improve a size determination in axon_ram_probe() axonram: Delete an error message for a failed memory allocation in axon_ram_probe() powerpc/powernv/npu: Move tlb flush before launching ATSD powerpc/macintosh: constify wf_sensor_ops structures powerpc/iommu: Use permission-specific DEVICE_ATTR variants powerpc/eeh: Delete an error out of memory message at init time powerpc/mm: Use seq_putc() in two functions macintosh: Convert to using %pOF instead of full_name ...
2017-09-07Merge branch 'efi-core-for-linus' of ↵Linus Torvalds1-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "The main changes in this cycle were: - Transparently fall back to other poweroff method(s) if EFI poweroff fails (and returns) - Use separate PE/COFF section headers for the RX and RW parts of the ARM stub loader so that the firmware can use strict mapping permissions - Add support for requesting the firmware to wipe RAM at warm reboot - Increase the size of the random seed obtained from UEFI so CRNG fast init can complete earlier - Update the EFI framebuffer address if it points to a BAR that gets moved by the PCI resource allocation code - Enable "reset attack mitigation" of TPM environments: this is enabled if the kernel is configured with CONFIG_RESET_ATTACK_MITIGATION=y. - Clang related fixes - Misc cleanups, constification, refactoring, etc" * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi/bgrt: Use efi_mem_type() efi: Move efi_mem_type() to common code efi/reboot: Make function pointer orig_pm_power_off static efi/random: Increase size of firmware supplied randomness efi/libstub: Enable reset attack mitigation firmware/efi/esrt: Constify attribute_group structures firmware/efi: Constify attribute_group structures firmware/dcdbas: Constify attribute_group structures arm/efi: Split zImage code and data into separate PE/COFF sections arm/efi: Replace open coded constants with symbolic ones arm/efi: Remove pointless dummy .reloc section arm/efi: Remove forbidden values from the PE/COFF header drivers/fbdev/efifb: Allow BAR to be moved instead of claiming it efi/reboot: Fall back to original power-off method if EFI_RESET_SHUTDOWN returns efi/arm/arm64: Add missing assignment of efi.config_table efi/libstub/arm64: Set -fpie when building the EFI stub efi/libstub/arm64: Force 'hidden' visibility for section markers efi/libstub/arm64: Use hidden attribute for struct screen_info reference efi/arm: Don't mark ACPI reclaim memory as MEMBLOCK_NOMAP
2017-09-07Merge branch 'x86-platform-for-linus' of ↵Linus Torvalds1-16/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 platform updates from Ingo Molnar: "The main changes include various Hyper-V optimizations such as faster hypercalls and faster/better TLB flushes - and there's also some Intel-MID cleanups" * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tracing/hyper-v: Trace hyperv_mmu_flush_tlb_others() x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls x86/platform/intel-mid: Make several arrays static, to make code smaller MAINTAINERS: Add missed file for Hyper-V x86/hyper-v: Use hypercall for remote TLB flush hyper-v: Globalize vp_index x86/hyper-v: Implement rep hypercalls hyper-v: Use fast hypercall for HVCALL_SIGNAL_EVENT x86/hyper-v: Introduce fast hypercall implementation x86/hyper-v: Make hv_do_hypercall() inline x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set x86/platform/intel-mid: Make 'bt_sfi_data' const x86/platform/intel-mid: Make IRQ allocation a bit more flexible x86/platform/intel-mid: Group timers callbacks together
2017-09-07Merge branch 'for-4.14' of ↵Linus Torvalds2-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata updates from Tejun Heo: "Except for the ahci fix that fixes a boot issue, nothing major in this pull request. Some new platform controller support and device specific changes" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: libata: zpodd: make arrays cdb static, reduces object code size ahci: don't use MSI for devices with the silly Intel NVMe remapping scheme dt-bindings: ata: add DT bindings for MediaTek SATA controller ata: mediatek: add support for MediaTek SATA controller pata_octeon_cf: use of_property_read_{bool|u32}() cs5536: add support for IDE controller variant ata: sata_gemini: Introduce explicit IDE pin control ata: sata_gemini: Retire custom pin control ata: ahci_platform: Add shutdown handler ata: sata_gemini: explicitly request exclusive reset control ata: Drop unnecessary static ata: Convert to using %pOF instead of full_name
2017-09-07Merge branch 'for-4.14' of ↵Linus Torvalds2-6/+101
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Several notable changes this cycle: - Thread mode was merged. This will be used for cgroup2 support for CPU and possibly other controllers. Unfortunately, CPU controller cgroup2 support didn't make this pull request but most contentions have been resolved and the support is likely to be merged before the next merge window. - cgroup.stat now shows the number of descendant cgroups. - cpuset now can enable the easier-to-configure v2 behavior on v1 hierarchy" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) cpuset: Allow v2 behavior in v1 cgroup cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup cgroup: remove unneeded checks cgroup: misc changes cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy cgroup: re-use the parent pointer in cgroup_destroy_locked() cgroup: add cgroup.stat interface with basic hierarchy stats cgroup: implement hierarchy limits cgroup: keep track of number of descent cgroups cgroup: add comment to cgroup_enable_threaded() cgroup: remove unnecessary empty check when enabling threaded mode cgroup: update debug controller to print out thread mode information cgroup: implement cgroup v2 thread support cgroup: implement CSS_TASK_ITER_THREADED cgroup: introduce cgroup->dom_cgrp and threaded css_set handling cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS cgroup: reorganize cgroup.procs / task write path cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets cgroup: distinguish local and children populated states cgroup: remove now unused list_head @pending in cgroup_apply_cftypes() ...
2017-09-07Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-1/+1
Pull workqueue updates from Tejun Heo: "Nothing major. I introduced a flag collsion bug during v4.13 cycle which is fixed in this pull request. Fortunately, the flag is for debugging / verification and the bug isn't critical" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Fix flag collision workqueue: Use TASK_IDLE workqueue: fix path to documentation workqueue: doc change for ST behavior on NUMA systems
2017-09-07Merge branch 'for-4.14' of ↵Linus Torvalds1-1/+19
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "A lot of changes for percpu this time around. percpu inherited the same area allocator from the original pre-virtual-address-mapped implementation. This was from the time when percpu allocator wasn't used all that much and the implementation was focused on simplicity, with the unfortunate computational complexity of O(number of areas allocated from the chunk) per alloc / free. With the increase in percpu usage, we're hitting cases where the lack of scalability is hurting. The most prominent one right now is bpf perpcu map creation / destruction which may allocate and free a lot of entries consecutively and it's likely that the problem will become more prominent in the future. To address the issue, Dennis replaced the area allocator with hinted bitmap allocator which is more consistent. While the new allocator does perform a bit worse in some cases, it outperforms the old allocator way more than an order of magnitude in other more common scenarios while staying mostly flat in CPU overhead and completely flat in memory consumption" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (27 commits) percpu: update header to contain bitmap allocator explanation. percpu: update pcpu_find_block_fit to use an iterator percpu: use metadata blocks to update the chunk contig hint percpu: update free path to take advantage of contig hints percpu: update alloc path to only scan if contig hints are broken percpu: keep track of the best offset for contig hints percpu: skip chunks if the alloc does not fit in the contig hint percpu: add first_bit to keep track of the first free in the bitmap percpu: introduce bitmap metadata blocks percpu: replace area map allocator with bitmap percpu: generalize bitmap (un)populated iterators percpu: increase minimum percpu allocation size and align first regions percpu: introduce nr_empty_pop_pages to help empty page accounting percpu: change the number of pages marked in the first_chunk pop bitmap percpu: combine percpu address checks percpu: modify base_addr to be region specific percpu: setup_first_chunk rename schunk/dchunk to chunk percpu: end chunk area maps page aligned for the populated bitmap percpu: unify allocation of schunk and dchunk percpu: setup_first_chunk remove dyn_size and consolidate logic ...
2017-09-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds20-122/+172
Merge updates from Andrew Morton: - various misc bits - DAX updates - OCFS2 - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits) mm,fork: introduce MADV_WIPEONFORK x86,mpx: make mpx depend on x86-64 to free up VMA flag mm: add /proc/pid/smaps_rollup mm: hugetlb: clear target sub-page last when clearing huge page mm: oom: let oom_reap_task and exit_mmap run concurrently swap: choose swap device according to numa node mm: replace TIF_MEMDIE checks by tsk_is_oom_victim mm, oom: do not rely on TIF_MEMDIE for memory reserves access z3fold: use per-cpu unbuddied lists mm, swap: don't use VMA based swap readahead if HDD is used as swap mm, swap: add sysfs interface for VMA based swap readahead mm, swap: VMA based swap readahead mm, swap: fix swap readahead marking mm, swap: add swap readahead hit statistics mm/vmalloc.c: don't reinvent the wheel but use existing llist API mm/vmstat.c: fix wrong comment selftests/memfd: add memfd_create hugetlbfs selftest mm/shmem: add hugetlbfs support to memfd_create() mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas() ...
2017-09-07mm,fork: introduce MADV_WIPEONFORKRik van Riel1-1/+1
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty in the child process after fork. This differs from MADV_DONTFORK in one important way. If a child process accesses memory that was MADV_WIPEONFORK, it will get zeroes. The address ranges are still valid, they are just empty. If a child process accesses memory that was MADV_DONTFORK, it will get a segmentation fault, since those address ranges are no longer valid in the child after fork. Since MADV_DONTFORK also seems to be used to allow very large programs to fork in systems with strict memory overcommit restrictions, changing the semantics of MADV_DONTFORK might break existing programs. MADV_WIPEONFORK only works on private, anonymous VMAs. The use case is libraries that store or cache information, and want to know that they need to regenerate it in the child process after fork. Examples of this would be: - systemd/pulseaudio API checks (fail after fork) (replacing a getpid check, which is too slow without a PID cache) - PKCS#11 API reinitialization check (mandated by specification) - glibc's upcoming PRNG (reseed after fork) - OpenSSL PRNG (reseed after fork) The security benefits of a forking server having a re-inialized PRNG in every child process are pretty obvious. However, due to libraries having all kinds of internal state, and programs getting compiled with many different versions of each library, it is unreasonable to expect calling programs to re-initialize everything manually after fork. A further complication is the proliferation of clone flags, programs bypassing glibc's functions to call clone directly, and programs calling unshare, causing the glibc pthread_atfork hook to not get called. It would be better to have the kernel take care of this automatically. The patch also adds MADV_KEEPONFORK, to undo the effects of a prior MADV_WIPEONFORK. This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO: https://man.openbsd.org/minherit.2 [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines] Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Florian Weimer <fweimer@redhat.com> Reported-by: Colm MacCártaigh <colm@allcosts.net> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07x86,mpx: make mpx depend on x86-64 to free up VMA flagRik van Riel1-2/+6
Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4. If a child process accesses memory that was MADV_WIPEONFORK, it will get zeroes. The address ranges are still valid, they are just empty. If a child process accesses memory that was MADV_DONTFORK, it will get a segmentation fault, since those address ranges are no longer valid in the child after fork. Since MADV_DONTFORK also seems to be used to allow very large programs to fork in systems with strict memory overcommit restrictions, changing the semantics of MADV_DONTFORK might break existing programs. The use case is libraries that store or cache information, and want to know that they need to regenerate it in the child process after fork. Examples of this would be: - systemd/pulseaudio API checks (fail after fork) (replacing a getpid check, which is too slow without a PID cache) - PKCS#11 API reinitialization check (mandated by specification) - glibc's upcoming PRNG (reseed after fork) - OpenSSL PRNG (reseed after fork) The security benefits of a forking server having a re-inialized PRNG in every child process are pretty obvious. However, due to libraries having all kinds of internal state, and programs getting compiled with many different versions of each library, it is unreasonable to expect calling programs to re-initialize everything manually after fork. A further complication is the proliferation of clone flags, programs bypassing glibc's functions to call clone directly, and programs calling unshare, causing the glibc pthread_atfork hook to not get called. It would be better to have the kernel take care of this automatically. The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior MADV_WIPEONFORK. This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO: https://man.openbsd.org/minherit.2 This patch (of 2): MPX only seems to be available on 64 bit CPUs, starting with Skylake and Goldmont. Move VM_MPX into the 64 bit only portion of vma->vm_flags, in order to free up a VMA flag. Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.com Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Colm MacCártaigh <colm@allcosts.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm: hugetlb: clear target sub-page last when clearing huge pageHuang Ying1-1/+1
Huge page helps to reduce TLB miss rate, but it has higher cache footprint, sometimes this may cause some issue. For example, when clearing huge page on x86_64 platform, the cache footprint is 2M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC (last level cache). That is, in average, there are 2.5M LLC for each core and 1.25M LLC for each thread. If the cache pressure is heavy when clearing the huge page, and we clear the huge page from the begin to the end, it is possible that the begin of huge page is evicted from the cache after we finishing clearing the end of the huge page. And it is possible for the application to access the begin of the huge page after clearing the huge page. To help the above situation, in this patch, when we clear a huge page, the order to clear sub-pages is changed. In quite some situation, we can get the address that the application will access after we clear the huge page, for example, in a page fault handler. Instead of clearing the huge page from begin to end, we will clear the sub-pages farthest from the the sub-page to access firstly, and clear the sub-page to access last. This will make the sub-page to access most cache-hot and sub-pages around it more cache-hot too. If we cannot know the address the application will access, the begin of the huge page is assumed to be the the address the application will access. With this patch, the throughput increases ~28.3% in vm-scalability anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699 system (36 cores, 72 threads). The test case creates 72 processes, each process mmap a big anonymous memory area and writes to it from the begin to the end. For each process, other processes could be seen as other workload which generates heavy cache pressure. At the same time, the cache miss rate reduced from ~33.4% to ~31.7%, the IPC (instruction per cycle) increased from 0.56 to 0.74, and the time spent in user space is reduced ~7.9% Christopher Lameter suggests to clear bytes inside a sub-page from end to begin too. But tests show no visible performance difference in the tests. May because the size of page is small compared with the cache size. Thanks Andi Kleen to propose to use address to access to determine the order of sub-pages to clear. The hugetlbfs access address could be improved, will do that in another patch. [ying.huang@intel.com: improve readability of clear_huge_page()] Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com Suggested-by: Andi Kleen <andi.kleen@intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Nadia Yvette Chambers <nyc@holomorphy.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm: oom: let oom_reap_task and exit_mmap run concurrentlyAndrea Arcangeli1-6/+0
This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f179d1 ("oom: keep mm of the killed task available") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07swap: choose swap device according to numa nodeAaron Lu1-1/+1
If the system has more than one swap device and swap device has the node information, we can make use of this information to decide which swap device to use in get_swap_pages() to get better performance. The current code uses a priority based list, swap_avail_list, to decide which swap device to use and if multiple swap devices share the same priority, they are used round robin. This patch changes the previous single global swap_avail_list into a per-numa-node list, i.e. for each numa node, it sees its own priority based list of available swap devices. Swap device's priority can be promoted on its matching node's swap_avail_list. The current swap device's priority is set as: user can set a >=0 value, or the system will pick one starting from -1 then downwards. The priority value in the swap_avail_list is the negated value of the swap device's due to plist being sorted from low to high. The new policy doesn't change the semantics for priority >=0 cases, the previous starting from -1 then downwards now becomes starting from -2 then downwards and -1 is reserved as the promoted value. Take 4-node EX machine as an example, suppose 4 swap devices are available, each sit on a different node: swapA on node 0 swapB on node 1 swapC on node 2 swapD on node 3 After they are all swapped on in the sequence of ABCD. Current behaviour: their priorities will be: swapA: -1 swapB: -2 swapC: -3 swapD: -4 And their position in the global swap_avail_list will be: swapA -> swapB -> swapC -> swapD prio:1 prio:2 prio:3 prio:4 New behaviour: their priorities will be(note that -1 is skipped): swapA: -2 swapB: -3 swapC: -4 swapD: -5 And their positions in the 4 swap_avail_lists[nid] will be: swap_avail_lists[0]: /* node 0's available swap device list */ swapA -> swapB -> swapC -> swapD prio:1 prio:3 prio:4 prio:5 swap_avali_lists[1]: /* node 1's available swap device list */ swapB -> swapA -> swapC -> swapD prio:1 prio:2 prio:4 prio:5 swap_avail_lists[2]: /* node 2's available swap device list */ swapC -> swapA -> swapB -> swapD prio:1 prio:2 prio:3 prio:5 swap_avail_lists[3]: /* node 3's available swap device list */ swapD -> swapA -> swapB -> swapC prio:1 prio:2 prio:3 prio:4 To see the effect of the patch, a test that starts N process, each mmap a region of anonymous memory and then continually write to it at random position to trigger both swap in and out is used. On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives are used as swap devices with each attached to a different node, the result is: runtime=30m/processes=32/total test size=128G/each process mmap region=4G kernel throughput vanilla 13306 auto-binding 15169 +14% runtime=30m/processes=64/total test size=128G/each process mmap region=2G kernel throughput vanilla 11885 auto-binding 14879 +25% [aaron.lu@intel.com: v2] Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com [akpm@linux-foundation.org: use kmalloc_array()] Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm, swap: don't use VMA based swap readahead if HDD is used as swapHuang Ying1-5/+6
VMA based swap readahead will readahead the virtual pages that is continuous in the virtual address space. While the original swap readahead will readahead the swap slots that is continuous in the swap device. Although VMA based swap readahead is more correct for the swap slots to be readahead, it will trigger more small random readings, which may cause the performance of HDD (hard disk) to degrade heavily, and may finally exceed the benefit. To avoid the issue, in this patch, if the HDD is used as swap, the VMA based swap readahead will be disabled, and the original swap readahead will be used instead. Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm, swap: VMA based swap readaheadHuang Ying2-2/+56
The swap readahead is an important mechanism to reduce the swap in latency. Although pure sequential memory access pattern isn't very popular for anonymous memory, the space locality is still considered valid. In the original swap readahead implementation, the consecutive blocks in swap device are readahead based on the global space locality estimation. But the consecutive blocks in swap device just reflect the order of page reclaiming, don't necessarily reflect the access pattern in virtual memory. And the different tasks in the system may have different access patterns, which makes the global space locality estimation incorrect. In this patch, when page fault occurs, the virtual pages near the fault address will be readahead instead of the swap slots near the fault swap slot in swap device. This avoid to readahead the unrelated swap slots. At the same time, the swap readahead is changed to work on per-VMA from globally. So that the different access patterns of the different VMAs could be distinguished, and the different readahead policy could be applied accordingly. The original core readahead detection and scaling algorithm is reused, because it is an effect algorithm to detect the space locality. The test and result is as follow, Common test condition ===================== Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device: NVMe disk Micro-benchmark with combined access pattern ============================================ vm-scalability, sequential swap test case, 4 processes to eat 50G virtual memory space, repeat the sequential memory writing until 300 seconds. The first round writing will trigger swap out, the following rounds will trigger sequential swap in and out. At the same time, run vm-scalability random swap test case in background, 8 processes to eat 30G virtual memory space, repeat the random memory write until 300 seconds. This will trigger random swap-in in the background. This is a combined workload with sequential and random memory accessing at the same time. The result (for sequential workload) is as follow, Base Optimized ---- --------- throughput 345413 KB/s 414029 KB/s (+19.9%) latency.average 97.14 us 61.06 us (-37.1%) latency.50th 2 us 1 us latency.60th 2 us 1 us latency.70th 98 us 2 us latency.80th 160 us 2 us latency.90th 260 us 217 us latency.95th 346 us 369 us latency.99th 1.34 ms 1.09 ms ra_hit% 52.69% 99.98% The original swap readahead algorithm is confused by the background random access workload, so readahead hit rate is lower. The VMA-base readahead algorithm works much better. Linpack ======= The test memory size is bigger than RAM to trigger swapping. Base Optimized ---- --------- elapsed_time 393.49 s 329.88 s (-16.2%) ra_hit% 86.21% 98.82% The score of base and optimized kernel hasn't visible changes. But the elapsed time reduced and readahead hit rate improved, so the optimized kernel runs better for startup and tear down stages. And the absolute value of readahead hit rate is high, shows that the space locality is still valid in some practical workloads. Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm, swap: add swap readahead hit statisticsHuang Ying1-0/+4
Patch series "mm, swap: VMA based swap readahead", v4. The swap readahead is an important mechanism to reduce the swap in latency. Although pure sequential memory access pattern isn't very popular for anonymous memory, the space locality is still considered valid. In the original swap readahead implementation, the consecutive blocks in swap device are readahead based on the global space locality estimation. But the consecutive blocks in swap device just reflect the order of page reclaiming, don't necessarily reflect the access pattern in virtual memory space. And the different tasks in the system may have different access patterns, which makes the global space locality estimation incorrect. In this patchset, when page fault occurs, the virtual pages near the fault address will be readahead instead of the swap slots near the fault swap slot in swap device. This avoid to readahead the unrelated swap slots. At the same time, the swap readahead is changed to work on per-VMA from globally. So that the different access patterns of the different VMAs could be distinguished, and the different readahead policy could be applied accordingly. The original core readahead detection and scaling algorithm is reused, because it is an effect algorithm to detect the space locality. In addition to the swap readahead changes, some new sysfs interface is added to show the efficiency of the readahead algorithm and some other swap statistics. This new implementation will incur more small random read, on SSD, the improved correctness of estimation and readahead target should beat the potential increased overhead, this is also illustrated in the test results below. But on HDD, the overhead may beat the benefit, so the original implementation will be used by default. The test and result is as follow, Common test condition ===================== Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device: NVMe disk Micro-benchmark with combined access pattern ============================================ vm-scalability, sequential swap test case, 4 processes to eat 50G virtual memory space, repeat the sequential memory writing until 300 seconds. The first round writing will trigger swap out, the following rounds will trigger sequential swap in and out. At the same time, run vm-scalability random swap test case in background, 8 processes to eat 30G virtual memory space, repeat the random memory write until 300 seconds. This will trigger random swap-in in the background. This is a combined workload with sequential and random memory accessing at the same time. The result (for sequential workload) is as follow, Base Optimized ---- --------- throughput 345413 KB/s 414029 KB/s (+19.9%) latency.average 97.14 us 61.06 us (-37.1%) latency.50th 2 us 1 us latency.60th 2 us 1 us latency.70th 98 us 2 us latency.80th 160 us 2 us latency.90th 260 us 217 us latency.95th 346 us 369 us latency.99th 1.34 ms 1.09 ms ra_hit% 52.69% 99.98% The original swap readahead algorithm is confused by the background random access workload, so readahead hit rate is lower. The VMA-base readahead algorithm works much better. Linpack ======= The test memory size is bigger than RAM to trigger swapping. Base Optimized ---- --------- elapsed_time 393.49 s 329.88 s (-16.2%) ra_hit% 86.21% 98.82% The score of base and optimized kernel hasn't visible changes. But the elapsed time reduced and readahead hit rate improved, so the optimized kernel runs better for startup and tear down stages. And the absolute value of readahead hit rate is high, shows that the space locality is still valid in some practical workloads. This patch (of 5): The statistics for total readahead pages and total readahead hits are recorded and exported via the following sysfs interface. /sys/kernel/mm/swap/ra_hits /sys/kernel/mm/swap/ra_total With them, the efficiency of the swap readahead could be measured, so that the swap readahead algorithm and parameters could be tuned accordingly. [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n] Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm: rename global_page_state to global_zone_page_stateMichal Hocko2-4/+4
global_page_state is error prone as a recent bug report pointed out [1]. It only returns proper values for zone based counters as the enum it gets suggests. We already have global_node_page_state so let's rename global_page_state to global_zone_page_state to be more explicit here. All existing users seems to be correct: $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c 2 NR_BOUNCE 2 NR_FREE_CMA_PAGES 11 NR_FREE_PAGES 1 NR_KERNEL_STACK_KB 1 NR_MLOCK 2 NR_PAGETABLE This patch shouldn't introduce any functional change. [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm: shm: use new hugetlb size encoding definitionsMike Kravetz1-17/+0
Use the common definitions from hugetlb_encode.h header file for encoding hugetlb size definitions in shmget system call flags. In addition, move these definitions from the internal (kernel) to user (uapi) header file. Link: http://lkml.kernel.org/r/1501527386-10736-4-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07include/linux/fs.h: remove unneeded forward definition of mm_structJeff Layton1-2/+0
Link: http://lkml.kernel.org/r/20170525102927.6163-1-jlayton@redhat.com Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd supportMike Rapoport1-0/+6
shmem_mfill_zeropage_pte is the low level routine that implements the userfaultfd UFFDIO_ZEROPAGE command. Since for shmem mappings zero pages are always allocated and accounted, the new method is a slight extension of the existing shmem_mcopy_atomic_pte. Link: http://lkml.kernel.org/r/1497939652-16528-4-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm, THP, swap: add THP swapping out fallback countingHuang Ying1-0/+1
When swapping out THP (Transparent Huge Page), instead of swapping out the THP as a whole, sometimes we have to fallback to split the THP into normal pages before swapping, because no free swap clusters are available, or cgroup limit is exceeded, etc. To count the number of the fallback, a new VM event THP_SWPOUT_FALLBACK is added, and counted when we fallback to split the THP. Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm, THP, swap: support splitting THP for THP swap outHuang Ying1-0/+9
After adding swapping out support for THP (Transparent Huge Page), it is possible that a THP in swap cache (partly swapped out) need to be split. To split such a THP, the swap cluster backing the THP need to be split too, that is, the CLUSTER_FLAG_HUGE flag need to be cleared for the swap cluster. The patch implemented this. And because the THP swap writing needs the THP keeps as huge page during writing. The PageWriteback flag is checked before splitting. Link: http://lkml.kernel.org/r/20170724051840.2309-8-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Michal Hocko <mhocko@kernel.org> Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm: test code to write THP to swap device as a wholeHuang Ying3-2/+11
To support delay splitting THP (Transparent Huge Page) after swapped out, we need to enhance swap writing code to support to write a THP as a whole. This will improve swap write IO performance. As Ming Lei <ming.lei@redhat.com> pointed out, this should be based on multipage bvec support, which hasn't been merged yet. So this patch is only for testing the functionality of the other patches in the series. And will be reimplemented after multipage bvec support is merged. Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Shaohua Li <shli@kernel.org> Cc: Vishal L Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm, THP, swap: make reuse_swap_page() works for THP swapped outHuang Ying1-2/+2
After supporting to delay THP (Transparent Huge Page) splitting after swapped out, it is possible that some page table mappings of the THP are turned into swap entries. So reuse_swap_page() need to check the swap count in addition to the map count as before. This patch done that. In the huge PMD write protect fault handler, in addition to the page map count, the swap count need to be checked too, so the page lock need to be acquired too when calling reuse_swap_page() in addition to the page table lock. [ying.huang@intel.com: silence a compiler warning] Link: http://lkml.kernel.org/r/87bmnzizjy.fsf@yhuang-dev.intel.com Link: http://lkml.kernel.org/r/20170724051840.2309-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Michal Hocko <mhocko@kernel.org> Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm, THP, swap: support to reclaim swap space for THP swapped outHuang Ying1-0/+1
The normal swap slot reclaiming can be done when the swap count reaches SWAP_HAS_CACHE. But for the swap slot which is backing a THP, all swap slots backing one THP must be reclaimed together, because the swap slot may be used again when the THP is swapped out again later. So the swap slots backing one THP can be reclaimed together when the swap count for all swap slots for the THP reached SWAP_HAS_CACHE. In the patch, the functions to check whether the swap count for all swap slots backing one THP reached SWAP_HAS_CACHE are implemented and used when checking whether a swap slot can be reclaimed. To make it easier to determine whether a swap slot is backing a THP, a new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap cluster which is backing a THP (Transparent Huge Page). Because THP swap in as a whole isn't supported now. After deleting the THP from the swap cache (for example, swapping out finished), the CLUSTER_FLAG_HUGE flag will be cleared. So that, the normal pages inside THP can be swapped in individually. [ying.huang@intel.com: fix swap_page_trans_huge_swapped on HDD] Link: http://lkml.kernel.org/r/874ltsm0bi.fsf@yhuang-dev.intel.com Link: http://lkml.kernel.org/r/20170724051840.2309-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Michal Hocko <mhocko@kernel.org> Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <vishal.l.verma@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm: memcontrol: use int for event/state parameter in several functionsMatthias Kaehlcke1-20/+32
Several functions use an enum type as parameter for an event/state, but are called in some locations with an argument of a different enum type. Adjust the interface of these functions to reality by changing the parameter to int. This fixes a ton of enum-conversion warnings that are generated when building the kernel with clang. [mka@chromium.org: also change parameter type of inc/dec/mod_memcg_page_state()] Link: http://lkml.kernel.org/r/20170728213442.93823-1-mka@chromium.org Link: http://lkml.kernel.org/r/20170727211004.34435-1-mka@chromium.org Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Doug Anderson <dianders@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07mm: remove nr_pages argument from pagevec_lookup{,_range}()Jan Kara1-4/+3
All users of pagevec_lookup() and pagevec_lookup_range() now pass PAGEVEC_SIZE as a desired number of pages. Just drop the argument. Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>