<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/dax, branch v4.11.5</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v4.11.5</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v4.11.5'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2017-05-20T12:49:48+00:00</updated>
<entry>
<title>device-dax: fix sysfs attribute deadlock</title>
<updated>2017-05-20T12:49:48+00:00</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2017-04-30T13:57:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=55dbfd7dcd7172681f1c2397b6f6c5276fc4763a'/>
<id>urn:sha1:55dbfd7dcd7172681f1c2397b6f6c5276fc4763a</id>
<content type='text'>
commit 565851c972b50612f3a4542e26879ffb3e906fc2 upstream.

Usage of device_lock() for dax_region attributes is unnecessary and
deadlock prone. It's unnecessary because the order of registration /
un-registration guarantees that drvdata is always valid. It's deadlock
prone because it sets up this situation:

 ndctl           D    0  2170   2082 0x00000000
 Call Trace:
  __schedule+0x31f/0x980
  schedule+0x3d/0x90
  schedule_preempt_disabled+0x15/0x20
  __mutex_lock+0x402/0x980
  ? __mutex_lock+0x158/0x980
  ? align_show+0x2b/0x80 [dax]
  ? kernfs_seq_start+0x2f/0x90
  mutex_lock_nested+0x1b/0x20
  align_show+0x2b/0x80 [dax]
  dev_attr_show+0x20/0x50

 ndctl           D    0  2186   2079 0x00000000
 Call Trace:
  __schedule+0x31f/0x980
  schedule+0x3d/0x90
  __kernfs_remove+0x1f6/0x340
  ? kernfs_remove_by_name_ns+0x45/0xa0
  ? remove_wait_queue+0x70/0x70
  kernfs_remove_by_name_ns+0x45/0xa0
  remove_files.isra.1+0x35/0x70
  sysfs_remove_group+0x44/0x90
  sysfs_remove_groups+0x2e/0x50
  dax_region_unregister+0x25/0x40 [dax]
  devm_action_release+0xf/0x20
  release_nodes+0x16d/0x2b0
  devres_release_all+0x3c/0x60
  device_release_driver_internal+0x17d/0x220
  device_release_driver+0x12/0x20
  unbind_store+0x112/0x160

ndctl/2170 is trying to acquire the device_lock() to read an attribute,
and ndctl/2186 is holding the device_lock() while trying to drain all
active attribute readers.

Thanks to Yi Zhang for the reproduction script.

Fixes: d7fe1a67f658 ("dax: add region 'id', 'size', and 'align' attributes")
Reported-by: Yi Zhang &lt;yizhan@redhat.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>device-dax: fix cdev leak</title>
<updated>2017-05-20T12:49:48+00:00</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2017-03-17T18:48:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8931477790743c6c2593948b8dddb452171eb3a8'/>
<id>urn:sha1:8931477790743c6c2593948b8dddb452171eb3a8</id>
<content type='text'>
commit ed01e50acdd3e4a640cf9ebd28a7e810c3ceca97 upstream.

If device_add() fails, cleanup the cdev. Otherwise, we leak a kobj_map()
with a stale device number.

As Jason points out, there is a small possibility that userspace has
opened and mapped the device in the time between cdev_add() and the
device_add() failure. We need a new kill_dax_dev() helper to invalidate
any established mappings.

Fixes: ba09c01d2fa8 ("dax: convert to the cdev api")
Reported-by: Jason Gunthorpe &lt;jgunthorpe@obsidianresearch.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Logan Gunthorpe &lt;logang@deltatee.com&gt;
Reviewed-by: Johannes Thumshirn &lt;jthumshirn@suse.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>device-dax: switch to srcu, fix rcu_read_lock() vs pte allocation</title>
<updated>2017-04-12T20:45:18+00:00</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2017-04-07T23:42:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=956a4cd2c957acf638ff29951aabaa9d8e92bbc2'/>
<id>urn:sha1:956a4cd2c957acf638ff29951aabaa9d8e92bbc2</id>
<content type='text'>
The following warning triggers with a new unit test that stresses the
device-dax interface.

 ===============================
 [ ERR: suspicious RCU usage.  ]
 4.11.0-rc4+ #1049 Tainted: G           O
 -------------------------------
 ./include/linux/rcupdate.h:521 Illegal context switch in RCU read-side critical section!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 0
 2 locks held by fio/9070:
  #0:  (&amp;mm-&gt;mmap_sem){++++++}, at: [&lt;ffffffff8d0739d7&gt;] __do_page_fault+0x167/0x4f0
  #1:  (rcu_read_lock){......}, at: [&lt;ffffffffc03fbd02&gt;] dax_dev_huge_fault+0x32/0x620 [dax]

 Call Trace:
  dump_stack+0x86/0xc3
  lockdep_rcu_suspicious+0xd7/0x110
  ___might_sleep+0xac/0x250
  __might_sleep+0x4a/0x80
  __alloc_pages_nodemask+0x23a/0x360
  alloc_pages_current+0xa1/0x1f0
  pte_alloc_one+0x17/0x80
  __pte_alloc+0x1e/0x120
  __get_locked_pte+0x1bf/0x1d0
  insert_pfn.isra.70+0x3a/0x100
  ? lookup_memtype+0xa6/0xd0
  vm_insert_mixed+0x64/0x90
  dax_dev_huge_fault+0x520/0x620 [dax]
  ? dax_dev_huge_fault+0x32/0x620 [dax]
  dax_dev_fault+0x10/0x20 [dax]
  __do_fault+0x1e/0x140
  __handle_mm_fault+0x9af/0x10d0
  handle_mm_fault+0x16d/0x370
  ? handle_mm_fault+0x47/0x370
  __do_page_fault+0x28c/0x4f0
  trace_do_page_fault+0x58/0x2a0
  do_async_page_fault+0x1a/0xa0
  async_page_fault+0x28/0x30

Inserting a page table entry may trigger an allocation while we are
holding a read lock to keep the device instance alive for the duration
of the fault. Use srcu for this keep-alive protection.

Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
</content>
</entry>
<entry>
<title>device-dax: fix debug output typo</title>
<updated>2017-03-11T03:56:56+00:00</updated>
<author>
<name>Dave Jiang</name>
<email>dave.jiang@intel.com</email>
</author>
<published>2017-03-09T23:56:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=52084f89b38cdd896b59627c629915ef1a7bf615'/>
<id>urn:sha1:52084f89b38cdd896b59627c629915ef1a7bf615</id>
<content type='text'>
The debug output for return the return data of pgoff_to_phys() in the
fault handlers has 'phys' and 'pgoff' incorrectly swapped.

Reported-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Signed-off-by: Dave Jiang &lt;dave.jiang@intel.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
</content>
</entry>
<entry>
<title>device-dax: fix pud fault fallback handling</title>
<updated>2017-03-11T03:55:53+00:00</updated>
<author>
<name>Dave Jiang</name>
<email>dave.jiang@intel.com</email>
</author>
<published>2017-03-10T20:24:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=70b085b06c4560a69e95607f77bb4c2b2e41943c'/>
<id>urn:sha1:70b085b06c4560a69e95607f77bb4c2b2e41943c</id>
<content type='text'>
Jeff Moyer reports:

    With a device dax alignment of 4KB or 2MB, I get sigbus when running
    the attached fio job file for the current kernel (4.11.0-rc1+).  If
    I specify an alignment of 1GB, it works.

    I turned on debug output, and saw that it was failing in the huge
    fault code.

     dax dax1.0: dax_open
     dax dax1.0: dax_mmap
     dax dax1.0: dax_dev_huge_fault: fio: write (0x7f08f0a00000 -
     dax dax1.0: __dax_dev_pud_fault: phys_to_pgoff(0xffffffffcf60)
     dax dax1.0: dax_release

    fio config for reproduce:
    [global]
    ioengine=dev-dax
    direct=0
    filename=/dev/dax0.0
    bs=2m

    [write]
    rw=write

    [read]
    stonewall
    rw=read

The driver fails to fallback when taking a fault that is larger than
the device alignment, or handling a larger fault when a smaller
mapping is already established. While we could support larger
mappings for a device with a smaller alignment, that change is
too large for the immediate fix. The simplest change is to force
fallback until the fault size matches the alignment.

Reported-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Signed-off-by: Dave Jiang &lt;dave.jiang@intel.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
</content>
</entry>
<entry>
<title>device-dax: fix pmd/pte fault fallback handling</title>
<updated>2017-03-11T02:15:04+00:00</updated>
<author>
<name>Dave Jiang</name>
<email>dave.jiang@intel.com</email>
</author>
<published>2017-03-10T20:24:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0134ed4fb9e78672ee9f7b18007114404c81e63f'/>
<id>urn:sha1:0134ed4fb9e78672ee9f7b18007114404c81e63f</id>
<content type='text'>
Jeff Moyer reports:

    With a device dax alignment of 4KB or 2MB, I get sigbus when running
    the attached fio job file for the current kernel (4.11.0-rc1+).  If
    I specify an alignment of 1GB, it works.

    I turned on debug output, and saw that it was failing in the huge
    fault code.

     dax dax1.0: dax_open
     dax dax1.0: dax_mmap
     dax dax1.0: dax_dev_huge_fault: fio: write (0x7f08f0a00000 -
     dax dax1.0: __dax_dev_pud_fault: phys_to_pgoff(0xffffffffcf60
     dax dax1.0: dax_release

    fio config for reproduce:
    [global]
    ioengine=dev-dax
    direct=0
    filename=/dev/dax0.0
    bs=2m

    [write]
    rw=write

    [read]
    stonewall
    rw=read

The driver fails to fallback when taking a fault that is larger than
the device alignment, or handling a larger fault when a smaller
mapping is already established. While we could support larger
mappings for a device with a smaller alignment, that change is
too large for the immediate fix. The simplest change is to force
fallback until the fault size matches the alignment.

Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Cc: &lt;stable@vger.kernel.org&gt;
Reported-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Signed-off-by: Dave Jiang &lt;dave.jiang@intel.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
</content>
</entry>
<entry>
<title>sched/headers: Prepare to remove the &lt;linux/magic.h&gt; include from &lt;linux/sched/task_stack.h&gt;</title>
<updated>2017-03-02T07:42:40+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2017-02-05T15:03:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=50d34394cee68dd12c5e01fff073d1167700bfce'/>
<id>urn:sha1:50d34394cee68dd12c5e01fff073d1167700bfce</id>
<content type='text'>
Update files that depend on the magic.h inclusion.

Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm: replace FAULT_FLAG_SIZE with parameter to huge_fault</title>
<updated>2017-02-25T01:46:54+00:00</updated>
<author>
<name>Dave Jiang</name>
<email>dave.jiang@intel.com</email>
</author>
<published>2017-02-24T22:57:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c791ace1e747371658237f0d30234fef56c39669'/>
<id>urn:sha1:c791ace1e747371658237f0d30234fef56c39669</id>
<content type='text'>
Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has
been somewhat painful with getting the flags set and removed at the
correct locations.  More than one kernel oops was introduced due to
difficulties of getting the placement correctly.

Remove the flag values and introduce an input parameter to huge_fault
that indicates the size of the page entry.  This makes the code easier
to trace and should avoid the issues we see with the fault flags where
removal of the flag was necessary in the fallback paths.

Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang &lt;dave.jiang@intel.com&gt;
Tested-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;
Cc: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Nilesh Choudhury &lt;nilesh.choudhury@oracle.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>dax: support for transparent PUD pages for device DAX</title>
<updated>2017-02-25T01:46:54+00:00</updated>
<author>
<name>Dave Jiang</name>
<email>dave.jiang@intel.com</email>
</author>
<published>2017-02-24T22:57:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9557feee39b75ceb502b4777e08706df1ddf10ed'/>
<id>urn:sha1:9557feee39b75ceb502b4777e08706df1ddf10ed</id>
<content type='text'>
Add transparent huge PUD pages support for device DAX by adding a
pud_fault handler.

Link: http://lkml.kernel.org/r/148545060002.17912.6765687780007547551.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang &lt;dave.jiang@intel.com&gt;
Cc: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Jan Kara &lt;jack@suse.com&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;
Cc: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Nilesh Choudhury &lt;nilesh.choudhury@oracle.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm,fs,dax: change -&gt;pmd_fault to -&gt;huge_fault</title>
<updated>2017-02-25T01:46:54+00:00</updated>
<author>
<name>Dave Jiang</name>
<email>dave.jiang@intel.com</email>
</author>
<published>2017-02-24T22:56:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a2d581675d485eb7188f521f36efc114639a3096'/>
<id>urn:sha1:a2d581675d485eb7188f521f36efc114639a3096</id>
<content type='text'>
Patch series "1G transparent hugepage support for device dax", v2.

The following series implements support for 1G trasparent hugepage on
x86 for device dax.  The bulk of the code was written by Mathew Wilcox a
while back supporting transparent 1G hugepage for fs DAX.  I have
forward ported the relevant bits to 4.10-rc.  The current submission has
only the necessary code to support device DAX.

Comments from Dan Williams: So the motivation and intended user of this
functionality mirrors the motivation and users of 1GB page support in
hugetlbfs.  Given expected capacities of persistent memory devices an
in-memory database may want to reduce tlb pressure beyond what they can
already achieve with 2MB mappings of a device-dax file.  We have
customer feedback to that effect as Willy mentioned in his previous
version of these patches [1].

[1]: https://lkml.org/lkml/2016/1/31/52

Comments from Nilesh @ Oracle:

There are applications which have a process model; and if you assume
10,000 processes attempting to mmap all the 6TB memory available on a
server; we are looking at the following:

processes         : 10,000
memory            :    6TB
pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB
pmd @ 2M page size: 120,000 / 512 = ~240GB
pud @ 1G page size: 240GB / 512 = ~480MB

As you can see with 2M pages, this system will use up an exorbitant
amount of DRAM to hold the page tables; but the 1G pages finally brings
it down to a reasonable level.  Memory sizes will keep increasing; so
this number will keep increasing.

An argument can be made to convert the applications from process model
to thread model, but in the real world that may not be always practical.
Hopefully this helps explain the use case where this is valuable.

This patch (of 3):

In preparation for adding the ability to handle PUD pages, convert
vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault.  The
vm_fault structure is extended to include a union of the different page
table pointers that may be needed, and three flag bits are reserved to
indicate which type of pointer is in the union.

[ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
  Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
[dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
  Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
Signed-off-by: Dave Jiang &lt;dave.jiang@intel.com&gt;
Signed-off-by: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Jan Kara &lt;jack@suse.com&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Nilesh Choudhury &lt;nilesh.choudhury@oracle.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Dave Jiang &lt;dave.jiang@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
