summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2019-11-16Merge branch 'smc-last-part-of-termination-improvements'David S. Miller6-8/+61
Karsten Graul says: ==================== last part of termination improvements Patches 1 and 2 finish the set of termination patches, introducing a reboot handler that terminates all link groups. Patch 3 adds an rcu_barrier before the module is unloaded, and patch 4 is cleanup. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16net/smc: remove unused constantUrsula Braun1-2/+0
Constant SMC_CLOSE_WAIT_LISTEN_CLCSOCK_TIME is defined, but since commit 3d502067599f ("net/smc: simplify wait when closing listen socket") no longer used. Remove it. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16net/smc: use rcu_barrier() on module unloadUrsula Braun1-0/+2
Add rcu_barrier() to make sure no RCU readers or callbacks are pending when the module is unloaded. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16net/smc: guarantee removal of link groups in rebootUrsula Braun1-1/+15
When rebooting it should be guaranteed all link groups are cleaned up and freed. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16net/smc: introduce bookkeeping of SMCR link groupsUrsula Braun5-6/+45
If the smc module is unloaded return control from exit routine only, if all link groups are freed. If an IB device is thrown away return control from device removal only, if all link groups belonging to this device are freed. Counters for the total number of SMCR link groups and for the total number of SMCR links per IB device are introduced. smc module unloading continues only if the total number of SMCR link groups is zero. IB device removal continues only it the total number of SMCR links per IB device has decreased to zero. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16net: dsa: tag_8021q: Fix dsa_8021q_restore_pvid for an absent pvidVladimir Oltean1-1/+1
This sequence of operations: ip link set dev br0 type bridge vlan_filtering 1 bridge vlan del dev swp2 vid 1 ip link set dev br0 type bridge vlan_filtering 1 ip link set dev br0 type bridge vlan_filtering 0 apparently fails with the message: [ 31.305716] sja1105 spi0.1: Reset switch and programmed static config. Reason: VLAN filtering [ 31.322161] sja1105 spi0.1: Couldn't determine PVID attributes (pvid 0) [ 31.328939] sja1105 spi0.1: Failed to setup VLAN tagging for port 1: -2 [ 31.335599] ------------[ cut here ]------------ [ 31.340215] WARNING: CPU: 1 PID: 194 at net/switchdev/switchdev.c:157 switchdev_port_attr_set_now+0x9c/0xa4 [ 31.349981] br0: Commit of attribute (id=6) failed. [ 31.354890] Modules linked in: [ 31.357942] CPU: 1 PID: 194 Comm: ip Not tainted 5.4.0-rc6-01792-gf4f632e07665-dirty #2062 [ 31.366167] Hardware name: Freescale LS1021A [ 31.370437] [<c03144dc>] (unwind_backtrace) from [<c030e184>] (show_stack+0x10/0x14) [ 31.378153] [<c030e184>] (show_stack) from [<c11d1c1c>] (dump_stack+0xe0/0x10c) [ 31.385437] [<c11d1c1c>] (dump_stack) from [<c034c730>] (__warn+0xf4/0x10c) [ 31.392373] [<c034c730>] (__warn) from [<c034c7bc>] (warn_slowpath_fmt+0x74/0xb8) [ 31.399827] [<c034c7bc>] (warn_slowpath_fmt) from [<c11ca204>] (switchdev_port_attr_set_now+0x9c/0xa4) [ 31.409097] [<c11ca204>] (switchdev_port_attr_set_now) from [<c117036c>] (__br_vlan_filter_toggle+0x6c/0x118) [ 31.418971] [<c117036c>] (__br_vlan_filter_toggle) from [<c115d010>] (br_changelink+0xf8/0x518) [ 31.427637] [<c115d010>] (br_changelink) from [<c0f8e9ec>] (__rtnl_newlink+0x3f4/0x76c) [ 31.435613] [<c0f8e9ec>] (__rtnl_newlink) from [<c0f8eda8>] (rtnl_newlink+0x44/0x60) [ 31.443329] [<c0f8eda8>] (rtnl_newlink) from [<c0f89f20>] (rtnetlink_rcv_msg+0x2cc/0x51c) [ 31.451477] [<c0f89f20>] (rtnetlink_rcv_msg) from [<c1008df8>] (netlink_rcv_skb+0xb8/0x110) [ 31.459796] [<c1008df8>] (netlink_rcv_skb) from [<c1008648>] (netlink_unicast+0x17c/0x1f8) [ 31.468026] [<c1008648>] (netlink_unicast) from [<c1008980>] (netlink_sendmsg+0x2bc/0x3b4) [ 31.476261] [<c1008980>] (netlink_sendmsg) from [<c0f43858>] (___sys_sendmsg+0x230/0x250) [ 31.484408] [<c0f43858>] (___sys_sendmsg) from [<c0f44c84>] (__sys_sendmsg+0x50/0x8c) [ 31.492209] [<c0f44c84>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x28) [ 31.500090] Exception stack(0xedf47fa8 to 0xedf47ff0) [ 31.505122] 7fa0: 00000002 b6f2e060 00000003 beabd6a4 00000000 00000000 [ 31.513265] 7fc0: 00000002 b6f2e060 5d6e3213 00000128 00000000 00000001 00000006 000619c4 [ 31.521405] 7fe0: 00086078 beabd658 0005edbc b6e7ce68 The reason is the implementation of br_get_pvid: static inline u16 br_get_pvid(const struct net_bridge_vlan_group *vg) { if (!vg) return 0; smp_rmb(); return vg->pvid; } Since VID 0 is an invalid pvid from the bridge's point of view, let's add this check in dsa_8021q_restore_pvid to avoid restoring a pvid that doesn't really exist. Fixes: 5f33183b7fdf ("net: dsa: tag_8021q: Restore bridge VLANs when enabling vlan_filtering") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16Merge branch 'seg6-fixes-to-Segment-Routing-in-IPv6'David S. Miller1-0/+11
Andrea Mayer says: ==================== seg6: fixes to Segment Routing in IPv6 This patchset is divided in 2 patches and it introduces some fixes to Segment Routing in IPv6, which are: - in function get_srh() fix the srh pointer after calling pskb_may_pull(); - fix the skb->transport_header after calling decap_and_validate() function; Any comments on the patchset are welcome. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16seg6: fix skb transport_header after decap_and_validate()Andrea Mayer1-0/+6
in the receive path (more precisely in ip6_rcv_core()) the skb->transport_header is set to skb->network_header + sizeof(*hdr). As a consequence, after routing operations, destination input expects to find skb->transport_header correctly set to the next protocol (or extension header) that follows the network protocol. However, decap behaviors (DX*, DT*) remove the outer IPv6 and SRH extension and do not set again the skb->transport_header pointer correctly. For this reason, the patch sets the skb->transport_header to the skb->network_header + sizeof(hdr) in each DX* and DT* behavior. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16seg6: fix srh pointer in get_srh()Andrea Mayer1-0/+5
pskb_may_pull may change pointers in header. For this reason, it is mandatory to reload any pointer that points into skb header. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16net: stmmac: Use the correct style for SPDX License IdentifierNishad Kamdar3-3/+3
This patch corrects the SPDX License Identifier style in header files related to STMicroelectronics based Multi-Gigabit Ethernet driver. For C header files Documentation/process/license-rules.rst mandates C-like comments (opposed to C source files where C++ style should be used). Changes made by using a script provided by Joe Perches here: https://lkml.org/lkml/2019/2/7/46. Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16octeontx2-af: Use the correct style for SPDX License IdentifierNishad Kamdar9-18/+18
This patch corrects the SPDX License Identifier style in header files related to Marvell OcteonTX2 network devices. It uses an expilict block comment for the SPDX License Identifier. Changes made by using a script provided by Joe Perches here: https://lkml.org/lkml/2019/2/7/46. Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16Merge branch 'akpm' (patches from Andrew)Linus Torvalds12-86/+133
Merge misc fixes from Andrew Morton: "11 fixes" MM fixes and one xz decompressor fix. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/debug.c: PageAnon() is true for PageKsm() pages mm/debug.c: __dump_page() prints an extra line mm/page_io.c: do not free shared swap slots mm/memory_hotplug: fix try_offline_node() mm,thp: recheck each page before collapsing file THP mm: slub: really fix slab walking for init_on_free mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() lib/xz: fix XZ_DYNALLOC to avoid useless memory reallocations mm: fix trying to reclaim unevictable lru page when calling madvise_pageout mm: mempolicy: fix the wrong return value and potential pages leak of mbind
2019-11-16Merge branch 'for-linus' of ↵Linus Torvalds3-0/+11
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull more input fixes from Dmitry Torokhov: "A couple of fixes in driver teardown paths and another ID for Synaptics RMI mode" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: synaptics - enable RMI mode for X1 Extreme 2nd Generation Input: synaptics-rmi4 - destroy F54 poller workqueue when removing Input: ff-memless - kill timer in destroy()
2019-11-16mm/debug.c: PageAnon() is true for PageKsm() pagesRalph Campbell1-3/+3
PageAnon() and PageKsm() use the low two bits of the page->mapping pointer to indicate the page type. PageAnon() only checks the LSB while PageKsm() checks the least significant 2 bits are equal to 3. Therefore, PageAnon() is true for KSM pages. __dump_page() incorrectly will never print "ksm" because it checks PageAnon() first. Fix this by checking PageKsm() first. Link: http://lkml.kernel.org/r/20191113000651.20677-1-rcampbell@nvidia.com Fixes: 1c6fb1d89e73 ("mm: print more information about mapping in __dump_page") Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16mm/debug.c: __dump_page() prints an extra lineRalph Campbell1-12/+15
When dumping struct page information, __dump_page() prints the page type with a trailing blank followed by the page flags on a separate line: anon flags: 0x100000000090034(uptodate|lru|active|head|swapbacked) It looks like the intent was to use pr_cont() for printing "flags:" but pr_cont() usage is discouraged so fix this by extending the format to include the flags into a single line: anon flags: 0x100000000090034(uptodate|lru|active|head|swapbacked) If the page is file backed, the name might be long so use two lines: shmem_aops name:"dev/zero" flags: 0x10000000008000c(uptodate|dirty|swapbacked) Eliminate pr_conf() usage as well for appending compound_mapcount. Link: http://lkml.kernel.org/r/20191112012608.16926-1-rcampbell@nvidia.com Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16mm/page_io.c: do not free shared swap slotsVinayak Menon1-3/+3
The following race is observed due to which a processes faulting on a swap entry, finds the page neither in swapcache nor swap. This causes zram to give a zero filled page that gets mapped to the process, resulting in a user space crash later. Consider parent and child processes Pa and Pb sharing the same swap slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set. Virtual address 'VA' of Pa and Pb points to the shared swap entry. Pa Pb fault on VA fault on VA do_swap_page do_swap_page lookup_swap_cache fails lookup_swap_cache fails Pb scheduled out swapin_readahead (deletes zram entry) swap_free (makes swap_count 1) Pb scheduled in swap_readpage (swap_count == 1) Takes SWP_SYNCHRONOUS_IO path zram enrty absent zram gives a zero filled page Fix this by making sure that swap slot is freed only when swap count drops down to one. Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org Fixes: aa8d22a11da9 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference") Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Suggested-by: Minchan Kim <minchan@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16mm/memory_hotplug: fix try_offline_node()David Hildenbrand3-16/+64
try_offline_node() is pretty much broken right now: - The node span is updated when onlining memory, not when adding it. We ignore memory that was mever onlined. Bad. - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily trigger a kernel panic. Bad for memory that is offline but also bad for subsection hotadd with ZONE_DEVICE, whereby the memmap of the first PFN of a section might contain garbage. - Sections belonging to mixed nodes are not properly considered. As memory blocks might belong to multiple nodes, we would have to walk all pageblocks (or at least subsections) within present sections. However, we don't have a way to identify whether a memmap that is not online was initialized (relevant for ZONE_DEVICE). This makes things more complicated. Luckily, we can piggy pack on the node span and the nid stored in memory blocks. Currently, the node span is grown when calling move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when removing memory, before calling try_offline_node(). Sysfs links are created via link_mem_sections(), e.g., during boot or when adding memory. If the node still spans memory or if any memory block belongs to the nid, we don't set the node offline. As memory blocks that span multiple nodes cannot get offlined, the nid stored in memory blocks is reliable enough (for such online memory blocks, the node still spans the memory). Introduce for_each_memory_block() to efficiently walk all memory blocks. Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span when removing ZONE_DEVICE memory to fix similar issues (access of garbage memmaps) - until we have a reliable way to identify whether these memmaps were properly initialized. This implies later, that once a node had ZONE_DEVICE memory, we won't be able to set a node offline - which should be acceptable. Since commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") memory that is added is not assoziated with a zone/node (memmap not initialized). The introducing commit 60a5a19e7419 ("memory-hotplug: remove sysfs file of node") already missed that we could have multiple nodes for a section and that the zone/node span is updated when onlining pages, not when adding them. I tested this by hotplugging two DIMMs to a memory-less and cpu-less NUMA node. The node is properly onlined when adding the DIMMs. When removing the DIMMs, the node is properly offlined. Masayoshi Mizuma reported: : Without this patch, memory hotplug fails as panic: : : BUG: kernel NULL pointer dereference, address: 0000000000000000 : ... : Call Trace: : remove_memory_block_devices+0x81/0xc0 : try_remove_memory+0xb4/0x130 : __remove_memory+0xa/0x20 : acpi_memory_device_remove+0x84/0x100 : acpi_bus_trim+0x57/0x90 : acpi_bus_trim+0x2e/0x90 : acpi_device_hotplug+0x2b2/0x4d0 : acpi_hotplug_work_fn+0x1a/0x30 : process_one_work+0x171/0x380 : worker_thread+0x49/0x3f0 : kthread+0xf8/0x130 : ret_from_fork+0x35/0x40 [david@redhat.com: v3] Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com Fixes: 60a5a19e7419 ("memory-hotplug: remove sysfs file of node") Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visiable after d0dc12e86b319 Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Nayna Jain <nayna@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16mm,thp: recheck each page before collapsing file THPSong Liu1-12/+16
In collapse_file(), for !is_shmem case, current check cannot guarantee the locked page is up-to-date. Specifically, xas_unlock_irq() should not be called before lock_page() and get_page(); and it is necessary to recheck PageUptodate() after locking the page. With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text may contain corrupted data. This is because khugepaged mistakenly collapses some not up-to-date sub pages into a huge page, and assumes the huge page is up-to-date. This will NOT corrupt data in the disk, because the page is read-only and never written back. Fix this by properly checking PageUptodate() after locking the page. This check replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);". Also, move PageDirty() check after locking the page. Current khugepaged should not try to collapse dirty file THP, because it is limited to read-only .text. The only case we hit a dirty page here is when the page hasn't been written since write. Bail out and retry when this happens. syzbot reported bug on previous version of this patch. Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Song Liu <songliubraving@fb.com> Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16mm: slub: really fix slab walking for init_on_freeLaura Abbott1-30/+9
Commit 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free") fixed one problem with the slab walking but missed a key detail: When walking the list, the head and tail pointers need to be updated since we end up reversing the list as a result. Without doing this, bulk free is broken. One way this is exposed is a NULL pointer with slub_debug=F: ============================================================================= BUG skbuff_head_cache (Tainted: G T): Object already free ----------------------------------------------------------------------------- INFO: Slab 0x000000000d2d2f8f objects=16 used=3 fp=0x0000000064309071 flags=0x3fff00000000201 BUG: kernel NULL pointer dereference, address: 0000000000000000 Oops: 0000 [#1] PREEMPT SMP PTI RIP: 0010:print_trailer+0x70/0x1d5 Call Trace: <IRQ> free_debug_processing.cold.37+0xc9/0x149 __slab_free+0x22a/0x3d0 kmem_cache_free_bulk+0x415/0x420 __kfree_skb_flush+0x30/0x40 net_rx_action+0x2dd/0x480 __do_softirq+0xf0/0x246 irq_exit+0x93/0xb0 do_IRQ+0xa0/0x110 common_interrupt+0xf/0xf </IRQ> Given we're now almost identical to the existing debugging code which correctly walks the list, combine with that. Link: https://lkml.kernel.org/r/20191104170303.GA50361@gandi.net Link: http://lkml.kernel.org/r/20191106222208.26815-1-labbott@redhat.com Fixes: 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free") Signed-off-by: Laura Abbott <labbott@redhat.com> Reported-by: Thibaut Sautereau <thibaut.sautereau@clip-os.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Alexander Potapenko <glider@google.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <clipos@ssi.gouv.fr> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()Roman Gushchin1-1/+1
An exiting task might belong to an offline cgroup. In this case an attempt to grab a cgroup reference from the task can end up with an infinite loop in hugetlb_cgroup_charge_cgroup(), because neither the cgroup will become online, neither the task will be migrated to a live cgroup. Fix this by switching over to css_tryget(). As css_tryget_online() can't guarantee that the cgroup won't go offline, in most cases the check doesn't make sense. In this particular case users of hugetlb_cgroup_charge_cgroup() are not affected by this change. A similar problem is described by commit 18fa84a2db0e ("cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()"). Link: http://lkml.kernel.org/r/20191106225131.3543616-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()Roman Gushchin1-1/+1
We've encountered a rcu stall in get_mem_cgroup_from_mm(): rcu: INFO: rcu_sched self-detected stall on CPU rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017 (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33 <...> RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90 <...> __memcg_kmem_charge+0x55/0x140 __alloc_pages_nodemask+0x267/0x320 pipe_write+0x1ad/0x400 new_sync_write+0x127/0x1c0 __kernel_write+0x4f/0xf0 dump_emit+0x91/0xc0 writenote+0xa0/0xc0 elf_core_dump+0x11af/0x1430 do_coredump+0xc65/0xee0 get_signal+0x132/0x7c0 do_signal+0x36/0x640 exit_to_usermode_loop+0x61/0xd0 do_syscall_64+0xd4/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The problem is caused by an exiting task which is associated with an offline memcg. We're iterating over and over in the do {} while (!css_tryget_online()) loop, but obviously the memcg won't become online and the exiting task won't be migrated to a live memcg. Let's fix it by switching from css_tryget_online() to css_tryget(). As css_tryget_online() cannot guarantee that the memcg won't go offline, the check is usually useless, except some rare cases when for example it determines if something should be presented to a user. A similar problem is described by commit 18fa84a2db0e ("cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()"). Johannes: : The bug aside, it doesn't matter whether the cgroup is online for the : callers. It used to matter when offlining needed to evacuate all charges : from the memcg, and so needed to prevent new ones from showing up, but we : don't care now. Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Shakeel Butt <shakeeb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16lib/xz: fix XZ_DYNALLOC to avoid useless memory reallocationsLasse Collin1-0/+1
s->dict.allocated was initialized to 0 but never set after a successful allocation, thus the code always thought that the dictionary buffer has to be reallocated. Link: http://lkml.kernel.org/r/20191104185107.3b6330df@tukaani.org Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Reported-by: Yu Sun <yusun2@cisco.com> Acked-by: Daniel Walker <danielwa@cisco.com> Cc: "Yixia Si (yisi)" <yisi@cisco.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16mm: fix trying to reclaim unevictable lru page when calling madvise_pageoutzhong jiang1-4/+12
Recently, I hit the following issue when running upstream. kernel BUG at mm/vmscan.c:1521! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1 RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521 Call Trace: reclaim_pages+0x499/0x800 mm/vmscan.c:2188 madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453 walk_pmd_range mm/pagewalk.c:53 [inline] walk_pud_range mm/pagewalk.c:112 [inline] walk_p4d_range mm/pagewalk.c:139 [inline] walk_pgd_range mm/pagewalk.c:166 [inline] __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261 walk_page_range+0x179/0x310 mm/pagewalk.c:349 madvise_pageout_page_range mm/madvise.c:506 [inline] madvise_pageout+0x1f0/0x330 mm/madvise.c:542 madvise_vma mm/madvise.c:931 [inline] __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113 do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe madvise_pageout() accesses the specified range of the vma and isolates them, then runs shrink_page_list() to reclaim its memory. But it also isolates the unevictable pages to reclaim. Hence, we can catch the cases in shrink_page_list(). The root cause is that we scan the page tables instead of specific LRU list. and so we need to filter out the unevictable lru pages from our end. Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") Signed-off-by: zhong jiang <zhongjiang@huawei.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16mm: mempolicy: fix the wrong return value and potential pages leak of mbindYang Shi1-5/+9
Commit d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value of mbind() for a couple of corner cases. But, it altered the errno for some other cases, for example, mbind() should return -EFAULT when part or all of the memory range specified by nodemask and maxnode points outside your accessible address space, or there was an unmapped hole in the specified memory range specified by addr and len. Fix this by preserving the errno returned by queue_pages_range(). And, the pagelist may be not empty even though queue_pages_range() returns error, put the pages back to LRU since mbind_range() is not called to really apply the policy so those pages should not be migrated, this is also the old behavior before the problematic commit. Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.alibaba.com Fixes: d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reported-by: Li Xinhai <lixinhai.lxh@gmail.com> Reviewed-by: Li Xinhai <lixinhai.lxh@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> [4.19 and 5.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-16Input: synaptics - enable RMI mode for X1 Extreme 2nd GenerationLyude Paul1-0/+1
Just got one of these for debugging some unrelated issues, and noticed that Lenovo seems to have gone back to using RMI4 over smbus with Synaptics touchpads on some of their new systems, particularly this one. So, let's enable RMI mode for the X1 Extreme 2nd Generation. Signed-off-by: Lyude Paul <lyude@redhat.com> Link: https://lore.kernel.org/r/20191115221814.31903-1-lyude@redhat.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2019-11-16Merge branch 'bpf-trampoline'Daniel Borkmann34-173/+2354
Alexei Starovoitov says: ==================== Introduce BPF trampoline that works as a bridge between kernel functions, BPF programs and other BPF programs. The first use case is fentry/fexit BPF programs that are roughly equivalent to kprobe/kretprobe. Unlike k[ret]probe there is practically zero overhead to call a set of BPF programs before or after kernel function. The second use case is heavily influenced by pain points in XDP development. BPF trampoline allows attaching similar fentry/fexit BPF program to any networking BPF program. It's now possible to see packets on input and output of any XDP, TC, lwt, cgroup programs without disturbing them. This greatly helps BPF-based network troubleshooting. The third use case of BPF trampoline will be explored in the follow up patches. The BPF trampoline will be used to dynamicly link BPF programs. It's more generic mechanism than array and link list of programs used in tracing, networking, cgroups. In many cases it can be used as a replacement for bpf_tail_call-based program chaining. See [1] for long term design discussion. v3 -> v4: - Included Peter's "86/alternatives: Teach text_poke_bp() to emulate instructions" as a first patch. If it changes between now and merge window, I'll rebease to newer version. The patch is necessary to do s/text_poke/text_poke_bp/ in patch 3 to fix the race. - In patch 4 fixed bpf_trampoline creation race spotted by Andrii. - Added patch 15 that annotates prog->kern bpf context types. It made patches 16 and 17 cleaner and more generic. - Addressed Andrii's feedback in other patches. v2 -> v3: - Addressed Song's and Andrii's comments - Fixed few minor bugs discovered while testing - Added one more libbpf patch v1 -> v2: - Addressed Andrii's comments - Added more test for fentry/fexit to kernel functions. Including stress test for maximum number of progs per trampoline. - Fixed a race btf_resolve_helper_id() - Added a patch to compare BTF types of functions arguments with actual types. - Added support for attaching BPF program to another BPF program via trampoline - Converted to use text_poke() API. That's the only viable mechanism to implement BPF-to-BPF attach. BPF-to-kernel attach can be refactored to use register_ftrace_direct() whenever it's available. [1] https://lore.kernel.org/bpf/20191112025112.bhzmrrh2pr76ssnh@ast-mbp.dhcp.thefacebook.com/ ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-11-16selftests/bpf: Add a test for attaching BPF prog to another BPF prog and subprogAlexei Starovoitov2-0/+167
Add a test that attaches one FEXIT program to main sched_cls networking program and two other FEXIT programs to subprograms. All three tracing programs access return values and skb->len of networking program and subprograms. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-21-ast@kernel.org
2019-11-16selftests/bpf: Extend test_pkt_access testAlexei Starovoitov1-2/+36
The test_pkt_access.o is used by multiple tests. Fix its section name so that program type can be automatically detected by libbpf and make it call other subprograms with skb argument. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-20-ast@kernel.org
2019-11-16libbpf: Add support for attaching BPF programs to other BPF programsAlexei Starovoitov5-17/+71
Extend libbpf api to pass attach_prog_fd into bpf_object__open. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-19-ast@kernel.org
2019-11-16bpf: Support attaching tracing BPF program to other BPF programsAlexei Starovoitov8-28/+152
Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type including their subprograms. This feature allows snooping on input and output packets in XDP, TC programs including their return values. In order to do that the verifier needs to track types not only of vmlinux, but types of other BPF programs as well. The verifier also needs to translate uapi/linux/bpf.h types used by networking programs into kernel internal BTF types used by FENTRY/FEXIT BPF programs. In some cases LLVM optimizations can remove arguments from BPF subprograms without adjusting BTF info that LLVM backend knows. When BTF info disagrees with actual types that the verifiers sees the BPF trampoline has to fallback to conservative and treat all arguments as u64. The FENTRY/FEXIT program can still attach to such subprograms, but it won't be able to recognize pointer types like 'struct sk_buff *' and it won't be able to pass them to bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program would need to use bpf_probe_read_kernel() instead. The BPF_PROG_LOAD command is extended with attach_prog_fd field. When it's set to zero the attach_btf_id is one vmlinux BTF type ids. When attach_prog_fd points to previously loaded BPF program the attach_btf_id is BTF type id of main function or one of its subprograms. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org
2019-11-16bpf: Compare BTF types of functions arguments with actual typesAlexei Starovoitov5-3/+107
Make the verifier check that BTF types of function arguments match actual types passed into top-level BPF program and into BPF-to-BPF calls. If types match such BPF programs and sub-programs will have full support of BPF trampoline. If types mismatch the trampoline has to be conservative. It has to save/restore five program arguments and assume 64-bit scalars. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-17-ast@kernel.org
2019-11-16netfilter: nf_tables: add nft_unregister_flowtable_hook()Pablo Neira Ayuso1-10/+14
Unbind flowtable callback if hook is unregistered. This patch is implicitly fixing the error path of nf_tables_newflowtable() and nft_flowtable_event(). Fixes: 8bb69f3b2918 ("netfilter: nf_tables: add flowtable offload control plane") Reported-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16netfilter: nf_tables: check if bind callback fails and unbind if hook ↵wenxu1-3/+11
registration fails Undo the callback binding before unregistering the existing hooks. This should also check for error of the bind setup call. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16netfilter: nf_tables_offload: undo updates if transaction failsPablo Neira Ayuso2-1/+64
The nft_flow_rule_offload_commit() function might fail after several successful commands, thus, leaving the hardware filtering policy in inconsistent state. This patch adds nft_flow_rule_offload_abort() function which undoes the updates that have been already processed if one command in this transaction fails. Hence, the hardware ruleset is left as it was before this aborted transaction. The deletion path needs to create the flow_rule object too, in case that an existing rule needs to be re-added from the abort path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16netfilter: nf_tables_offload: release flow_rule on error from commit pathPablo Neira Ayuso1-5/+21
If hardware offload commit path fails, release all flow_rule objects. Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16netfilter: nf_tables_offload: remove reference to flow rule from deletion pathPablo Neira Ayuso1-2/+1
The cookie is sufficient to delete the rule from the hardware. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16netfilter: nf_flow_table: remove unnecessary parameter in flow_offload_fill_dirwenxu1-4/+4
The ct object is already in the flow_offload structure, remove it. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16netfilter: nf_flow_table_offload: Fix check ndo_setup_tc when setup_blockwenxu1-0/+3
It should check the ndo_setup_tc in the nf_flow_table_offload_setup. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16bpf: Annotate context typesAlexei Starovoitov6-43/+176
Annotate BPF program context types with program-side type and kernel-side type. This type information is used by the verifier. btf_get_prog_ctx_type() is used in the later patches to verify that BTF type of ctx in BPF program matches to kernel expected ctx type. For example, the XDP program type is: BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp, struct xdp_md, struct xdp_buff) That means that XDP program should be written as: int xdp_prog(struct xdp_md *ctx) { ... } Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-16-ast@kernel.org
2019-11-16netfilter: Support iif matches in POSTROUTINGPhil Sutter4-6/+6
Instead of generally passing NULL to NF_HOOK_COND() for input device, pass skb->dev which contains input device for routed skbs. Note that iptables (both legacy and nft) reject rules with input interface match from being added to POSTROUTING chains, but nftables allows this. Cc: Eric Garver <eric@garver.life> Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16netfilter: nf_flow_table_offload: add IPv6 supportPablo Neira Ayuso5-11/+127
Add nf_flow_rule_route_ipv6() and use it from the IPv6 and the inet flowtable type definitions. Rename the nf_flow_rule_route() function to nf_flow_rule_route_ipv4(). Adjust maximum number of actions, which now becomes 16 to leave sufficient room for the IPv6 address mangling for NAT. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16netfilter: nf_flow_table_offload: add flow_action_entry_next() and use itPablo Neira Ayuso1-38/+38
This function retrieves a spare action entry from the array of actions. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16netfilter: nft_meta: use 64-bit time arithmeticArnd Bergmann1-5/+5
On 32-bit architectures, get_seconds() returns an unsigned 32-bit time value, which also matches the type used in the nft_meta code. This will not overflow in year 2038 as a time_t would, but it still suffers from the overflow problem later on in year 2106. Change this instance to use the time64_t type consistently and avoid the deprecated get_seconds(). The nft_meta_weekday() calculation potentially gets a little slower on 32-bit architectures, but now it has the same behavior as on 64-bit architectures and does not overflow. Fixes: 63d10e12b00d ("netfilter: nft_meta: support for time matching") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16netfilter: xt_time: use time64_tArnd Bergmann1-8/+11
The current xt_time driver suffers from the y2038 overflow on 32-bit architectures, when the time of day calculations break. Also, on both 32-bit and 64-bit architectures, there is a problem with info->date_start/stop, which is part of the user ABI and overflows in in 2106. Fix the first issue by using time64_t and explicit calls to div_u64() and div_u64_rem(), and document the seconds issue. The explicit 64-bit division is unfortunately slower on 32-bit architectures, but doing it as unsigned lets us use the optimized division-through-multiplication path in most configurations. This should be fine, as the code already does not allow any negative time of day values. Using u32 seconds values consistently would probably also work and be a little more efficient, but that doesn't feel right as it would propagate the y2106 overflow to more place rather than fewer. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-16bpf: Fix race in btf_resolve_helper_id()Alexei Starovoitov4-9/+32
btf_resolve_helper_id() caching logic is a bit racy, since under root the verifier can verify several programs in parallel. Fix it with READ/WRITE_ONCE. Fix the type as well, since error is also recorded. Fixes: a7658e1a4164 ("bpf: Check types of arguments passed into helpers") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-15-ast@kernel.org
2019-11-16bpf: Reserve space for BPF trampoline in BPF programsAlexei Starovoitov1-2/+7
BPF trampoline can be made to work with existing 5 bytes of BPF program prologue, but let's add 5 bytes of NOPs to the beginning of every JITed BPF program to make BPF trampoline job easier. They can be removed in the future. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-14-ast@kernel.org
2019-11-16selftests/bpf: Add stress test for maximum number of progsAlexei Starovoitov1-0/+76
Add stress test for maximum number of attached BPF programs per BPF trampoline. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-13-ast@kernel.org
2019-11-16selftests/bpf: Add combined fentry/fexit testAlexei Starovoitov1-0/+90
Add a combined fentry/fexit test. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-12-ast@kernel.org
2019-11-16selftests/bpf: Add fexit tests for BPF trampolineAlexei Starovoitov2-0/+162
Add fexit tests for BPF trampoline that checks kernel functions with up to 6 arguments of different sizes and their return values. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-11-ast@kernel.org
2019-11-16selftests/bpf: Add test for BPF trampolineAlexei Starovoitov3-0/+167
Add sanity test for BPF trampoline that checks kernel functions with up to 6 arguments of different sizes. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-10-ast@kernel.org