Age | Commit message (Collapse) | Author | Files | Lines |
|
Patch series "mm: fix some MADV_FREE issues", v5.
We are trying to use MADV_FREE in jemalloc. Several issues are found.
Without solving the issues, jemalloc can't use the MADV_FREE feature.
- Doesn't support system without swap enabled. Because if swap is off,
we can't or can't efficiently age anonymous pages. And since
MADV_FREE pages are mixed with other anonymous pages, we can't
reclaim MADV_FREE pages. In current implementation, MADV_FREE will
fallback to MADV_DONTNEED without swap enabled. But in our
environment, a lot of machines don't enable swap. This will prevent
our setup using MADV_FREE.
- Increases memory pressure. page reclaim bias file pages reclaim
against anonymous pages. This doesn't make sense for MADV_FREE pages,
because those pages could be freed easily and refilled with very
slight penality. Even page reclaim doesn't bias file pages, there is
still an issue, because MADV_FREE pages and other anonymous pages are
mixed together. To reclaim a MADV_FREE page, we probably must scan a
lot of other anonymous pages, which is inefficient. In our test, we
usually see oom with MADV_FREE enabled and nothing without it.
- Accounting. There are two accounting problems. We don't have a global
accounting. If the system is abnormal, we don't know if it's a
problem from MADV_FREE side. The other problem is RSS accounting.
MADV_FREE pages are accounted as normal anon pages and reclaimed
lazily, so application's RSS becomes bigger. This confuses our
workloads. We have monitoring daemon running and if it finds
applications' RSS becomes abnormal, the daemon will kill the
applications even kernel can reclaim the memory easily.
To address the first the two issues, we can either put MADV_FREE pages
into a separate LRU list (Minchan's previous patches and V1 patches), or
put them into LRU_INACTIVE_FILE list (suggested by Johannes). The
patchset use the second idea. The reason is LRU_INACTIVE_FILE list is
tiny nowadays and should be full of used once file pages. So we can
still efficiently reclaim MADV_FREE pages there without interference
with other anon and active file pages. Putting the pages into inactive
file list also has an advantage which allows page reclaim to prioritize
MADV_FREE pages and used once file pages. MADV_FREE pages are put into
the lru list and clear SwapBacked flag, so PageAnon(page) &&
!PageSwapBacked(page) will indicate a MADV_FREE pages. These pages will
directly freed without pageout if they are clean, otherwise normal swap
will reclaim them.
For the third issue, the previous post adds global accounting and a
separate RSS count for MADV_FREE pages. The problem is we never get
accurate accounting for MADV_FREE pages. The pages are mapped to
userspace, can be dirtied without notice from kernel side. To get
accurate accounting, we could write protect the page, but then there is
extra page fault overhead, which people don't want to pay. Jemalloc
guys have concerns about the inaccurate accounting, so this post drops
the accounting patches temporarily. The info exported to
/proc/pid/smaps for MADV_FREE pages are kept, which is the only place we
can get accurate accounting right now.
This patch (of 6):
Johannes pointed out TTU_LZFREE is unnecessary. It's true because we
always have the flag set if we want to do an unmap. For cases we don't
do an unmap, the TTU_LZFREE part of code should never run.
Also the TTU_UNMAP is unnecessary. If no other flags set (for example,
TTU_MIGRATION), an unmap is implied.
The patch includes Johannes's cleanup and dead TTU_ACTION macro removal
code
Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use setup_deferrable_timer() instead of init_timer_deferrable() to
simplify the code.
Link: http://lkml.kernel.org/r/e8e3d4280a34facbc007346f31df833cec28801e.1488070291.git.geliangtang@gmail.com
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The backoff mechanism is not needed. If we have MAX_RECLAIM_RETRIES
loops without progress, we'll OOM anyway; backing off might cut one or
two iterations off that in the rare OOM case. If we have intermittent
success reclaiming a few pages, the backoff function gets reset also,
and so is of little help in these scenarios.
We might want a backoff function for when there IS progress, but not
enough to be satisfactory. But this isn't that. Remove it.
Link: http://lkml.kernel.org/r/20170228214007.5621-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit d7f05528eedb047efe2288cff777676b028747b6.
Now that reclaimability of a node is no longer based on the ratio
between pages scanned and theoretically reclaimable pages, we can remove
accounting tricks for pages skipped due to zone constraints.
Link: http://lkml.kernel.org/r/20170228214007.5621-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
NR_PAGES_SCANNED counts number of pages scanned since the last page free
event in the allocator. This was used primarily to measure the
reclaimability of zones and nodes, and determine when reclaim should
give up on them. In that role, it has been replaced in the preceding
patches by a different mechanism.
Being implemented as an efficient vmstat counter, it was automatically
exported to userspace as well. It's however unlikely that anyone
outside the kernel is using this counter in any meaningful way.
Remove the counter and the unused pgdat_reclaimable().
Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets")
sought to avoid high reclaim priorities for memcg by forcing it to scan
a minimum amount of pages when lru_pages >> priority yielded nothing.
This was done at a time when reclaim decisions like dirty throttling
were tied to the priority level.
Nowadays, the only meaningful thing still tied to priority dropping
below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally
allowed to write. But that is from an era where direct reclaim was
still allowed to call ->writepage, and kswapd nowadays avoids writes
until it's scanned every clean page in the system. Potential changes to
how quick sc->may_writepage could trigger are of little concern.
Remove the force_scan stuff, as well as the ugly multi-pass target
calculation that it necessitated.
Link: http://lkml.kernel.org/r/20170228214007.5621-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets")
sought to avoid high reclaim priorities for kswapd by forcing it to scan
a minimum amount of pages when lru_pages >> priority yielded nothing.
Commit b95a2f2d486d ("mm: vmscan: convert global reclaim to per-memcg
LRU lists"), due to switching global reclaim to a round-robin scheme
over all cgroups, had to restrict this forceful behavior to
unreclaimable zones in order to prevent massive overreclaim with many
cgroups.
The latter patch effectively neutered the behavior completely for all
but extreme memory pressure. But in those situations we might as well
drop the reclaimers to lower priority levels. Remove the check.
Link: http://lkml.kernel.org/r/20170228214007.5621-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
NUMA balancing already checks the watermarks of the target node to
decide whether it's a suitable balancing target. Whether the node is
reclaimable or not is irrelevant when we don't intend to reclaim.
Link: http://lkml.kernel.org/r/20170228214007.5621-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
nodes") allowed laptop_mode=1 to start writing not just when the
priority drops to DEF_PRIORITY - 2 but also when the node is
unreclaimable.
That appears to be a spurious change in this patch as I doubt the series
was tested with laptop_mode, and neither is that particular change
mentioned in the changelog. Remove it, it's still recent.
Link: http://lkml.kernel.org/r/20170228214007.5621-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
PF_MEMALLOC direct reclaimers get throttled on a node when the sum of
all free pages in each zone fall below half the min watermark. During
the summation, we want to exclude zones that don't have reclaimables.
Checking the same pgdat over and over again doesn't make sense.
Fixes: 599d0c954f91 ("mm, vmscan: move LRU lists to node")
Link: http://lkml.kernel.org/r/20170228214007.5621-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".
Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage. We have seen similar cases at Facebook.
The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages. In Jia's and our
scenarios, there are no reclaimable pages in the node, however, and the
condition for backing off is never met. Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.
This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer. If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.
Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.
Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.
If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.
This patch (of 9):
Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:
$ echo 4000 >/proc/sys/vm/nr_hugepages
top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3
At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.
Kswapd needs to back away from nodes that fail to balance. Up until
commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism. It considered zones whose
theoretically reclaimable pages it had reclaimed six times over as
unreclaimable and backed away from them. This guard was erroneously
removed as the patch changed the definition of a balanced node.
However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.
Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.
[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb@google.com: fix condition for throttle_direct_reclaim]
Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Jia He <hejianet@gmail.com>
Tested-by: Jia He <hejianet@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Each slab kmem cache has per cpu array caches. The array caches are
created when the kmem_cache is created, either via kmem_cache_create()
or lazily when the first object is allocated in context of a kmem
enabled memcg. Array caches are replaced by writing to /proc/slabinfo.
Array caches are protected by holding slab_mutex or disabling
interrupts. Array cache allocation and replacement is done by
__do_tune_cpucache() which holds slab_mutex and calls
kick_all_cpus_sync() to interrupt all remote processors which confirms
there are no references to the old array caches.
IPIs are needed when replacing array caches. But when creating a new
array cache, there's no need to send IPIs because there cannot be any
references to the new cache. Outside of memcg kmem accounting these
IPIs occur at boot time, so they're not a problem. But with memcg kmem
accounting each container can create kmem caches, so the IPIs are
wasteful.
Avoid unnecessary IPIs when creating array caches.
Test which reports the IPI count of allocating slab in 10000 memcg:
import os
def ipi_count():
with open("/proc/interrupts") as f:
for l in f:
if 'Function call interrupts' in l:
return int(l.split()[1])
def echo(val, path):
with open(path, "w") as f:
f.write(val)
n = 10000
os.chdir("/mnt/cgroup/memory")
pid = str(os.getpid())
a = ipi_count()
for i in range(n):
os.mkdir(str(i))
echo("1G\n", "%d/memory.limit_in_bytes" % i)
echo("1G\n", "%d/memory.kmem.limit_in_bytes" % i)
echo(pid, "%d/cgroup.procs" % i)
open("/tmp/x", "w").close()
os.unlink("/tmp/x")
b = ipi_count()
print "%d loops: %d => %d (+%d ipis)" % (n, a, b, b-a)
echo(pid, "cgroup.procs")
for i in range(n):
os.rmdir(str(i))
patched: 10000 loops: 1069 => 1170 (+101 ipis)
unpatched: 10000 loops: 1192 => 48933 (+47741 ipis)
Link: http://lkml.kernel.org/r/20170416214544.109476-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use offset_in_page() macro instead of open-coding.
Link: http://lkml.kernel.org/r/4dbc77ccaaed98b183cf4dba58a4fa325fd65048.1492758503.git.geliangtang@gmail.com
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Configfs is the interface for ocfs2-tools to set configure to kernel and
$configfs_dir/cluster/$clustername/heartbeat/dead_threshold is the one
used to configure heartbeat dead threshold. Kernel has a default value
of it but user can set O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb
to override it.
Commit 45b997737a80 ("ocfs2/cluster: use per-attribute show and store
methods") changed heartbeat dead threshold name while ocfs2-tools did
not, so ocfs2-tools won't set this configurable and the default value is
always used. So revert it.
Fixes: 45b997737a80 ("ocfs2/cluster: use per-attribute show and store methods")
Link: http://lkml.kernel.org/r/1490665245-15374-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use setup_timer() instead of init_timer() to simplify the code.
Link: http://lkml.kernel.org/r/5e75bf07beb91e092d5aa36c36769949a480456a.1489060564.git.geliangtang@gmail.com
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In many of clk_disable() implementations, it is a no-op for a NULL
pointer input, but this is one of the exceptions.
Making it treewide consistent will allow clock consumers to call
clk_disable() without NULL pointer check.
Link: http://lkml.kernel.org/r/1490692624-11931-4-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Michael Turquette <mturquette@baylibre.com>
Cc: Steven Miao <realmz6@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Here are some of the more common spelling mistakes that I've found while
fixing up spelling mistakes in kernel error message text. They probably
should be added to this list so we don't keep on seeing them appearing
again.
Link: http://lkml.kernel.org/r/20170421122534.5378-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Joe Perches <joe@perches.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Interrupt enable/disabled with spinlock is not a valid operation for RT
as it can make executing tasks sleep from a non-sleepable context. So
convert it to spin_lock_irq[save, restore].
Link: http://lkml.kernel.org/r/1492065666-3816-1-git-send-email-pagupta@redhat.com
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ville Syrjl <ville.syrjala@linux.intel.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull audit updates from Paul Moore:
"Fourteen audit patches for v4.12 that span the full range of fixes,
new features, and internal cleanups.
We have a patches to move to 64-bit timestamps, convert refcounts from
atomic_t to refcount_t, track PIDs using the pid struct instead of
pid_t, convert our own private audit buffer cache to a standard
kmem_cache, log kernel module names when they are unloaded, and
normalize the NETFILTER_PKT to make the userspace folks happier.
From a fixes perspective, the most important is likely the auditd
connection tracking RCU fix; it was a rather brain dead bug that I'll
take the blame for, but thankfully it didn't seem to affect many
people (only one report).
I think the patch subject lines and commit descriptions do a pretty
good job of explaining the details and why the changes are important
so I'll point you there instead of duplicating it here; as usual, if
you have any questions you know where to find us.
We also manage to take out more code than we put in this time, that
always makes me happy :)"
* 'stable-4.12' of git://git.infradead.org/users/pcmoore/audit:
audit: fix the RCU locking for the auditd_connection structure
audit: use kmem_cache to manage the audit_buffer cache
audit: Use timespec64 to represent audit timestamps
audit: store the auditd PID as a pid struct instead of pid_t
audit: kernel generated netlink traffic should have a portid of 0
audit: combine audit_receive() and audit_receive_skb()
audit: convert audit_watch.count from atomic_t to refcount_t
audit: convert audit_tree.count from atomic_t to refcount_t
audit: normalize NETFILTER_PKT
netfilter: use consistent ipv4 network offset in xt_AUDIT
audit: log module name on delete_module
audit: remove unnecessary semicolon in audit_watch_handle_event()
audit: remove unnecessary semicolon in audit_mark_handle_event()
audit: remove unnecessary semicolon in audit_field_valid()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
"Highlights:
IMA:
- provide ">" and "<" operators for fowner/uid/euid rules
KEYS:
- add a system blacklist keyring
- add KEYCTL_RESTRICT_KEYRING, exposes keyring link restriction
functionality to userland via keyctl()
LSM:
- harden LSM API with __ro_after_init
- add prlmit security hook, implement for SELinux
- revive security_task_alloc hook
TPM:
- implement contextual TPM command 'spaces'"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (98 commits)
tpm: Fix reference count to main device
tpm_tis: convert to using locality callbacks
tpm: fix handling of the TPM 2.0 event logs
tpm_crb: remove a cruft constant
keys: select CONFIG_CRYPTO when selecting DH / KDF
apparmor: Make path_max parameter readonly
apparmor: fix parameters so that the permission test is bypassed at boot
apparmor: fix invalid reference to index variable of iterator line 836
apparmor: use SHASH_DESC_ON_STACK
security/apparmor/lsm.c: set debug messages
apparmor: fix boolreturn.cocci warnings
Smack: Use GFP_KERNEL for smk_netlbl_mls().
smack: fix double free in smack_parse_opts_str()
KEYS: add SP800-56A KDF support for DH
KEYS: Keyring asymmetric key restrict method with chaining
KEYS: Restrict asymmetric key linkage using a specific keychain
KEYS: Add a lookup_restriction function for the asymmetric key type
KEYS: Add KEYCTL_RESTRICT_KEYRING
KEYS: Consistent ordering for __key_link_begin and restrict check
KEYS: Add an optional lookup_restriction hook to key_type
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
tty: fix comment for __tty_alloc_driver()
init/main: properly align the multi-line comment
init/main: Fix double "the" in comment
Fix dead URLs to ftp.kernel.org
drivers: Clean up duplicated email address
treewide: Fix typo in xml/driver-api/basics.xml
tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall"
selftests/timers: Spelling s/privledges/privileges/
HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/
net: phy: dp83848: Fix Typo
UBI: Fix typos
Documentation: ftrace.txt: Correct nice value of 120 priority
net: fec: Fix typo in error msg and comment
treewide: Fix typos in printk
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatch updates from Jiri Kosina:
- a per-task consistency model is being added for architectures that
support reliable stack dumping (extending this, currently rather
trivial set, is currently in the works).
This extends the nature of the types of patches that can be applied
by live patching infrastructure. The code stems from the design
proposal made [1] back in November 2014. It's a hybrid of SUSE's
kGraft and RH's kpatch, combining advantages of both: it uses
kGraft's per-task consistency and syscall barrier switching combined
with kpatch's stack trace switching. There are also a number of
fallback options which make it quite flexible.
Most of the heavy lifting done by Josh Poimboeuf with help from
Miroslav Benes and Petr Mladek
[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz
- module load time patch optimization from Zhou Chengming
- a few assorted small fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add missing printk newlines
livepatch: Cancel transition a safe way for immediate patches
livepatch: Reduce the time of finding module symbols
livepatch: make klp_mutex proper part of API
livepatch: allow removal of a disabled patch
livepatch: add /proc/<pid>/patch_state
livepatch: change to a per-task consistency model
livepatch: store function sizes
livepatch: use kstrtobool() in enabled_store()
livepatch: move patching functions into patch.c
livepatch: remove unnecessary object loaded check
livepatch: separate enabled and patched states
livepatch/s390: add TIF_PATCH_PENDING thread flag
livepatch/s390: reorganize TIF thread flag bits
livepatch/powerpc: add TIF_PATCH_PENDING thread flag
livepatch/x86: add TIF_PATCH_PENDING thread flag
livepatch: create temporary klp_update_patch_state() stub
x86/entry: define _TIF_ALLWORK_MASK flags explicitly
stacktrace/x86: add function for detecting reliable stack traces
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID subsystem updates from Jiri Kosina:
- The need for HID_QUIRK_NO_INIT_REPORTS per-device quirk has been
growing dramatically during past years, so the time has come to
switch over the default, and perform the pro-active reading only in
cases where it's really needed (multitouch, wacom).
The only place where this behavior is (in some form) preserved is
hiddev so that we don't introduce userspace-visible change of
behavior.
From Benjamin Tissoires
- HID++ support for power_supply / baterry reporting.
From Benjamin Tissoires and Bastien Nocera
- Vast improvements / rework of DS3 and DS4 in Sony driver.
From Roderick Colenbrander
- Improvment (in terms of getting closer to the Microsoft's
interpretation of slightly ambiguous specification) of logical range
interpretation in case null-state is set in the rdesc.
From Valtteri Heikkilä and Tomasz Kramkowski
- A lot of newly supported device IDs and small assorted fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (71 commits)
HID: usbhid: Add HID_QUIRK_NOGET for Aten CS-1758 KVM switch
HID: asus: support backlight on USB keyboards
HID: wacom: Move wacom_remote_irq and wacom_remote_status_irq
HID: wacom: generic: sync pad events only for actual packets
HID: sony: remove redundant check for -ve err
HID: sony: Make sure to unregister sensors on failure
HID: sony: Make DS4 bt poll interval adjustable
HID: sony: Set proper bit flags on DS4 output report
HID: sony: DS4 use brighter LED colors
HID: sony: Improve navigation controller axis/button mapping
HID: sony: Use DS3 MAC address as unique identifier on USB
HID: logitech-hidpp: add a sysfs file to tell we support power_supply
HID: logitech-hidpp: enable HID++ 1.0 battery reporting
HID: logitech-hidpp: add support for battery status for the K750
HID: logitech-hidpp: battery: provide CAPACITY_LEVEL
HID: logitech-hidpp: rename battery level into capacity
HID: logitech-hidpp: battery: provide ONLINE property
HID: logitech-hidpp: notify battery on connect
HID: logitech-hidpp: return an error if the queried feature is not present
HID: logitech-hidpp: create the battery for all types of HID++ devices
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
"This is the bulk of pin control changes for the v4.12 cycle.
The extra week before the merge window actually resulted in some of
the type of fixes that usually arrive after the merge window already
starting to trickle in from eager developers using -next, I'm
impressed.
I have recruited a Samsung subsubsystem maintainer (Krzysztof) to deal
with the onset of Samsung patches. It works great.
Apart from that it is a boring round, just incremental updates and
fixes all over the place, no serious core changes or anything exciting
like that. The most pleasing to see is Julia Cartwrights work to audit
the irqchip-providing drivers for realtime locking compliance. It's
one of those "I should really get around to looking into that" things
that have been on my TODO list since forever.
Summary:
Core changes:
- add bi-directional and output-enable pin configurations to the
generic bindings and generic pin controlling core.
New drivers or subdrivers:
- Armada 37xx SoC pin controller and GPIO support.
- Axis ARTPEC-6 SoC pin controller support.
- AllWinner A64 R_PIO controller support, and opening up the
AllWinner sunxi driver for ARM64 use.
- Rockchip RK3328 support.
- Renesas R-Car H3 ES2.0 support.
- STM32F469 support in the STM32 driver.
- Aspeed G4 and G5 pin controller support.
Improvements:
- a whole slew of realtime improvements to drivers implementing
irqchips: BCM, AMD, SiRF, sunxi, rockchip.
- switch meson driver to get the GPIO ranges from the device tree.
- input schmitt trigger support on the Rockchip driver.
- enable the sunxi (AllWinner) driver to also be used on ARM64
silicon.
- name the Qualcomm QDF2xxx GPIO lines.
- support GMMR GPIO regions on the Intel Cherryview. This fixes a
serialization problem on these platforms.
- pad retention support for the Samsung Exynos 5433.
- handle suspend-to-ram in the AT91-pio4 driver.
- pin configuration support in the Aspeed driver.
Cleanups:
- the final name of Rockchip RK1108 was RV1108 so rename the driver
and variables to stay consistent"
* tag 'pinctrl-v4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (80 commits)
pinctrl: mediatek: Add missing pinctrl bindings for mt7623
pinctrl: artpec6: Fix return value check in artpec6_pmx_probe()
pinctrl: artpec6: Remove .owner field for driver
pinctrl: tegra: xusb: Silence sparse warnings
ARM: at91/at91-pinctrl documentation: fix spelling mistake: "contoller" -> "controller"
pinctrl: make artpec6 explicitly non-modular
pinctrl: aspeed: g5: Add pinconf support
pinctrl: aspeed: g4: Add pinconf support
pinctrl: aspeed: Add core pinconf support
pinctrl: aspeed: Document pinconf in devicetree bindings
pinctrl: Add st,stm32f469-pinctrl compatible to stm32-pinctrl
pinctrl: stm32: Add STM32F469 MCU support
Documentation: dt: Remove ngpios from stm32-pinctrl binding
pinctrl: stm32: replace device_initcall() with arch_initcall()
pinctrl: stm32: add possibility to use gpio-ranges to declare bank range
pinctrl: armada-37xx: Add gpio support
pinctrl: armada-37xx: Add pin controller support for Armada 37xx
pinctrl: dt-bindings: Add documentation for Armada 37xx pin controllers
pinctrl: core: Make pinctrl_init_controller() static
pinctrl: generic: Add bi-directional and output-enable
...
|
|
Pull MMC updates from Ulf Hansson:
"MMC core:
- Continue to re-factor code to prepare for eMMC CMDQ and blkmq support
- Introduce queue semantics to prepare for eMMC CMDQ and blkmq support
- Add helper functions to manage temporary enable/disable of eMMC CMDQ
- Improve wait-busy detection for SDIO
MMC host:
- cavium: Add driver to support Cavium controllers
- cavium: Extend Cavium driver to support Octeon and ThunderX SOCs
- bcm2835: Add new driver for Broadcom BCM2835 controller
- sdhci-xenon: Add driver to support Marvell Xenon SDHCI controller
- sdhci-tegra: Add support for the Tegra186 variant
- sdhci-of-esdhc: Support for UHS-I SD cards
- sdhci-of-esdhc: Support for eMMC HS200 cards
- sdhci-cadence: Add eMMC HS400 enhanced strobe support
- sdhci-esdhc-imx: Reset tuning circuit when needed
- sdhci-pci: Modernize and clean-up some PM related code
- sdhci-pci: Avoid re-tuning at runtime PM for some Intel devices
- sdhci-pci|acpi: Use aggressive PM for some Intel BYT controllers
- sdhci: Re-factoring and modernizations
- sdhci: Optimize delay loops
- sdhci: Improve register dump print format
- sdhci: Add support for the Command Queue Engine
- meson-gx: Various improvements and clean-ups
- meson-gx: Add support for CMD23
- meson-gx: Basic tuning support to avoid CRC errors
- s3cmci: Enable probing via DT
- mediatek: Improve tuning support for eMMC HS200 and HS400 mode
- tmio: Improve DMA support
- tmio: Use correct response for CMD12
- dw_mmc: Minor improvements and clean-ups"
* tag 'mmc-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (148 commits)
mmc: sdhci-of-esdhc: limit SD clock for ls1012a/ls1046a
mmc: sdhci-of-esdhc: poll ESDHC_CLOCK_STABLE bit with udelay
mmc: sdhci-xenon: Fix default value of LOGIC_TIMING_ADJUST for eMMC5.0 PHY
mmc: sdhci-xenon: Fix the work flow in xenon_remove().
MIPS: Octeon: cavium_octeon_defconfig: Enable Octeon MMC
mmc: sdhci-xenon: Remove redundant dev_err call in get_dt_pad_ctrl_data()
mmc: cavium: Use module_pci_driver to simplify the code
mmc: cavium: Add MMC support for Octeon SOCs.
mmc: cavium: Fix detection of block or byte addressing.
mmc: core: Export API to allow hosts to get the card address
mmc: sdio: Fix sdio wait busy implement limitation
mmc: sdhci-esdhc-imx: reset tuning circuit when power on mmc card
clk: apn806: fix spelling mistake: "mising" -> "missing"
mmc: sdhci-of-esdhc: add delay between tuning cycles
mmc: sdhci: Control the delay between tuning commands
mmc: sdhci-of-esdhc: add tuning support
mmc: sdhci-of-esdhc: add support for signal voltage switch
mmc: sdhci-of-esdhc: add peripheral clock support
mmc: sdhci-pci: Allow for 3 bytes from Intel DSM
mmc: cavium: Fix a shift wrapping bug
...
|
|
Pull networking updates from David Millar:
"Here are some highlights from the 2065 networking commits that
happened this development cycle:
1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)
2) Add a generic XDP driver, so that anyone can test XDP even if they
lack a networking device whose driver has explicit XDP support
(me).
3) Sparc64 now has an eBPF JIT too (me)
4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
Starovoitov)
5) Make netfitler network namespace teardown less expensive (Florian
Westphal)
6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)
7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)
8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)
9) Multiqueue support in stmmac driver (Joao Pinto)
10) Remove TCP timewait recycling, it never really could possibly work
well in the real world and timestamp randomization really zaps any
hint of usability this feature had (Soheil Hassas Yeganeh)
11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
Aleksandrov)
12) Add socket busy poll support to epoll (Sridhar Samudrala)
13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
and several others)
14) IPSEC hw offload infrastructure (Steffen Klassert)"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
tipc: refactor function tipc_sk_recv_stream()
tipc: refactor function tipc_sk_recvmsg()
net: thunderx: Optimize page recycling for XDP
net: thunderx: Support for XDP header adjustment
net: thunderx: Add support for XDP_TX
net: thunderx: Add support for XDP_DROP
net: thunderx: Add basic XDP support
net: thunderx: Cleanup receive buffer allocation
net: thunderx: Optimize CQE_TX handling
net: thunderx: Optimize RBDR descriptor handling
net: thunderx: Support for page recycling
ipx: call ipxitf_put() in ioctl error path
net: sched: add helpers to handle extended actions
qed*: Fix issues in the ptp filter config implementation.
qede: Fix concurrency issue in PTP Tx path processing.
stmmac: Add support for SIMATIC IOT2000 platform
net: hns: fix ethtool_get_strings overflow in hns driver
tcp: fix wraparound issue in tcp_lp
bpf, arm64: fix jit branch offset related to ldimm64
bpf, arm64: implement jiting of BPF_XADD
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"Here is the crypto update for 4.12:
API:
- Add batch registration for acomp/scomp
- Change acomp testing to non-unique compressed result
- Extend algorithm name limit to 128 bytes
- Require setkey before accept(2) in algif_aead
Algorithms:
- Add support for deflate rfc1950 (zlib)
Drivers:
- Add accelerated crct10dif for powerpc
- Add crc32 in stm32
- Add sha384/sha512 in ccp
- Add 3des/gcm(aes) for v5 devices in ccp
- Add Queue Interface (QI) backend support in caam
- Add new Exynos RNG driver
- Add ThunderX ZIP driver
- Add driver for hardware random generator on MT7623 SoC"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (101 commits)
crypto: stm32 - Fix OF module alias information
crypto: algif_aead - Require setkey before accept(2)
crypto: scomp - add support for deflate rfc1950 (zlib)
crypto: scomp - allow registration of multiple scomps
crypto: ccp - Change ISR handler method for a v5 CCP
crypto: ccp - Change ISR handler method for a v3 CCP
crypto: crypto4xx - rename ce_ring_contol to ce_ring_control
crypto: testmgr - Allow ecb(cipher_null) in FIPS mode
Revert "crypto: arm64/sha - Add constant operand modifier to ASM_EXPORT"
crypto: ccp - Disable interrupts early on unload
crypto: ccp - Use only the relevant interrupt bits
hwrng: mtk - Add driver for hardware random generator on MT7623 SoC
dt-bindings: hwrng: Add Mediatek hardware random generator bindings
crypto: crct10dif-vpmsum - Fix missing preempt_disable()
crypto: testmgr - replace compression known answer test
crypto: acomp - allow registration of multiple acomps
hwrng: n2 - Use devm_kcalloc() in n2rng_probe()
crypto: chcr - Fix error handling related to 'chcr_alloc_shash'
padata: get_next is never NULL
crypto: exynos - Add new Exynos RNG driver
...
|
|
Jon Maloy says:
====================
tipc: refactor socket receive functions
We try to make the functions tipc_sk_recvmsg() and
tipc_sk_recvstream() more readable.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We try to make this function more readable by improving variable names
and comments, using more stack variables, and doing some smaller changes
to the logics. We also rename the function to make it consistent with
naming conventions used elsewhere in the code.
Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We try to make this function more readable by improving variable names
and comments, plus some minor changes to the logics.
Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sunil Goutham says:
====================
net: thunderx: Adds XDP support
This patch series adds support for XDP to ThunderX NIC driver
which is used on CN88xx, CN81xx and CN83xx platforms.
Patches 1-4 are performance improvement and cleanup patches
which are done keeping XDP performance bottlenecks in view.
Rest of the patches adds actual XDP support.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Driver follows a method of taking one extra reference on the
page for recycling which is fine in usual packet path where
each 64KB page is segmented into multiple receive buffers.
But in XDP mode since there is just one receive buffer per
page taking extra page reference itself becomes big bottleneck
consuming ~50% of CPU cycles due to atomic operations.
This patch adds a internal ref count in pgcache for each
page and additional page references are taken in a batch
instead of just one at a time. Internal i.e 'pgcache->ref_count'
and page's i.e 'page->_refcount' counters are compared to check
page's recyclability.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When in XDP mode reserve XDP_PACKET_HEADROOM bytes at the start
of receive buffer for XDP program to modify headers and adjust
packet start. Additional code changes done to handle such packets.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Adds support for XDP_TX i.e transmits packet out of
the XDP TX queue mapped to the corresponding Rx queue
on which packet is received.
Since SQ for XDP TX will be used only on a single cpu i.e
SQ description creation and freeing, using atomic free count
is not necessary and will become a bottleneck. Hence added
a separate 'xdp_free_cnt' used for SQs designated for XDP
to track descriptor free count.
Changes also include
- A new entry 'xdp_page' is added to save transmitted packet's
page pointer for later cleanup.
- XDP Tx SQ's doorbell is ringed once per NAPI instance.
- Retrieving designated SQ for packets being sent out by stack
via 'nicvf_xmit'.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Adds support for XDP_DROP.
Also since in XDP mode there is just a single buffer per page,
made changes to recycle DMA mapping info as well along with pages.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Adds basic XDP support i.e attaching a BPF program to an
interface. Also takes care of allocating separate Tx queues
for XDP path and for network stack packet transmission.
This patch doesn't support handling of any of the XDP actions,
all are treated as XDP_PASS i.e packets will be handed over to
the network stack.
Changes also involve allocating one receive buffer per page in XDP
mode and multiple in normal mode i.e when no BPF program is attached.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Get rid of unnecessary double pointer references and type casting
in receive buffer allocation code.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Optimized CQE handling with below changes
- Feeing descriptors back to SQ in bulk i.e once per NAPI
instance instead for every CQE_TX, this will reduce number
of atomic updates to 'sq->free_cnt'.
- Checking errors in CQE_TX and CQE_RX before calling appropriate
fn()s to update error stats i.e reduce branching.
Also removed debug messages in packet handling path which otherwise
causes issues if DEBUG is enabled.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Receive buffer's physical address or iova will anyway not
go beyond 49bits, since it is the max supported HW address.
As per perf, updating bitfields i.e buf_addr:42 in RBDR
descriptor entry consumes lots of cpu cycles, hence changed
it to a 64bit field with alignment requirements taken care of.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Adds support for page recycling for allocating receive buffers
to reduce cost of refilling RBDR ring. Also got rid of using
compound pages when pagesize is 4K, only order-0 pages now.
Only page is recycled, DMA mappings still needs to be done for
every receive buffer allocated due to following constraints
- Cannot have just one receive buffer per 64KB page.
- There is just one buffer ring shared across 8 Rx queues, so
buffers of same page can go to any Rx queue.
- HW gives buffer address where packet has been DMA'ed and not
the index into buffer ring.
This makes it not possible to resue DMA mapping info. So unfortunately
have to go through costly mapping route for every buffer.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We should call ipxitf_put() if the copy_to_user() fails.
Reported-by: 李强 <liqiang6-s@360.cn>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Jump is now the only one using value action opcode. This is going to
change soon. So introduce helpers to work with this. Convert TC_ACT_JUMP.
This also fixes the TC_ACT_JUMP check, which is incorrectly done as a
bit check, not a value check.
Fixes: e0ee84ded796 ("net sched actions: Complete the JUMPX opcode")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sudarsana Reddy Kalluru says:
====================
qed*: PTP bug fixes.
The series addresses couple of issues in the PTP implementation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
PTP hardware filter configuration performed by the driver for a given
user requested config is not correct for some of the PTP modes.
Following changes are needed for PTP config-filter implementation.
1. NIG_REG_TX_PTP_EN register - Bits 0/1/2 respectively enables
TimeSync/"V1 frame format support"/"V2 frame format support" on
the TX side. Set the associated bits based on the user request.
2. ptp4l application fails to operate in Peer Delay mode. Following
changes are needed to fix this,
a. Driver should enable (set to 0) DA #1-related bits for IPv4,
IPv6 and MAC destination addresses in these registers:
NIG_REG_TX_LLH_PTP_RULE_MASK
NIG_REG_LLH_PTP_RULE_MASK
b. NIG_REG_LLH_PTP_PARAM_MASK/NIG_REG_TX_LLH_PTP_PARAM_MASK should
be set to 0x0 in all modes.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
PTP Tx timestamping data structures are not protected against the
concurrent access in the Tx paths. Protecting the same using atomic
bit locks.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IOT2000 is industrial controller platform, derived from the Intel
Galileo Gen2 board. The variant IOT2020 comes with one LAN port, the
IOT2040 has two of them. They can be told apart based on the board asset
tag in the DMI table.
Based on patch by Sascha Weisenberger.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Sascha Weisenberger <sascha.weisenberger@siemens.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
hns_get_sset_count() returns HNS_NET_STATS_CNT and the data space allocated
is not enough for ethtool_get_strings(), which will cause random memory
corruption.
When SLAB and DEBUG_SLAB are both enabled, memory corruptions like the
the following can be observed without this patch:
[ 43.115200] Slab corruption (Not tainted): Acpi-ParseExt start=ffff801fb0b69030, len=80
[ 43.115206] Redzone: 0x9f911029d006462/0x5f78745f31657070.
[ 43.115208] Last user: [<5f7272655f746b70>](0x5f7272655f746b70)
[ 43.115214] 010: 70 70 65 31 5f 74 78 5f 70 6b 74 00 6b 6b 6b 6b ppe1_tx_pkt.kkkk
[ 43.115217] 030: 70 70 65 31 5f 74 78 5f 70 6b 74 5f 6f 6b 00 6b ppe1_tx_pkt_ok.k
[ 43.115218] Next obj: start=ffff801fb0b69098, len=80
[ 43.115220] Redzone: 0x706d655f6f666966/0x9f911029d74e35b.
[ 43.115229] Last user: [<ffff0000084b11b0>](acpi_os_release_object+0x28/0x38)
[ 43.115231] 000: 74 79 00 6b 6b 6b 6b 6b 70 70 65 31 5f 74 78 5f ty.kkkkkppe1_tx_
[ 43.115232] 010: 70 6b 74 5f 65 72 72 5f 63 73 75 6d 5f 66 61 69 pkt_err_csum_fai
Signed-off-by: Timmy Li <lixiaoping3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Be careful when comparing tcp_time_stamp to some u32 quantity,
otherwise result can be surprising.
Fixes: 7c106d7e782b ("[TCP]: TCP Low Priority congestion control")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When the instruction right before the branch destination is
a 64 bit load immediate, we currently calculate the wrong
jump offset in the ctx->offset[] array as we only account
one instruction slot for the 64 bit load immediate although
it uses two BPF instructions. Fix it up by setting the offset
into the right slot after we incremented the index.
Before (ldimm64 test 1):
[...]
00000020: 52800007 mov w7, #0x0 // #0
00000024: d2800060 mov x0, #0x3 // #3
00000028: d2800041 mov x1, #0x2 // #2
0000002c: eb01001f cmp x0, x1
00000030: 54ffff82 b.cs 0x00000020
00000034: d29fffe7 mov x7, #0xffff // #65535
00000038: f2bfffe7 movk x7, #0xffff, lsl #16
0000003c: f2dfffe7 movk x7, #0xffff, lsl #32
00000040: f2ffffe7 movk x7, #0xffff, lsl #48
00000044: d29dddc7 mov x7, #0xeeee // #61166
00000048: f2bdddc7 movk x7, #0xeeee, lsl #16
0000004c: f2ddddc7 movk x7, #0xeeee, lsl #32
00000050: f2fdddc7 movk x7, #0xeeee, lsl #48
[...]
After (ldimm64 test 1):
[...]
00000020: 52800007 mov w7, #0x0 // #0
00000024: d2800060 mov x0, #0x3 // #3
00000028: d2800041 mov x1, #0x2 // #2
0000002c: eb01001f cmp x0, x1
00000030: 540000a2 b.cs 0x00000044
00000034: d29fffe7 mov x7, #0xffff // #65535
00000038: f2bfffe7 movk x7, #0xffff, lsl #16
0000003c: f2dfffe7 movk x7, #0xffff, lsl #32
00000040: f2ffffe7 movk x7, #0xffff, lsl #48
00000044: d29dddc7 mov x7, #0xeeee // #61166
00000048: f2bdddc7 movk x7, #0xeeee, lsl #16
0000004c: f2ddddc7 movk x7, #0xeeee, lsl #32
00000050: f2fdddc7 movk x7, #0xeeee, lsl #48
[...]
Also, add a couple of test cases to make sure JITs pass
this test. Tested on Cavium ThunderX ARMv8. The added
test cases all pass after the fix.
Fixes: 8eee539ddea0 ("arm64: bpf: fix out-of-bounds read in bpf2a64_offset()")
Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Xi Wang <xi.wang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This work adds BPF_XADD for BPF_W/BPF_DW to the arm64 JIT and therefore
completes JITing of all BPF instructions, meaning we can thus also remove
the 'notyet' label and do not need to fall back to the interpreter when
BPF_XADD is used in a program!
This now also brings arm64 JIT in line with x86_64, s390x, ppc64, sparc64,
where all current eBPF features are supported.
BPF_W example from test_bpf:
.u.insns_int = {
BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
BPF_ST_MEM(BPF_W, R10, -40, 0x10),
BPF_STX_XADD(BPF_W, R10, R0, -40),
BPF_LDX_MEM(BPF_W, R0, R10, -40),
BPF_EXIT_INSN(),
},
[...]
00000020: 52800247 mov w7, #0x12 // #18
00000024: 928004eb mov x11, #0xffffffffffffffd8 // #-40
00000028: d280020a mov x10, #0x10 // #16
0000002c: b82b6b2a str w10, [x25,x11]
// start of xadd mapping:
00000030: 928004ea mov x10, #0xffffffffffffffd8 // #-40
00000034: 8b19014a add x10, x10, x25
00000038: f9800151 prfm pstl1strm, [x10]
0000003c: 885f7d4b ldxr w11, [x10]
00000040: 0b07016b add w11, w11, w7
00000044: 880b7d4b stxr w11, w11, [x10]
00000048: 35ffffab cbnz w11, 0x0000003c
// end of xadd mapping:
[...]
BPF_DW example from test_bpf:
.u.insns_int = {
BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
BPF_ST_MEM(BPF_DW, R10, -40, 0x10),
BPF_STX_XADD(BPF_DW, R10, R0, -40),
BPF_LDX_MEM(BPF_DW, R0, R10, -40),
BPF_EXIT_INSN(),
},
[...]
00000020: 52800247 mov w7, #0x12 // #18
00000024: 928004eb mov x11, #0xffffffffffffffd8 // #-40
00000028: d280020a mov x10, #0x10 // #16
0000002c: f82b6b2a str x10, [x25,x11]
// start of xadd mapping:
00000030: 928004ea mov x10, #0xffffffffffffffd8 // #-40
00000034: 8b19014a add x10, x10, x25
00000038: f9800151 prfm pstl1strm, [x10]
0000003c: c85f7d4b ldxr x11, [x10]
00000040: 8b07016b add x11, x11, x7
00000044: c80b7d4b stxr w11, x11, [x10]
00000048: 35ffffab cbnz w11, 0x0000003c
// end of xadd mapping:
[...]
Tested on Cavium ThunderX ARMv8, test suite results after the patch:
No JIT: [ 3751.855362] test_bpf: Summary: 311 PASSED, 0 FAILED, [0/303 JIT'ed]
With JIT: [ 3573.759527] test_bpf: Summary: 311 PASSED, 0 FAILED, [303/303 JIT'ed]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|