Commit Graph

873808 Commits (d2cd795c4ece1a24fda170c35eeb4f17d9826cbb)

Author SHA1 Message Date
Dave Chinner 756c6f0f7e xfs: reverse search directory freespace indexes
When a directory is growing rapidly, new blocks tend to get added at
the end of the directory. These end up at the end of the freespace
index, and when the directory gets large finding these new
freespaces gets expensive. The code does a linear search across the
freespace index from the first block in the directory to the last,
meaning the newly added space is the last index searched.

Instead, do a reverse order index search, starting from the last
block and index in the freespace index. This makes most lookups for
free space on rapidly growing directories O(1) instead of O(N), but
should not have any impact on random insert workloads because the
average search length is the same regardless of which end of the
array we start at.
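
As a sketch, the idea is just a backwards linear scan over the bests
array (illustrative C only -- the names and the 0xffff "unused" marker
are simplifications, not the actual xfs code):

  static int find_free_rev(const unsigned short *bests, int nents,
                           unsigned short needed)
  {
          int i;

          /* Scan from the highest index down: blocks appended to a
           * rapidly growing directory are found first. */
          for (i = nents - 1; i >= 0; i--)
                  if (bests[i] != 0xffff && bests[i] >= needed)
                          return i;
          return -1;      /* no block has enough free space */
  }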

The result is a major improvement in large directory grow rates:

		create time(sec) / rate (files/s)
 File count      vanilla          Prev commit         Patched
   10k         0.41 / 24.3k      0.42 / 23.8k      0.41 / 24.3k
   20k         0.74 / 27.0k      0.76 / 26.3k      0.75 / 26.7k
  100k         3.81 / 26.4k      3.47 / 28.8k      3.27 / 30.6k
  200k         8.58 / 23.3k      7.19 / 27.8k      6.71 / 29.8k
    1M        85.69 / 11.7k     48.53 / 20.6k     37.67 / 26.5k
    2M       280.31 /  7.1k    130.14 / 15.3k     79.55 / 25.2k
   10M      3913.26 /  2.5k                      552.89 / 18.1k

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner 610125ab1e xfs: speed up directory bestfree block scanning
When running a "create millions inodes in a directory" test
recently, I noticed we were spending a huge amount of time
converting freespace block headers from disk format to in-memory
format:

 31.47%  [kernel]  [k] xfs_dir2_node_addname
 17.86%  [kernel]  [k] xfs_dir3_free_hdr_from_disk
  3.55%  [kernel]  [k] xfs_dir3_free_bests_p

We shouldn't be hitting the best free block scanning code so hard
when doing sequential directory creates, and it turns out there's
a highly suboptimal loop searching the best free array in
the freespace block - it decodes the block header before checking
each entry inside a loop, instead of decoding the header once before
running the entry search loop.
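
Conceptually the fix is plain loop hoisting; a hedged sketch, where
decode_free_hdr() and the struct fields are illustrative rather than
the actual xfs names:

  /* Before: the header was decoded on every loop iteration. */
  for (i = 0; i < nvalid; i++) {
          decode_free_hdr(&hdr, disk_hdr);        /* repeated work */
          if (hdr.bests[i] >= needed)
                  break;
  }

  /* After: decode the header once, then run the search loop. */
  decode_free_hdr(&hdr, disk_hdr);
  for (i = 0; i < hdr.nvalid; i++)
          if (hdr.bests[i] >= needed)
                  break;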

This makes a massive difference to create rates. Profile now looks
like this:

  13.15%  [kernel]  [k] xfs_dir2_node_addname
   3.52%  [kernel]  [k] xfs_dir3_leaf_check_int
   3.11%  [kernel]  [k] xfs_log_commit_cil

And the wall time/average file create rate differences are
just as stark:

		create time(sec) / rate (files/s)
 File count      vanilla            patched
   10k         0.41 / 24.3k      0.42 / 23.8k
   20k         0.74 / 27.0k      0.76 / 26.3k
  100k         3.81 / 26.4k      3.47 / 28.8k
  200k         8.58 / 23.3k      7.19 / 27.8k
    1M        85.69 / 11.7k     48.53 / 20.6k
    2M       280.31 /  7.1k    130.14 / 15.3k

The larger the directory, the bigger the performance improvement.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner 0e822255f9 xfs: factor free block index lookup from xfs_dir2_node_addname_int()
Simplify the logic in xfs_dir2_node_addname_int() by factoring out
the free block index lookup code that finds a block with enough free
space for the entry to be added. The code that is moved gets a major
cleanup at the same time, but there is no algorithm change here.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner a07258a695 xfs: factor data block addition from xfs_dir2_node_addname_int()
Factor out the code that adds a data block to a directory from
xfs_dir2_node_addname_int(). This makes the code flow cleaner and
more obvious and provides clear isolation of upcoming optimisations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner aee7754bbe xfs: move xfs_dir2_addname()
This gets rid of the need for a forward declaration of the static
function xfs_dir2_addname_int() and readies the code for factoring
of xfs_dir2_addname_int().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:56 -07:00
Darrick J. Wong 39ee2239a5 xfs: remove all *_ITER_CONTINUE values
Iterator functions already use 0 to signal "continue iterating", so get
rid of the #defines and just do it directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-30 22:43:56 -07:00
Christoph Hellwig f3b17320db openrisc: map as uncached in ioremap
Openrisc is the only architecture not mapping ioremap as uncached,
which has been the default since the Linux 2.6.x days.  Switch it
over to implement uncached semantics by default.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stafford Horne <shorne@gmail.com>
2019-08-31 11:57:53 +09:00
Stafford Horne eabe7e9a21 or1k: dts: Add ethoc device to SMP devicetree
This patch adds the ethoc device configuration to the OpenRISC basic SMP
device tree config.  This was tested with qemu.

Signed-off-by: Stafford Horne <shorne@gmail.com>
2019-08-31 11:56:19 +09:00
Dan Williams c312ef1763 libata/ahci: Drop PCS quirk for Denverton and beyond
The Linux ahci driver has historically implemented a configuration fixup
for platforms / platform-firmware that fails to enable the ports prior
to OS hand-off at boot. The fixup was originally implemented way back
before ahci moved from drivers/scsi/ to drivers/ata/, and was updated in
2007 via commit 49f2909039 "ahci: update PCS programming". The quirk
sets a port-enable bitmap in the PCS register at offset 0x92.

This quirk could be applied generically up until the arrival of the
Denverton (DNV) platform. The DNV AHCI controller architecture supports
more than 6 ports and along with that the PCS register location and
format were updated to allow for more possible ports in the bitmap. DNV
AHCI expands the register to 32-bits and moves it to offset 0x94.

As it stands there are no known problem reports with existing Linux
trying to set bits at offset 0x92, which indicates that the quirk is not
applicable. Likely it is not applicable on a wider range of platforms,
but it is difficult to discern which platforms if any still depend on
the quirk.

Rather than trying to fix the PCS quirk to consider the DNV register
layout, require explicit opt-in. The assumption is that the OS driver
need not touch this register, and platforms can be added with a new
board_ahci_pcs7 board-id when / if problematic platforms are found in the
future. The logic in ahci_intel_pcs_quirk() looks for all Intel AHCI
instances with "legacy" board-ids and otherwise skips the quirk if the
board was matched by class-code.
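
For illustration only, the two layouts described above could be
programmed along these lines (a sketch using the offsets from the text;
not the driver's actual code):

  #define PCS_LEGACY_OFF  0x92    /* 16-bit port-enable bitmap, pre-DNV */
  #define PCS_DNV_OFF     0x94    /* 32-bit port-enable bitmap on DNV */

  static void pcs_enable_ports(struct pci_dev *pdev, u32 port_map, bool dnv)
  {
          if (dnv) {
                  u32 pcs;

                  pci_read_config_dword(pdev, PCS_DNV_OFF, &pcs);
                  pci_write_config_dword(pdev, PCS_DNV_OFF, pcs | port_map);
          } else {
                  u16 pcs;

                  pci_read_config_word(pdev, PCS_LEGACY_OFF, &pcs);
                  pci_write_config_word(pdev, PCS_LEGACY_OFF,
                                        pcs | (u16)port_map);
          }
  }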

Reported-by: Stephen Douthit <stephend@silicom-usa.com>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Stephen Douthit <stephend@silicom-usa.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-30 20:53:42 -06:00
Denis Efremov 974ceb21fc udp: Remove unlikely() from IS_ERR*() condition
"unlikely(IS_ERR_OR_NULL(x))" is excessive. IS_ERR_OR_NULL() already uses
unlikely() internally.
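
The change is mechanical; an illustrative before/after (the variable is
made up, not the exact udp source):

  if (unlikely(IS_ERR_OR_NULL(skb)))      /* before: redundant unlikely() */
          return -ENOENT;

  if (IS_ERR_OR_NULL(skb))                /* after: the macro already
                                           * contains unlikely() */
          return -ENOENT;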

Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 19:49:37 -07:00
Denis Efremov 7cf92ccb85 net/mlx5e: Remove unlikely() from WARN*() condition
"unlikely(WARN_ON_ONCE(x))" is excessive. WARN_ON_ONCE() already uses
unlikely() internally.

Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: Boris Pismenny <borisp@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netdev@vger.kernel.org
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 19:49:03 -07:00
Stafford Horne ae2930583c or1k: dts: Fix ethoc network configuration in or1ksim devicetree
This fixes several issues with the ethoc network device config.

First off, the compatible property used an obsolete compatibility
string; this caused the initialization to be skipped.  Next, the
register map was not given enough space to allocate ring descriptors,
which caused module initialization to abort.  Finally, we need to mark
this device as big-endian as needed by openrisc.

I tested this in qemu; the setup is documented on the qemu wiki:

  https://wiki.qemu.org/Documentation/Platforms/OpenRISC

Signed-off-by: Stafford Horne <shorne@gmail.com>
2019-08-31 11:46:59 +09:00
Linus Torvalds e0f14b8ca3 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "Fix a potential crash in the ccp driver"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: ccp - Ignore unconfigured CCP device on suspend/resume
2019-08-30 18:56:08 -07:00
Tejun Heo 0feacaa216 writeback: don't access page->mapping directly in track_foreign_dirty TP
page->mapping may encode different values and page_mapping() should
always be used to access the mapping pointer.  The track_foreign_dirty
tracepoint was incorrectly accessing page->mapping directly.  Use
page_mapping() instead.  Also, add NULL checks while at it.
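
Illustratively, the tracepoint assignment goes from the raw field to
the accessor plus a NULL check (a sketch, not the exact TP code):

  struct address_space *mapping = page_mapping(page); /* not page->mapping */
  struct inode *inode = mapping ? mapping->host : NULL;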

Fixes: 3a8e9ac89e ("writeback: add tracepoints for cgroup foreign writebacks")
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-30 19:54:28 -06:00
Linus Torvalds ab9bb6318b Partially revert "kfifo: fix kfifo_alloc() and kfifo_init()"
Commit dfe2a77fd2 ("kfifo: fix kfifo_alloc() and kfifo_init()") made
the kfifo code round the number of elements up.  That was good for
__kfifo_alloc(), but it's actually wrong for __kfifo_init().

The difference? __kfifo_alloc() will allocate the rounded-up number of
elements, but __kfifo_init() uses an allocation done by the caller.  We
can't just say "use more elements than the caller allocated", and have
to round down.
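
In terms of the helpers in include/linux/log2.h, the distinction amounts
to this (simplified sketch, not the exact kfifo code):

  /* __kfifo_alloc(): we do the allocation ourselves, so rounding the
   * element count up to a power of two is safe. */
  size = roundup_pow_of_two(size);

  /* __kfifo_init(): the caller already allocated the buffer, so we can
   * only round down -- we must not use elements that don't exist. */
  size = rounddown_pow_of_two(size);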

The good news? All the normal cases will be using power-of-two arrays
anyway, and most users of kfifo's don't use kfifo_init() at all, but one
of the helper macros to declare a KFIFO that enforce the proper
power-of-two behavior.  But it looks like at least ibmvscsis might be
affected.

The bad news? Will Deacon refers to an old thread and points out
that the memory ordering in kfifo's is questionable.  See

  https://lore.kernel.org/lkml/20181211034032.32338-1-yuleixzhang@tencent.com/

for more.

Fixes: dfe2a77fd2 ("kfifo: fix kfifo_alloc() and kfifo_init()")
Reported-by: laokz <laokz@foxmail.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:47:15 -07:00
Al Viro ee594bfff3 fs/namei.c: new helper - legitimize_root()
identical logic in unlazy_walk() and unlazy_child()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30 21:30:13 -04:00
Al Viro ce6595a28a kill the last users of user_{path,lpath,path_dir}()
old wrappers with few callers remaining; put them out of their misery...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30 21:30:13 -04:00
Al Viro 6b61aed06a namei.h: get the comments on LOOKUP_... in sync with reality
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30 21:30:12 -04:00
Al Viro fbb7d9d56d kill LOOKUP_NO_EVAL, don't bother including namei.h from audit.h
The former has no users left; the latter was only there to get LOOKUP_...
values to the remapper in audit_inode(), and that's an ex-parrot now.

All places that use symbols from namei.h include it either directly
or (in a few cases) via a local header, like fs/autofs/autofs_i.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30 21:29:32 -04:00
Shakeel Butt 6c1c280805 mm: memcontrol: fix percpu vmstats and vmevents flush
Instead of using raw_cpu_read(), use per_cpu() to read the actual data of
the corresponding cpu; otherwise we end up reading the current cpu's data
once for each online CPU.
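
An illustrative sketch of the difference ('counter' is a made-up percpu
variable, not the memcg code itself):

  DEFINE_PER_CPU(long, counter);

  static long sum_counter(void)
  {
          long sum = 0;
          int cpu;

          /* per_cpu() reads each cpu's copy; raw_cpu_read(counter)
           * here would re-read the current cpu's copy every time. */
          for_each_online_cpu(cpu)
                  sum += per_cpu(counter, cpu);
          return sum;
  }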

Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com
Fixes: bb65f89b7d ("mm: memcontrol: flush percpu vmevents before releasing memcg")
Fixes: c350a99ea2 ("mm: memcontrol: flush percpu vmstats before releasing memcg")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Michal Hocko d2e5fb927e mm, memcg: do not set reclaim_state on soft limit reclaim
Adric Blake has noticed[1] the following warning:

  WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
  [...]
  Call Trace:
   mem_cgroup_shrink_node+0x9b/0x1d0
   mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
   balance_pgdat+0x276/0x540
   kswapd+0x200/0x3f0
   ? wait_woken+0x80/0x80
   kthread+0xfd/0x130
   ? balance_pgdat+0x540/0x540
   ? kthread_park+0x80/0x80
   ret_from_fork+0x35/0x40
  ---[ end trace 727343df67b2398a ]---

which tells us that soft limit reclaim is about to overwrite the
reclaim_state configured up in the call chain (kswapd in this case but
the direct reclaim is equally possible).  This means that reclaim stats
would get misleading once the soft reclaim returns and another reclaim
is done.

Fix the warning by dropping set_task_reclaim_state from the soft reclaim
which is always called with reclaim_state set up.

[1] http://lkml.kernel.org/r/CAE1jjeePxYPvw1mw2B3v803xHVR_BNnz0hQUY_JDMN8ny29M6w@mail.gmail.com

Link: http://lkml.kernel.org/r/20190828071808.20410-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Adric Blake <promarbler14@gmail.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Dmitry Safonov a6c135bb1a mailmap: add aliases for Dmitry Safonov
I don't work for Virtuozzo or Samsung anymore and I've noticed that they
have started sending annoying html email-replies.

And I prioritize my personal emails over work email box, so while at it
add an entry for Arista too - so I can reply faster when needed.

Link: http://lkml.kernel.org/r/20190827220346.11123-1-dima@arista.com
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Gustavo A. R. Silva 14108b9131 mm/z3fold.c: fix lock/unlock imbalance in z3fold_page_isolate
Fix lock/unlock imbalance by unlocking *zhdr* before return.

Addresses Coverity ID 1452811 ("Missing unlock")

Link: http://lkml.kernel.org/r/20190826030634.GA4379@embeddedor
Fixes: d776aaa989 ("mm/z3fold.c: fix race between migration and destruction")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Roman Gushchin b4c46484dc mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"
Commit 766a4c19d8 ("mm/memcontrol.c: keep local VM counters in sync
with the hierarchical ones") effectively decreased the precision of
per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.

That's good for displaying in memory.stat, but brings a serious
regression into the reclaim process.

One issue I've discovered and debugged is the following:
lruvec_lru_size() can return 0 instead of the actual number of pages in
the lru list, preventing the kernel from reclaiming the last remaining
pages.  The result is yet another flood of dying memory cgroups.  The
opposite also happens: scanning an empty lru list is a waste of cpu time.

Also, inactive_list_is_low() can return incorrect values, preventing the
active lru from being scanned and freed.  It can fail both because the
sizes of the active and inactive lists are inaccurate, and because the number
of workingset refaults isn't precise.  In other words, the result is
pretty random.

I'm not sure if using the approximate number of slab pages in
count_shadow_nodes() is acceptable, but the issues described above are
enough to partially revert the patch.

Let's keep per-memcg vmstat_local batched (they are only used for
displaying stats to the userspace), but keep lruvec stats precise.  This
change fixes the dead memcg flooding on my setup.

Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com
Fixes: 766a4c19d8 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Andrew Morton 441e254cd4 mm/zsmalloc.c: fix build when CONFIG_COMPACTION=n
Fixes: 701d678599 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool")
Link: http://lkml.kernel.org/r/201908251039.5oSbEEUT%25lkp@intel.com
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Roman Gushchin bee07b33db mm: memcontrol: flush percpu slab vmstats on kmem offlining
I've noticed that the "slab" value in memory.stat is sometimes 0, even
if some children memory cgroups have a non-zero "slab" value.  The
following investigation showed that this is the result of the kmem_cache
reparenting in combination with the per-cpu batching of slab vmstats.

At offlining, some vmstat values may be left in the percpu cache, not
being propagated upwards by the cgroup hierarchy.  This means that
stats on ancestor levels are lower than the actual values.  Later, when
slab pages are released, the precise number of pages is subtracted on
the parent level, making the value negative.  We don't show negative
values; 0 is printed instead.

To fix this issue, let's flush percpu slab memcg and lruvec stats on
memcg offlining.  This guarantees that numbers on all ancestor levels
are accurate and match the actual number of outstanding slab pages.

Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com
Fixes: fb2f2b0adb ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
David S. Miller c3d7a089f9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Spurious warning when loading rules using the physdev match,
   from Todd Seidelmann.

2) Fix FTP conntrack helper debugging output, from Thomas Jarosch.

3) Restore per-netns nf_conntrack_{acct,helper,timeout} sysctl knobs,
   from Florian Westphal.

4) Clear skbuff timestamp from the flowtable datapath, also from Florian.

5) Fix incorrect byteorder of NFT_META_BRI_IIFVPROTO, from wenxu.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 17:50:10 -07:00
David S. Miller 94880a5b2e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-08-31

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix 32-bit zero-extension during constant blinding which
   has been causing a regression on ppc64, from Naveen.

2) Fix a latency bug in nfp driver when updating stack index
   register, from Jiong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 17:39:37 -07:00
Michael Chan e72cb7d624 bnxt_en: Fix compile error regression with CONFIG_BNXT_SRIOV not set.
Add a new function bnxt_get_registered_vfs() to handle the work
of getting the number of registered VFs under #ifdef CONFIG_BNXT_SRIOV.
The main code will call this function and will always work correctly
whether or not CONFIG_BNXT_SRIOV is set.
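
A sketch of the pattern (only the function name comes from the text
above; the bodies are placeholders):

  #ifdef CONFIG_BNXT_SRIOV
  static int bnxt_get_registered_vfs(struct bnxt *bp)
  {
          int n = 0;

          /* the real implementation queries the registered VF count */
          return n;
  }
  #else
  static int bnxt_get_registered_vfs(struct bnxt *bp)
  {
          return 0;       /* no VFs when SRIOV support is compiled out */
  }
  #endif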

Fixes: 230d1f0de7 ("bnxt_en: Handle firmware reset.")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 17:38:24 -07:00
Masahiro Yamada 909548d6c5 riscv: add arch/riscv/Kbuild
Use the standard obj-y form to specify the sub-directories under
arch/riscv/. No functional change intended.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-08-30 17:34:00 -07:00
Al Viro f2683bd8d5 [PATCH] fix d_absolute_path() interplay with fsmount()
stuff in anon namespace should be treated as unattached.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30 19:31:09 -04:00
Daniel Borkmann bdb15a29cc Merge branch 'bpf-xdp-unaligned-chunk'
Kevin Laatz says:

====================
This patch set adds the ability to use unaligned chunks in the XDP umem.

Currently, all chunk addresses passed to the umem are masked to be chunk
size aligned (max is PAGE_SIZE). This limits where we can place chunks
within the umem as well as limiting the packet sizes that are supported.

The changes in this patch set remove these restrictions, allowing XDP to
be more flexible in where it can place a chunk within a umem. By relaxing
where the chunks can be placed, it allows us to use an arbitrary buffer
size and place that wherever we have a free address in the umem. These
changes add the ability to support arbitrary frame sizes up to 4k
(PAGE_SIZE) and make it easy to integrate with other existing frameworks
that have their own memory management systems, such as DPDK.
In DPDK, for example, there is already support for AF_XDP with zero-copy.
However, with this patch set the integration will be much more seamless.
You can find the DPDK AF_XDP driver at:
https://git.dpdk.org/dpdk/tree/drivers/net/af_xdp

Since we are now dealing with arbitrary frame sizes, we also need to
update how we pass around addresses. Currently, the addresses can simply be
masked to 2k to get back to the original address. This becomes less trivial
when using frame sizes that are not a power of 2. This patch set
modifies the Rx/Tx descriptor format to use the upper 16-bits of the addr
field for an offset value, leaving the lower 48-bits for the address (this
leaves us with 256 Terabytes, which should be enough!). We only need to use
the upper 16-bits to store the offset when running in unaligned mode.
Rather than adding the offset (headroom etc) to the address, we will store
it in the upper 16-bits of the address field. This way, we can easily add
the offset to the address where we need it, using some bit manipulation and
addition, and we can also easily get the original address wherever we need
it (for example in i40e_zca_free) by simply masking to get the lower
48-bits of the address field.
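
A sketch of the encoding (the two mask macros match the uapi additions
in this series; the helper names are illustrative):

  #define XSK_UNALIGNED_BUF_OFFSET_SHIFT  48
  #define XSK_UNALIGNED_BUF_ADDR_MASK \
          ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)

  /* Lower 48 bits: the original base address. */
  static __u64 xsk_base_addr(__u64 addr)
  {
          return addr & XSK_UNALIGNED_BUF_ADDR_MASK;
  }

  /* Upper 16 bits: the offset (headroom etc.); add it where needed. */
  static __u64 xsk_addr_with_offset(__u64 addr)
  {
          return xsk_base_addr(addr) +
                 (addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);
  }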

The patch set was tested with the following set up:
  - Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
  - Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
  - Driver: i40e
  - Application: xdpsock with l2fwd (single interface)
  - Turbo disabled in BIOS

There are no changes to performance before and after these patches for SKB
mode and Copy mode. Zero-copy mode saw a performance degradation of ~1.5%.

This patch set has been applied against
commit 0bb52b0dfc ("tools: bpftool: add 'bpftool map freeze' subcommand")

Structure of the patch set:

Patch 1:
  - Remove unnecessary masking and headroom addition during zero-copy Rx
    buffer recycling in i40e. This change is required in order for the
    buffer recycling to work in the unaligned chunk mode.

Patch 2:
  - Remove unnecessary masking and headroom addition during
    zero-copy Rx buffer recycling in ixgbe. This change is required in
    order for the  buffer recycling to work in the unaligned chunk mode.

Patch 3:
  - Add infrastructure for unaligned chunks. Since we are dealing with
    unaligned chunks that could potentially cross a physical page boundary,
    we add checks to keep track of that information. We can later use this
    information to correctly handle buffers that are placed at an address
    where they cross a page boundary.  This patch also modifies the
    existing Rx and Tx functions to use the new descriptor format. To
    handle addresses correctly, we need to mask appropriately based on
    whether we are in aligned or unaligned mode.

Patch 4:
  - This patch updates the i40e driver to make use of the new descriptor
    format.

Patch 5:
  - This patch updates the ixgbe driver to make use of the new descriptor
    format.

Patch 6:
  - This patch updates the mlx5e driver to make use of the new descriptor
    format. These changes are required to handle the new descriptor format
    and for unaligned chunks support.

Patch 7:
  - This patch allows XSK frames smaller than page size in the mlx5e
    driver. Relax the requirements to the XSK frame size to allow it to be
    smaller than a page and even not a power of two. The current
    implementation can work in this mode, both with Striding RQ and without
    it.

Patch 8:
  - Add flags for umem configuration to libbpf. Since we increase the size
    of the struct by adding flags, we also need to add the ABI versioning
    in this patch.

Patch 9:
  - Modify xdpsock application to add a command line option for
    unaligned chunks

Patch 10:
  - Since we can now run the application in unaligned chunk mode, we need
    to make sure we recycle the buffers appropriately.

Patch 11:
  - Adds hugepage support to the xdpsock application

Patch 12:
  - Documentation update to include the unaligned chunk scenario. We need
    to explicitly state that the incoming addresses are only masked in the
    aligned chunk mode and not the unaligned chunk mode.

v2:
  - fixed checkpatch issues
  - fixed Rx buffer recycling for unaligned chunks in xdpsock
  - removed unused defines
  - fixed how chunk_size is calculated in xsk_diag.c
  - added some performance numbers to cover letter
  - modified descriptor format to make it easier to retrieve original
    address
  - removed patch adding off_t off to the zero copy allocator. This is no
    longer needed with the new descriptor format.

v3:
  - added patch for mlx5 driver changes needed for unaligned chunks
  - moved offset handling to new helper function
  - changed value used for the umem chunk_mask. Now using the new
    descriptor format to save us doing the calculations in a number of
    places meaning more of the code is left unchanged while adding
    unaligned chunk support.

v4:
  - reworked the next_pg_contig field in the xdp_umem_page struct. We now
    use the low 12 bits of the addr for flags rather than adding an extra
    field in the struct.
  - modified unaligned chunks flag define
  - fixed page_start calculation in __xsk_rcv_memcpy().
  - move offset handling to the xdp_umem_get_* functions
  - modified the len field in xdp_umem_reg struct. We now use 16 bits from
    this for the flags field.
  - fixed headroom addition to handle in the mlx5e driver
  - other minor changes based on review comments

v5:
  - Added ABI versioning in the libbpf patch
  - Removed bitfields in the xdp_umem_reg struct. Adding new flags field.
  - Added accessors for getting addr and offset.
  - Added helper function for adding the offset to the addr.
  - Fixed conflicts with 'bpf-af-xdp-wakeup' which was merged recently.
  - Fixed typo in mlx driver patch.
  - Moved libbpf patch to later in the set (7/11, just before the sample
    app changes)

v6:
  - Added support for XSK frames smaller than page in mlx5e driver (Maxim
    Mikityanskiy <maximmi@mellanox.com).
  - Fixed offset handling in xsk_generic_rcv.
  - Added check for base address in xskq_is_valid_addr_unaligned.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:27 +02:00
Kevin Laatz d57f172c99 doc/af_xdp: include unaligned chunk case
With the addition of unaligned chunks mode, the documentation needs to be
updated to indicate that the incoming addr to the fill ring will only be
masked if the user application is run in the aligned chunk mode. This patch
also adds a line to explicitly indicate that the incoming addr will not be
masked if running the user application in the unaligned chunk mode.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:27 +02:00
Kevin Laatz 3945b37a97 samples/bpf: use hugepages in xdpsock app
This patch modifies xdpsock to use mmap instead of posix_memalign. With
this change, we can use hugepages when running the application in unaligned
chunks mode. Using hugepages makes it more likely that we have physically
contiguous memory, which supports the unaligned chunk mode better.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:27 +02:00
Kevin Laatz 03895e63ff samples/bpf: add buffer recycling for unaligned chunks to xdpsock
This patch adds buffer recycling support for unaligned buffers. Since we
don't mask the addr to 2k at umem_reg in unaligned mode, we need to make
sure we give back the correct (original) addr to the fill queue. We achieve
this using the new descriptor format and associated masks. The new format
uses the upper 16-bits for the offset and the lower 48-bits for the addr.
Since we have a field for the offset, we no longer need to modify the
actual address. As such, all we have to do to get back the original address
is mask for the lower 48 bits (i.e. strip the offset and we get the address
on its own).

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz c543f54698 samples/bpf: add unaligned chunks mode support to xdpsock
This patch adds support for the unaligned chunks mode. The addition of the
unaligned chunks option will allow users to run the application with more
relaxed chunk placement in the XDP umem.

Unaligned chunks mode can be used with the '-u' or '--unaligned' command
line options.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz 10d30e3017 libbpf: add flags to umem config
This patch adds a 'flags' field to the umem_config and umem_reg structs.
This will allow for more options to be added for configuring umems.

The first use for the flags field is to add a flag for unaligned chunks
mode. These flags can either be user-provided or filled with a default.

Since we change the size of the xsk_umem_config struct, we need to version
the ABI. This patch includes the ABI versioning for xsk_umem__create. The
Makefile was also updated to handle multiple function versions in
check-abi.
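
The extended config looks roughly like this (field layout per libbpf's
xsk.h at the time; treat it as a sketch):

  struct xsk_umem_config {
          __u32 fill_size;
          __u32 comp_size;
          __u32 frame_size;
          __u32 frame_headroom;
          __u32 flags;    /* new, e.g. XDP_UMEM_UNALIGNED_CHUNK_FLAG */
  };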

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Maxim Mikityanskiy 282c0c798f net/mlx5e: Allow XSK frames smaller than a page
Relax the requirements to the XSK frame size to allow it to be smaller
than a page and even not a power of two. The current implementation can
work in this mode, both with Striding RQ and without it.

The code that checks `mtu + headroom <= XSK frame size` is modified
accordingly. Any frame size between 2048 and PAGE_SIZE is accepted.

Functions that worked with pages only now work with XSK frames, even if
their size is different from PAGE_SIZE.

With XSK queues, regardless of the frame size, Striding RQ uses the
stride size of PAGE_SIZE, and UMR MTTs are posted using starting
addresses of frames, but PAGE_SIZE as page size. MTU guarantees that no
packet data will overlap with other frames. UMR MTT size is made equal
to the stride size of the RQ, because UMEM frames may come in random
order, and we need to handle them one by one. PAGE_SIZE is just a power
of two that is bigger than any allowed XSK frame size, and also it
doesn't require making additional changes to the code.
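
A sketch of the accepted-size check implied above (bounds taken from the
text; the function name is illustrative):

  static bool xsk_frame_size_ok(u32 frame_size, u32 mtu, u32 headroom)
  {
          return frame_size >= 2048 && frame_size <= PAGE_SIZE &&
                 mtu + headroom <= frame_size;
  }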

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz beb3e4b295 mlx5e: modify driver for handling offsets
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz d8c3061e5e ixgbe: modify driver for handling offsets
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz 2f86c806a8 i40e: modify driver for handling offsets
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz c05cd36458 xsk: add support to allow unaligned chunk placement
Currently, addresses are chunk size aligned. This means, we are very
restricted in terms of where we can place chunk within the umem. For
example, if we have a chunk size of 2k, then our chunks can only be placed
at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).

This patch introduces the ability to use unaligned chunks. With these
changes, we are no longer bound to having to place chunks at a 2k (or
whatever your chunk size is) interval. Since we are no longer dealing with
aligned chunks, they can now cross page boundaries. Checks for page
contiguity have been added in order to keep track of which pages are
followed by a physically contiguous page.
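
For contrast, the aligned-mode restriction being lifted is just a mask
(illustrative, not the actual xsk code):

  /* Aligned mode: every addr is rounded down to a chunk boundary, so
   * with a 2k chunk size only 0, 2k, 4k, ... are reachable. */
  static __u64 aligned_chunk_base(__u64 addr, __u64 chunk_size)
  {
          return addr & ~(chunk_size - 1);  /* chunk_size: power of two */
  }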

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz b35a2d3e89 ixgbe: simplify Rx buffer recycle
Currently, the dma, addr and handle are modified when we reuse Rx buffers
in zero-copy mode. However, this is not required as the inputs to the
function are copies, not the original values themselves. As we use the
copies within the function, we can use the original 'obi' values
directly without having to mask and add the headroom.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz 10912fc9fa i40e: simplify Rx buffer recycle
Currently, the dma, addr and handle are modified when we reuse Rx buffers
in zero-copy mode. However, this is not required as the inputs to the
function are copies, not the original values themselves. As we use the
copies within the function, we can use the original 'old_bi' values
directly without having to mask and add the headroom.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Masanari Iida 1c6d6e021c selftests/bpf: Fix a typo in test_offload.py
This patch fixes a spelling typo in test_offload.py.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:00:28 +02:00
Petar Penkov 0741be358d bpf: fix error check in bpf_tcp_gen_syncookie
If a SYN cookie is not issued by tcp_v#_gen_syncookie, then the return
value will be exactly 0, rather than <= 0. Let's change the check to
reflect that, especially since mss is an unsigned value and cannot be
negative.
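
An illustrative before/after of the check ('mss' stands for the helper's
unsigned return value; not necessarily the exact source):

  if (mss <= 0)   /* before: the '< 0' half is dead for an unsigned value */
          return -ENOENT;

  if (mss == 0)   /* after: failure is signalled by exactly 0 */
          return -ENOENT;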

Fixes: 70d6624431 ("bpf: add bpf_tcp_gen_syncookie helper")
Reported-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Petar Penkov <ppenkov@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 00:58:06 +02:00
Daniel Borkmann 736a55309d Merge branch 'bpf-nfp-map-op-cache'
Jakub Kicinski says:

====================
This set adds a small batching and cache mechanism to the driver.
Map dumps require two operations per element - get next, and
lookup. Each of those needs a round trip to the device, and on
a loaded system scheduling out and in of the dumping process.
This set makes the driver request a number of entries at the same
time, and if no operation which would modify the map happens
from the host side those entries are used to serve lookup
requests for up to 250us, at which point they are considered
stale.

This set has been measured to provide almost 4x dumping speed
improvement, Jaco says:

OLD dump times
    500 000 elements: 26.1s
  1 000 000 elements: 54.5s

NEW dump times
    500 000 elements: 7.6s
  1 000 000 elements: 16.5s
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 00:49:05 +02:00
Jakub Kicinski f24e29099f nfp: bpf: add simple map op cache
Each get_next and lookup call requires a round trip to the device.
However, the device is capable of giving us a few entries back,
instead of just one.

In this patch we ask for a small yet reasonable number of entries
(4) on every get_next call, and on subsequent get_next/lookup calls
check this little cache for a hit. The cache is only kept for 250us,
and is invalidated on every operation which may modify the map
(e.g. delete or update call). Note that operations may be performed
simultaneously, so we have to keep track of operations in flight.
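
A hedged sketch of the staleness test described above (the struct and
field names are illustrative, not the actual nfp code):

  #define NFP_MAP_CACHE_LIFETIME_NS       (250 * NSEC_PER_USEC)

  struct nfp_map_cache {
          bool valid;
          u64 filled_at;          /* ktime_get_ns() when filled */
  };

  static bool nfp_map_cache_usable(const struct nfp_map_cache *c)
  {
          return c->valid &&
                 ktime_get_ns() - c->filled_at < NFP_MAP_CACHE_LIFETIME_NS;
  }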

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 00:49:05 +02:00
Jakub Kicinski bc2796db5a nfp: bpf: rework MTU checking
If the control channel MTU is too low to support map operations, a warning
will be printed. This is not enough; we want to make sure the probe fails
in such a scenario, as this would clearly be a faulty configuration.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 00:49:05 +02:00
Daniel Borkmann c5a2c734b4 Merge branch 'bpf-bpftool-build-improvements'
Quentin Monnet says:

====================
This set attempts to make it easier to build bpftool, in particular when
passing a specific output directory. This is a follow-up to the
conversation held last month by Lorenz, Ilya and Jakub [0].

The first patch is a minor fix to bpftool's Makefile, regarding the
retrieval of kernel version (which currently prints a non-relevant make
warning on some invocations).

Second patch improves the Makefile commands to support more "make"
invocations, or to fix building with custom output directory. On Jakub's
suggestion, a script is also added to BPF selftests in order to keep track
of the supported build variants.

Building bpftool with "make tools/bpf" from the top of the repository
generates files in "libbpf/" and "feature/" directories under tools/bpf/
and tools/bpf/bpftool/. The third patch ensures such directories are taken
care of on "make clean", and add them to the relevant .gitignore files.

Finally, the fourth patch is a slightly modified version of Ilya's fix
regarding libbpf.a appearing twice on the linking command for bpftool.

[0] https://lore.kernel.org/bpf/CACAyw9-CWRHVH3TJ=Tke2x8YiLsH47sLCijdp=V+5M836R9aAA@mail.gmail.com/

v2:
- Return error from check script if one of the make invocations returns
  non-zero (even if binary is successfully produced).
- Run "make clean" from bpf/ and not only bpf/bpftool/ in that same script,
  when relevant.
- Add a patch to clean up generated "feature/" and "libbpf/" directories.
====================

Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 00:38:17 +02:00