Commit Graph

873808 Commits (d2cd795c4ece1a24fda170c35eeb4f17d9826cbb)

Author SHA1 Message Date
Dave Chinner 756c6f0f7e xfs: reverse search directory freespace indexes
When a directory is growing rapidly, new blocks tend to get added at
the end of the directory. These end up at the end of the freespace
index, and when the directory gets large finding these new
freespaces gets expensive. The code does a linear search across the
freespace index from the first block in the directory to the last,
meaning the newly added space is the last index searched.

Instead, do a reverse order index search, starting from the last
block and index in the freespace index. This makes most lookups for
free space on rapidly growing directories O(1) instead of O(N), but
should not have any impact on random insert workloads because the
average search length is the same regardless of which end of the
array we start at.
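
As a sketch, the idea is just a backwards linear scan over the bests
array (illustrative C only -- the names and the 0xffff "unused" marker
are simplifications, not the actual xfs code):

  static int find_free_rev(const unsigned short *bests, int nents,
                           unsigned short needed)
  {
          int i;

          /* Scan from the highest index down: blocks appended to a
           * rapidly growing directory are found first. */
          for (i = nents - 1; i >= 0; i--)
                  if (bests[i] != 0xffff && bests[i] >= needed)
                          return i;
          return -1;      /* no block has enough free space */
  }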

The result is a major improvement in large directory grow rates:

		create time(sec) / rate (files/s)
 File count      vanilla          Prev commit         Patched
   10k         0.41 / 24.3k      0.42 / 23.8k      0.41 / 24.3k
   20k         0.74 / 27.0k      0.76 / 26.3k      0.75 / 26.7k
  100k         3.81 / 26.4k      3.47 / 28.8k      3.27 / 30.6k
  200k         8.58 / 23.3k      7.19 / 27.8k      6.71 / 29.8k
    1M        85.69 / 11.7k     48.53 / 20.6k     37.67 / 26.5k
    2M       280.31 /  7.1k    130.14 / 15.3k     79.55 / 25.2k
   10M      3913.26 /  2.5k                      552.89 / 18.1k

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner 610125ab1e xfs: speed up directory bestfree block scanning
When running a "create millions inodes in a directory" test
recently, I noticed we were spending a huge amount of time
converting freespace block headers from disk format to in-memory
format:

 31.47%  [kernel]  [k] xfs_dir2_node_addname
 17.86%  [kernel]  [k] xfs_dir3_free_hdr_from_disk
  3.55%  [kernel]  [k] xfs_dir3_free_bests_p

We shouldn't be hitting the best free block scanning code so hard
when doing sequential directory creates, and it turns out there's
a highly suboptimal loop searching the best free array in
the freespace block - it decodes the block header before checking
each entry inside a loop, instead of decoding the header once before
running the entry search loop.
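
Conceptually the fix is plain loop hoisting; a hedged sketch, where
decode_free_hdr() and the struct fields are illustrative rather than
the actual xfs names:

  /* Before: the header was decoded on every loop iteration. */
  for (i = 0; i < nvalid; i++) {
          decode_free_hdr(&hdr, disk_hdr);        /* repeated work */
          if (hdr.bests[i] >= needed)
                  break;
  }

  /* After: decode the header once, then run the search loop. */
  decode_free_hdr(&hdr, disk_hdr);
  for (i = 0; i < hdr.nvalid; i++)
          if (hdr.bests[i] >= needed)
                  break;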

This makes a massive difference to create rates. Profile now looks
like this:

  13.15%  [kernel]  [k] xfs_dir2_node_addname
   3.52%  [kernel]  [k] xfs_dir3_leaf_check_int
   3.11%  [kernel]  [k] xfs_log_commit_cil

And the wall time/average file create rate differences are
just as stark:

		create time(sec) / rate (files/s)
 File count      vanilla            patched
   10k         0.41 / 24.3k      0.42 / 23.8k
   20k         0.74 / 27.0k      0.76 / 26.3k
  100k         3.81 / 26.4k      3.47 / 28.8k
  200k         8.58 / 23.3k      7.19 / 27.8k
    1M        85.69 / 11.7k     48.53 / 20.6k
    2M       280.31 /  7.1k    130.14 / 15.3k

The larger the directory, the bigger the performance improvement.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner 0e822255f9 xfs: factor free block index lookup from xfs_dir2_node_addname_int()
Simplify the logic in xfs_dir2_node_addname_int() by factoring out
the free block index lookup code that finds a block with enough free
space for the entry to be added. The code that is moved gets a major
cleanup at the same time, but there is no algorithm change here.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner a07258a695 xfs: factor data block addition from xfs_dir2_node_addname_int()
Factor out the code that adds a data block to a directory from
xfs_dir2_node_addname_int(). This makes the code flow cleaner and
more obvious and provides clear isolation of upcoming optimisations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:57 -07:00
Dave Chinner aee7754bbe xfs: move xfs_dir2_addname()
This gets rid of the need for a forward declaration of the static
function xfs_dir2_addname_int() and readies the code for factoring
of xfs_dir2_addname_int().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-30 22:43:56 -07:00
Darrick J. Wong 39ee2239a5 xfs: remove all *_ITER_CONTINUE values
Iterator functions already use 0 to signal "continue iterating", so get
rid of the #defines and just do it directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-30 22:43:56 -07:00
Christoph Hellwig f3b17320db openrisc: map as uncached in ioremap
Openrisc is the only architecture not mapping ioremap as uncached,
which has been the default since the Linux 2.6.x days.  Switch it
over to implement uncached semantics by default.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stafford Horne <shorne@gmail.com>
2019-08-31 11:57:53 +09:00
Stafford Horne eabe7e9a21 or1k: dts: Add ethoc device to SMP devicetree
This patch adds the ethoc device configuration to the OpenRISC basic SMP
device tree config.  This was tested with qemu.

Signed-off-by: Stafford Horne <shorne@gmail.com>
2019-08-31 11:56:19 +09:00
Dan Williams c312ef1763 libata/ahci: Drop PCS quirk for Denverton and beyond
The Linux ahci driver has historically implemented a configuration fixup
for platforms / platform-firmware that fails to enable the ports prior
to OS hand-off at boot. The fixup was originally implemented way back
before ahci moved from drivers/scsi/ to drivers/ata/, and was updated in
2007 via commit 49f2909039 "ahci: update PCS programming". The quirk
sets a port-enable bitmap in the PCS register at offset 0x92.

This quirk could be applied generically up until the arrival of the
Denverton (DNV) platform. The DNV AHCI controller architecture supports
more than 6 ports and along with that the PCS register location and
format were updated to allow for more possible ports in the bitmap. DNV
AHCI expands the register to 32-bits and moves it to offset 0x94.

As it stands there are no known problem reports with existing Linux
trying to set bits at offset 0x92, which indicates that the quirk is not
applicable. Likely it is not applicable on a wider range of platforms,
but it is difficult to discern which platforms if any still depend on
the quirk.

Rather than trying to fix the PCS quirk to consider the DNV register
layout, require explicit opt-in. The assumption is that the OS driver
need not touch this register, and platforms can be added with a new
board_ahci_pcs7 board-id when / if problematic platforms are found in the
future. The logic in ahci_intel_pcs_quirk() looks for all Intel AHCI
instances with "legacy" board-ids and otherwise skips the quirk if the
board was matched by class-code.
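
For illustration only, the two layouts described above could be
programmed along these lines (a sketch using the offsets from the text;
not the driver's actual code):

  #define PCS_LEGACY_OFF  0x92    /* 16-bit port-enable bitmap, pre-DNV */
  #define PCS_DNV_OFF     0x94    /* 32-bit port-enable bitmap on DNV */

  static void pcs_enable_ports(struct pci_dev *pdev, u32 port_map, bool dnv)
  {
          if (dnv) {
                  u32 pcs;

                  pci_read_config_dword(pdev, PCS_DNV_OFF, &pcs);
                  pci_write_config_dword(pdev, PCS_DNV_OFF, pcs | port_map);
          } else {
                  u16 pcs;

                  pci_read_config_word(pdev, PCS_LEGACY_OFF, &pcs);
                  pci_write_config_word(pdev, PCS_LEGACY_OFF,
                                        pcs | (u16)port_map);
          }
  }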

Reported-by: Stephen Douthit <stephend@silicom-usa.com>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Stephen Douthit <stephend@silicom-usa.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-30 20:53:42 -06:00
Denis Efremov 974ceb21fc udp: Remove unlikely() from IS_ERR*() condition
"unlikely(IS_ERR_OR_NULL(x))" is excessive. IS_ERR_OR_NULL() already uses
unlikely() internally.
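
The change is mechanical; an illustrative before/after (the variable is
made up, not the exact udp source):

  if (unlikely(IS_ERR_OR_NULL(skb)))      /* before: redundant unlikely() */
          return -ENOENT;

  if (IS_ERR_OR_NULL(skb))                /* after: the macro already
                                           * contains unlikely() */
          return -ENOENT;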

Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 19:49:37 -07:00
Denis Efremov 7cf92ccb85 net/mlx5e: Remove unlikely() from WARN*() condition
"unlikely(WARN_ON_ONCE(x))" is excessive. WARN_ON_ONCE() already uses
unlikely() internally.

Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: Boris Pismenny <borisp@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netdev@vger.kernel.org
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 19:49:03 -07:00
Stafford Horne ae2930583c or1k: dts: Fix ethoc network configuration in or1ksim devicetree
This fixes several issues with the ethoc network device config.

First off, the compatible property used an obsolete compatibility
string; this caused the initialization to be skipped.  Next, the
register map was not given enough space to allocate ring descriptors,
which caused module initialization to abort.  Finally, we need to mark
this device as big-endian as needed by openrisc.

I tested this in qemu; the setup is documented on the qemu wiki:

  https://wiki.qemu.org/Documentation/Platforms/OpenRISC

Signed-off-by: Stafford Horne <shorne@gmail.com>
2019-08-31 11:46:59 +09:00
Linus Torvalds e0f14b8ca3 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "Fix a potential crash in the ccp driver"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: ccp - Ignore unconfigured CCP device on suspend/resume
2019-08-30 18:56:08 -07:00
Tejun Heo 0feacaa216 writeback: don't access page->mapping directly in track_foreign_dirty TP
page->mapping may encode different values and page_mapping() should
always be used to access the mapping pointer.  The track_foreign_dirty
tracepoint was incorrectly accessing page->mapping directly.  Use
page_mapping() instead.  Also, add NULL checks while at it.
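
Illustratively, the tracepoint assignment goes from the raw field to
the accessor plus a NULL check (a sketch, not the exact TP code):

  struct address_space *mapping = page_mapping(page); /* not page->mapping */
  struct inode *inode = mapping ? mapping->host : NULL;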

Fixes: 3a8e9ac89e ("writeback: add tracepoints for cgroup foreign writebacks")
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-30 19:54:28 -06:00
Linus Torvalds ab9bb6318b Partially revert "kfifo: fix kfifo_alloc() and kfifo_init()"
Commit dfe2a77fd2 ("kfifo: fix kfifo_alloc() and kfifo_init()") made
the kfifo code round the number of elements up.  That was good for
__kfifo_alloc(), but it's actually wrong for __kfifo_init().

The difference? __kfifo_alloc() will allocate the rounded-up number of
elements, but __kfifo_init() uses an allocation done by the caller.  We
can't just say "use more elements than the caller allocated", and have
to round down.
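
In terms of the helpers in include/linux/log2.h, the distinction amounts
to this (simplified sketch, not the exact kfifo code):

  /* __kfifo_alloc(): we do the allocation ourselves, so rounding the
   * element count up to a power of two is safe. */
  size = roundup_pow_of_two(size);

  /* __kfifo_init(): the caller already allocated the buffer, so we can
   * only round down -- we must not use elements that don't exist. */
  size = rounddown_pow_of_two(size);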

The good news? All the normal cases will be using power-of-two arrays
anyway, and most users of kfifo's don't use kfifo_init() at all, but one
of the helper macros to declare a KFIFO that enforce the proper
power-of-two behavior.  But it looks like at least ibmvscsis might be
affected.

The bad news? Will Deacon refers to an old thread and points out
that the memory ordering in kfifo's is questionable.  See

  https://lore.kernel.org/lkml/20181211034032.32338-1-yuleixzhang@tencent.com/

for more.

Fixes: dfe2a77fd2 ("kfifo: fix kfifo_alloc() and kfifo_init()")
Reported-by: laokz <laokz@foxmail.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:47:15 -07:00
Al Viro ee594bfff3 fs/namei.c: new helper - legitimize_root()
identical logic in unlazy_walk() and unlazy_child()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30 21:30:13 -04:00
Al Viro ce6595a28a kill the last users of user_{path,lpath,path_dir}()
old wrappers with few callers remaining; put them out of their misery...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30 21:30:13 -04:00
Al Viro 6b61aed06a namei.h: get the comments on LOOKUP_... in sync with reality
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30 21:30:12 -04:00
Al Viro fbb7d9d56d kill LOOKUP_NO_EVAL, don't bother including namei.h from audit.h
The former has no users left; the latter was only there to get LOOKUP_...
values to the remapper in audit_inode(), and that's an ex-parrot now.

All places that use symbols from namei.h include it either directly
or (in a few cases) via a local header, like fs/autofs/autofs_i.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30 21:29:32 -04:00
Shakeel Butt 6c1c280805 mm: memcontrol: fix percpu vmstats and vmevents flush
Instead of using raw_cpu_read(), use per_cpu() to read the actual data of
the corresponding cpu; otherwise we end up reading the current cpu's data
once for each online CPU.
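
An illustrative sketch of the difference ('counter' is a made-up percpu
variable, not the memcg code itself):

  DEFINE_PER_CPU(long, counter);

  static long sum_counter(void)
  {
          long sum = 0;
          int cpu;

          /* per_cpu() reads each cpu's copy; raw_cpu_read(counter)
           * here would re-read the current cpu's copy every time. */
          for_each_online_cpu(cpu)
                  sum += per_cpu(counter, cpu);
          return sum;
  }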

Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com
Fixes: bb65f89b7d ("mm: memcontrol: flush percpu vmevents before releasing memcg")
Fixes: c350a99ea2 ("mm: memcontrol: flush percpu vmstats before releasing memcg")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Michal Hocko d2e5fb927e mm, memcg: do not set reclaim_state on soft limit reclaim
Adric Blake has noticed[1] the following warning:

  WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
  [...]
  Call Trace:
   mem_cgroup_shrink_node+0x9b/0x1d0
   mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
   balance_pgdat+0x276/0x540
   kswapd+0x200/0x3f0
   ? wait_woken+0x80/0x80
   kthread+0xfd/0x130
   ? balance_pgdat+0x540/0x540
   ? kthread_park+0x80/0x80
   ret_from_fork+0x35/0x40
  ---[ end trace 727343df67b2398a ]---

which tells us that soft limit reclaim is about to overwrite the
reclaim_state configured up in the call chain (kswapd in this case but
the direct reclaim is equally possible).  This means that reclaim stats
would get misleading once the soft reclaim returns and another reclaim
is done.

Fix the warning by dropping set_task_reclaim_state from the soft reclaim
which is always called with reclaim_state set up.

[1] http://lkml.kernel.org/r/CAE1jjeePxYPvw1mw2B3v803xHVR_BNnz0hQUY_JDMN8ny29M6w@mail.gmail.com

Link: http://lkml.kernel.org/r/20190828071808.20410-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Adric Blake <promarbler14@gmail.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Dmitry Safonov a6c135bb1a mailmap: add aliases for Dmitry Safonov
I don't work for Virtuozzo or Samsung anymore and I've noticed that they
have started sending annoying html email-replies.

And I prioritize my personal emails over work email box, so while at it
add an entry for Arista too - so I can reply faster when needed.

Link: http://lkml.kernel.org/r/20190827220346.11123-1-dima@arista.com
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Gustavo A. R. Silva 14108b9131 mm/z3fold.c: fix lock/unlock imbalance in z3fold_page_isolate
Fix lock/unlock imbalance by unlocking *zhdr* before return.

Addresses Coverity ID 1452811 ("Missing unlock")

Link: http://lkml.kernel.org/r/20190826030634.GA4379@embeddedor
Fixes: d776aaa989 ("mm/z3fold.c: fix race between migration and destruction")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Roman Gushchin b4c46484dc mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"
Commit 766a4c19d8 ("mm/memcontrol.c: keep local VM counters in sync
with the hierarchical ones") effectively decreased the precision of
per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.

That's good for displaying in memory.stat, but brings a serious
regression into the reclaim process.

One issue I've discovered and debugged is the following:
lruvec_lru_size() can return 0 instead of the actual number of pages in
the lru list, preventing the kernel from reclaiming the last remaining
pages.  The result is yet another flood of dying memory cgroups.  The
opposite also happens: scanning an empty lru list is a waste of cpu time.

Also, inactive_list_is_low() can return incorrect values, preventing the
active lru from being scanned and freed.  It can fail both because the
sizes of the active and inactive lists are inaccurate, and because the number
of workingset refaults isn't precise.  In other words, the result is
pretty random.

I'm not sure if using the approximate number of slab pages in
count_shadow_nodes() is acceptable, but the issues described above are
enough to partially revert the patch.

Let's keep per-memcg vmstat_local batched (they are only used for
displaying stats to the userspace), but keep lruvec stats precise.  This
change fixes the dead memcg flooding on my setup.

Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com
Fixes: 766a4c19d8 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Andrew Morton 441e254cd4 mm/zsmalloc.c: fix build when CONFIG_COMPACTION=n
Fixes: 701d678599 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool")
Link: http://lkml.kernel.org/r/201908251039.5oSbEEUT%25lkp@intel.com
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
Roman Gushchin bee07b33db mm: memcontrol: flush percpu slab vmstats on kmem offlining
I've noticed that the "slab" value in memory.stat is sometimes 0, even
if some children memory cgroups have a non-zero "slab" value.  The
following investigation showed that this is the result of the kmem_cache
reparenting in combination with the per-cpu batching of slab vmstats.

At offlining, some vmstat values may be left in the percpu cache, not
being propagated upwards by the cgroup hierarchy.  This means that
stats on ancestor levels are lower than the actual values.  Later, when
slab pages are released, the precise number of pages is subtracted on
the parent level, making the value negative.  We don't show negative
values; 0 is printed instead.

To fix this issue, let's flush percpu slab memcg and lruvec stats on
memcg offlining.  This guarantees that numbers on all ancestor levels
are accurate and match the actual number of outstanding slab pages.

Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com
Fixes: fb2f2b0adb ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30 18:00:50 -07:00
David S. Miller c3d7a089f9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Spurious warning when loading rules using the physdev match,
   from Todd Seidelmann.

2) Fix FTP conntrack helper debugging output, from Thomas Jarosch.

3) Restore per-netns nf_conntrack_{acct,helper,timeout} sysctl knobs,
   from Florian Westphal.

4) Clear skbuff timestamp from the flowtable datapath, also from Florian.

5) Fix incorrect byteorder of NFT_META_BRI_IIFVPROTO, from wenxu.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 17:50:10 -07:00
David S. Miller 94880a5b2e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-08-31

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix 32-bit zero-extension during constant blinding which
   has been causing a regression on ppc64, from Naveen.

2) Fix a latency bug in nfp driver when updating stack index
   register, from Jiong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 17:39:37 -07:00
Michael Chan e72cb7d624 bnxt_en: Fix compile error regression with CONFIG_BNXT_SRIOV not set.
Add a new function bnxt_get_registered_vfs() to handle the work
of getting the number of registered VFs under #ifdef CONFIG_BNXT_SRIOV.
The main code will call this function and will always work correctly
whether or not CONFIG_BNXT_SRIOV is set.
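
A sketch of the pattern (only the function name comes from the text
above; the bodies are placeholders):

  #ifdef CONFIG_BNXT_SRIOV
  static int bnxt_get_registered_vfs(struct bnxt *bp)
  {
          int n = 0;

          /* the real implementation queries the registered VF count */
          return n;
  }
  #else
  static int bnxt_get_registered_vfs(struct bnxt *bp)
  {
          return 0;       /* no VFs when SRIOV support is compiled out */
  }
  #endif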

Fixes: 230d1f0de7 ("bnxt_en: Handle firmware reset.")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 17:38:24 -07:00
Masahiro Yamada 909548d6c5 riscv: add arch/riscv/Kbuild
Use the standard obj-y form to specify the sub-directories under
arch/riscv/. No functional change intended.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-08-30 17:34:00 -07:00
Al Viro f2683bd8d5 [PATCH] fix d_absolute_path() interplay with fsmount()
stuff in anon namespace should be treated as unattached.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30 19:31:09 -04:00
Daniel Borkmann bdb15a29cc Merge branch 'bpf-xdp-unaligned-chunk'
Kevin Laatz says:

====================
This patch set adds the ability to use unaligned chunks in the XDP umem.

Currently, all chunk addresses passed to the umem are masked to be chunk
size aligned (max is PAGE_SIZE). This limits where we can place chunks
within the umem as well as limiting the packet sizes that are supported.

The changes in this patch set remove these restrictions, allowing XDP to
be more flexible in where it can place a chunk within a umem. By relaxing
where the chunks can be placed, it allows us to use an arbitrary buffer
size and place that wherever we have a free address in the umem. These
changes add the ability to support arbitrary frame sizes up to 4k
(PAGE_SIZE) and make it easy to integrate with other existing frameworks
that have their own memory management systems, such as DPDK.
In DPDK, for example, there is already support for AF_XDP with zero-copy.
However, with this patch set the integration will be much more seamless.
You can find the DPDK AF_XDP driver at:
https://git.dpdk.org/dpdk/tree/drivers/net/af_xdp

Since we are now dealing with arbitrary frame sizes, we also need to
update how we pass around addresses. Currently, the addresses can simply be
masked to 2k to get back to the original address. This becomes less trivial
when using frame sizes that are not a power of 2. This patch set
modifies the Rx/Tx descriptor format to use the upper 16-bits of the addr
field for an offset value, leaving the lower 48-bits for the address (this
leaves us with 256 Terabytes, which should be enough!). We only need to use
the upper 16-bits to store the offset when running in unaligned mode.
Rather than adding the offset (headroom etc) to the address, we will store
it in the upper 16-bits of the address field. This way, we can easily add
the offset to the address where we need it, using some bit manipulation and
addition, and we can also easily get the original address wherever we need
it (for example in i40e_zca_free) by simply masking to get the lower
48-bits of the address field.
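
A sketch of the encoding (the two mask macros match the uapi additions
in this series; the helper names are illustrative):

  #define XSK_UNALIGNED_BUF_OFFSET_SHIFT  48
  #define XSK_UNALIGNED_BUF_ADDR_MASK \
          ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)

  /* Lower 48 bits: the original base address. */
  static __u64 xsk_base_addr(__u64 addr)
  {
          return addr & XSK_UNALIGNED_BUF_ADDR_MASK;
  }

  /* Upper 16 bits: the offset (headroom etc.); add it where needed. */
  static __u64 xsk_addr_with_offset(__u64 addr)
  {
          return xsk_base_addr(addr) +
                 (addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);
  }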

The patch set was tested with the following set up:
  - Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
  - Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
  - Driver: i40e
  - Application: xdpsock with l2fwd (single interface)
  - Turbo disabled in BIOS

There are no changes to performance before and after these patches for SKB
mode and Copy mode. Zero-copy mode saw a performance degradation of ~1.5%.

This patch set has been applied against
commit 0bb52b0dfc ("tools: bpftool: add 'bpftool map freeze' subcommand")

Structure of the patch set:

Patch 1:
  - Remove unnecessary masking and headroom addition during zero-copy Rx
    buffer recycling in i40e. This change is required in order for the
    buffer recycling to work in the unaligned chunk mode.

Patch 2:
  - Remove unnecessary masking and headroom addition during
    zero-copy Rx buffer recycling in ixgbe. This change is required in
    order for the  buffer recycling to work in the unaligned chunk mode.

Patch 3:
  - Add infrastructure for unaligned chunks. Since we are dealing with
    unaligned chunks that could potentially cross a physical page boundary,
    we add checks to keep track of that information. We can later use this
    information to correctly handle buffers that are placed at an address
    where they cross a page boundary.  This patch also modifies the
    existing Rx and Tx functions to use the new descriptor format. To
    handle addresses correctly, we need to mask appropriately based on
    whether we are in aligned or unaligned mode.

Patch 4:
  - This patch updates the i40e driver to make use of the new descriptor
    format.

Patch 5:
  - This patch updates the ixgbe driver to make use of the new descriptor
    format.

Patch 6:
  - This patch updates the mlx5e driver to make use of the new descriptor
    format. These changes are required to handle the new descriptor format
    and for unaligned chunks support.

Patch 7:
  - This patch allows XSK frames smaller than page size in the mlx5e
    driver. Relax the requirements to the XSK frame size to allow it to be
    smaller than a page and even not a power of two. The current
    implementation can work in this mode, both with Striding RQ and without
    it.

Patch 8:
  - Add flags for umem configuration to libbpf. Since we increase the size
    of the struct by adding flags, we also need to add the ABI versioning
    in this patch.

Patch 9:
  - Modify xdpsock application to add a command line option for
    unaligned chunks

Patch 10:
  - Since we can now run the application in unaligned chunk mode, we need
    to make sure we recycle the buffers appropriately.

Patch 11:
  - Adds hugepage support to the xdpsock application

Patch 12:
  - Documentation update to include the unaligned chunk scenario. We need
    to explicitly state that the incoming addresses are only masked in the
    aligned chunk mode and not the unaligned chunk mode.

v2:
  - fixed checkpatch issues
  - fixed Rx buffer recycling for unaligned chunks in xdpsock
  - removed unused defines
  - fixed how chunk_size is calculated in xsk_diag.c
  - added some performance numbers to cover letter
  - modified descriptor format to make it easier to retrieve original
    address
  - removed patch adding off_t off to the zero copy allocator. This is no
    longer needed with the new descriptor format.

v3:
  - added patch for mlx5 driver changes needed for unaligned chunks
  - moved offset handling to new helper function
  - changed value used for the umem chunk_mask. Now using the new
    descriptor format to save us doing the calculations in a number of
    places meaning more of the code is left unchanged while adding
    unaligned chunk support.

v4:
  - reworked the next_pg_contig field in the xdp_umem_page struct. We now
    use the low 12 bits of the addr for flags rather than adding an extra
    field in the struct.
  - modified unaligned chunks flag define
  - fixed page_start calculation in __xsk_rcv_memcpy().
  - move offset handling to the xdp_umem_get_* functions
  - modified the len field in xdp_umem_reg struct. We now use 16 bits from
    this for the flags field.
  - fixed headroom addition to handle in the mlx5e driver
  - other minor changes based on review comments

v5:
  - Added ABI versioning in the libbpf patch
  - Removed bitfields in the xdp_umem_reg struct. Adding new flags field.
  - Added accessors for getting addr and offset.
  - Added helper function for adding the offset to the addr.
  - Fixed conflicts with 'bpf-af-xdp-wakeup' which was merged recently.
  - Fixed typo in mlx driver patch.
  - Moved libbpf patch to later in the set (7/11, just before the sample
    app changes)

v6:
  - Added support for XSK frames smaller than page in mlx5e driver (Maxim
    Mikityanskiy <maximmi@mellanox.com).
  - Fixed offset handling in xsk_generic_rcv.
  - Added check for base address in xskq_is_valid_addr_unaligned.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:27 +02:00
Kevin Laatz d57f172c99 doc/af_xdp: include unaligned chunk case
With the addition of unaligned chunks mode, the documentation needs to be
updated to indicate that the incoming addr to the fill ring will only be
masked if the user application is run in the aligned chunk mode. This patch
also adds a line to explicitly indicate that the incoming addr will not be
masked if running the user application in the unaligned chunk mode.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:27 +02:00
Kevin Laatz 3945b37a97 samples/bpf: use hugepages in xdpsock app
This patch modifies xdpsock to use mmap instead of posix_memalign. With
this change, we can use hugepages when running the application in unaligned
chunks mode. Using hugepages makes it more likely that we have physically
contiguous memory, which supports the unaligned chunk mode better.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:27 +02:00
Kevin Laatz 03895e63ff samples/bpf: add buffer recycling for unaligned chunks to xdpsock
This patch adds buffer recycling support for unaligned buffers. Since we
don't mask the addr to 2k at umem_reg in unaligned mode, we need to make
sure we give back the correct (original) addr to the fill queue. We achieve
this using the new descriptor format and associated masks. The new format
uses the upper 16-bits for the offset and the lower 48-bits for the addr.
Since we have a field for the offset, we no longer need to modify the
actual address. As such, all we have to do to get back the original address
is mask for the lower 48 bits (i.e. strip the offset and we get the address
on its own).

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz c543f54698 samples/bpf: add unaligned chunks mode support to xdpsock
This patch adds support for the unaligned chunks mode. The addition of the
unaligned chunks option will allow users to run the application with more
relaxed chunk placement in the XDP umem.

Unaligned chunks mode can be used with the '-u' or '--unaligned' command
line options.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz 10d30e3017 libbpf: add flags to umem config
This patch adds a 'flags' field to the umem_config and umem_reg structs.
This will allow for more options to be added for configuring umems.

The first use for the flags field is to add a flag for unaligned chunks
mode. These flags can either be user-provided or filled with a default.

Since we change the size of the xsk_umem_config struct, we need to version
the ABI. This patch includes the ABI versioning for xsk_umem__create. The
Makefile was also updated to handle multiple function versions in
check-abi.
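
The extended config looks roughly like this (field layout per libbpf's
xsk.h at the time; treat it as a sketch):

  struct xsk_umem_config {
          __u32 fill_size;
          __u32 comp_size;
          __u32 frame_size;
          __u32 frame_headroom;
          __u32 flags;    /* new, e.g. XDP_UMEM_UNALIGNED_CHUNK_FLAG */
  };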

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Maxim Mikityanskiy 282c0c798f net/mlx5e: Allow XSK frames smaller than a page
Relax the requirements to the XSK frame size to allow it to be smaller
than a page and even not a power of two. The current implementation can
work in this mode, both with Striding RQ and without it.

The code that checks `mtu + headroom <= XSK frame size` is modified
accordingly. Any frame size between 2048 and PAGE_SIZE is accepted.

Functions that worked with pages only now work with XSK frames, even if
their size is different from PAGE_SIZE.

With XSK queues, regardless of the frame size, Striding RQ uses the
stride size of PAGE_SIZE, and UMR MTTs are posted using starting
addresses of frames, but PAGE_SIZE as page size. MTU guarantees that no
packet data will overlap with other frames. UMR MTT size is made equal
to the stride size of the RQ, because UMEM frames may come in random
order, and we need to handle them one by one. PAGE_SIZE is just a power
of two that is bigger than any allowed XSK frame size, and also it
doesn't require making additional changes to the code.
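
A sketch of the accepted-size check implied above (bounds taken from the
text; the function name is illustrative):

  static bool xsk_frame_size_ok(u32 frame_size, u32 mtu, u32 headroom)
  {
          return frame_size >= 2048 && frame_size <= PAGE_SIZE &&
                 mtu + headroom <= frame_size;
  }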

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz beb3e4b295 mlx5e: modify driver for handling offsets
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz d8c3061e5e ixgbe: modify driver for handling offsets
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz 2f86c806a8 i40e: modify driver for handling offsets
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz c05cd36458 xsk: add support to allow unaligned chunk placement
Currently, addresses are chunk size aligned. This means, we are very
restricted in terms of where we can place chunk within the umem. For
example, if we have a chunk size of 2k, then our chunks can only be placed
at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).

This patch introduces the ability to use unaligned chunks. With these
changes, we are no longer bound to having to place chunks at a 2k (or
whatever your chunk size is) interval. Since we are no longer dealing with
aligned chunks, they can now cross page boundaries. Checks for page
contiguity have been added in order to keep track of which pages are
followed by a physically contiguous page.
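
For contrast, the aligned-mode restriction being lifted is just a mask
(illustrative, not the actual xsk code):

  /* Aligned mode: every addr is rounded down to a chunk boundary, so
   * with a 2k chunk size only 0, 2k, 4k, ... are reachable. */
  static __u64 aligned_chunk_base(__u64 addr, __u64 chunk_size)
  {
          return addr & ~(chunk_size - 1);  /* chunk_size: power of two */
  }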

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz b35a2d3e89 ixgbe: simplify Rx buffer recycle
Currently, the dma, addr and handle are modified when we reuse Rx buffers
in zero-copy mode. However, this is not required as the inputs to the
function are copies, not the original values themselves. As we use the
copies within the function, we can use the original 'obi' values
directly without having to mask and add the headroom.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Kevin Laatz 10912fc9fa i40e: simplify Rx buffer recycle
Currently, the dma, addr and handle are modified when we reuse Rx buffers
in zero-copy mode. However, this is not required as the inputs to the
function are copies, not the original values themselves. As we use the
copies within the function, we can use the original 'old_bi' values
directly without having to mask and add the headroom.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Masanari Iida 1c6d6e021c selftests/bpf: Fix a typo in test_offload.py
This patch fixes a spelling typo in test_offload.py.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:00:28 +02:00
Petar Penkov 0741be358d bpf: fix error check in bpf_tcp_gen_syncookie
If a SYN cookie is not issued by tcp_v#_gen_syncookie, then the return
value will be exactly 0, rather than <= 0. Let's change the check to
reflect that, especially since mss is an unsigned value and cannot be
negative.
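
An illustrative before/after of the check ('mss' stands for the helper's
unsigned return value; not necessarily the exact source):

  if (mss <= 0)   /* before: the '< 0' half is dead for an unsigned value */
          return -ENOENT;

  if (mss == 0)   /* after: failure is signalled by exactly 0 */
          return -ENOENT;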

Fixes: 70d6624431 ("bpf: add bpf_tcp_gen_syncookie helper")
Reported-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Petar Penkov <ppenkov@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 00:58:06 +02:00
Daniel Borkmann 736a55309d Merge branch 'bpf-nfp-map-op-cache'
Jakub Kicinski says:

====================
This set adds a small batching and cache mechanism to the driver.
Map dumps require two operations per element - get next, and
lookup. Each of those needs a round trip to the device, and on
a loaded system scheduling out and in of the dumping process.
This set makes the driver request a number of entries at the same
time, and if no operation which would modify the map happens
from the host side those entries are used to serve lookup
requests for up to 250us, at which point they are considered
stale.

This set has been measured to provide almost 4x dumping speed
improvement, Jaco says:

OLD dump times
    500 000 elements: 26.1s
  1 000 000 elements: 54.5s

NEW dump times
    500 000 elements: 7.6s
  1 000 000 elements: 16.5s
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 00:49:05 +02:00
Jakub Kicinski f24e29099f nfp: bpf: add simple map op cache
Each get_next and lookup call requires a round trip to the device.
However, the device is capable of giving us a few entries back,
instead of just one.

In this patch we ask for a small yet reasonable number of entries
(4) on every get_next call, and on subsequent get_next/lookup calls
check this little cache for a hit. The cache is only kept for 250us,
and is invalidated on every operation which may modify the map
(e.g. delete or update call). Note that operations may be performed
simultaneously, so we have to keep track of operations in flight.
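
A hedged sketch of the staleness test described above (the struct and
field names are illustrative, not the actual nfp code):

  #define NFP_MAP_CACHE_LIFETIME_NS       (250 * NSEC_PER_USEC)

  struct nfp_map_cache {
          bool valid;
          u64 filled_at;          /* ktime_get_ns() when filled */
  };

  static bool nfp_map_cache_usable(const struct nfp_map_cache *c)
  {
          return c->valid &&
                 ktime_get_ns() - c->filled_at < NFP_MAP_CACHE_LIFETIME_NS;
  }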

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 00:49:05 +02:00
Jakub Kicinski bc2796db5a nfp: bpf: rework MTU checking
If the control channel MTU is too low to support map operations, a warning
will be printed. This is not enough; we want to make sure the probe fails
in such a scenario, as this would clearly be a faulty configuration.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 00:49:05 +02:00
Daniel Borkmann c5a2c734b4 Merge branch 'bpf-bpftool-build-improvements'
Quentin Monnet says:

====================
This set attempts to make it easier to build bpftool, in particular when
passing a specific output directory. This is a follow-up to the
conversation held last month by Lorenz, Ilya and Jakub [0].

The first patch is a minor fix to bpftool's Makefile, regarding the
retrieval of kernel version (which currently prints a non-relevant make
warning on some invocations).

Second patch improves the Makefile commands to support more "make"
invocations, or to fix building with custom output directory. On Jakub's
suggestion, a script is also added to BPF selftests in order to keep track
of the supported build variants.

Building bpftool with "make tools/bpf" from the top of the repository
generates files in "libbpf/" and "feature/" directories under tools/bpf/
and tools/bpf/bpftool/. The third patch ensures such directories are taken
care of on "make clean", and add them to the relevant .gitignore files.

Finally, the fourth patch is a slightly modified version of Ilya's fix
regarding libbpf.a appearing twice on the linking command for bpftool.

[0] https://lore.kernel.org/bpf/CACAyw9-CWRHVH3TJ=Tke2x8YiLsH47sLCijdp=V+5M836R9aAA@mail.gmail.com/

v2:
- Return error from check script if one of the make invocations returns
  non-zero (even if binary is successfully produced).
- Run "make clean" from bpf/ and not only bpf/bpftool/ in that same script,
  when relevant.
- Add a patch to clean up generated "feature/" and "libbpf/" directories.
====================

Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 00:38:17 +02:00