Commit Graph

47115 Commits (4cfad559da170508f96879bc12d039765befcf3e)

Author SHA1 Message Date
Daniel Borkmann d67b9cd28c xdp: refine xdp api with regards to generic xdp
While working on the iproute2 generic XDP frontend, I noticed that
as of right now it's possible to have native *and* generic XDP
programs loaded both at the same time for the case when a driver
supports native XDP.

The intended model for generic XDP from b5cdae3291 ("net: Generic
XDP") is, however, that only one out of the two can be present at
once which is also indicated as such in the XDP netlink dump part.
The main rationale for generic XDP is to ease accessibility (in
case a driver does not yet have XDP support) and to generically
provide a semantical model as an example for driver developers
wanting to add XDP support. The generic XDP option for an XDP
aware driver can still be useful for comparing and testing both
implementations.

However, it is not intended to have a second XDP processing stage
or layer with exactly the same functionality of the first native
stage. Only reason could be to have a partial fallback for future
XDP features that are not supported yet in the native implementation
and we probably also shouldn't strive for such fallback and instead
encourage native feature support in the first place. Given there's
currently no such fallback issue or use case, lets not go there yet
if we don't need to.

Therefore, change semantics for loading XDP and bail out if the
user tries to load a generic XDP program when a native one is
present and vice versa. Another alternative to bailing out would
be to handle the transition from one flavor to another gracefully,
but that would require to bring the device down, exchange both
types of programs, and bring it up again in order to avoid a tiny
window where a packet could hit both hooks. Given this complicates
the logic for just a debugging feature in the native case, I went
with the simpler variant.

For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae3291
and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
or just a subset of flags that were used for loading the XDP prog
is suboptimal in the long run since not all flags are useful for
dumping and if we start to reuse the same flag definitions for
load and dump, then we'll waste bit space. What we really just
want is to dump the mode for now.

Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
a program is running at the native driver layer (1). Thus, add a
mode that says that a program is running at generic XDP layer (2).
Applications will handle this fine in that older binaries will
just indicate that something is attached at XDP layer, effectively
this is similar to IFLA_XDP_FLAGS attr that we would have had
modulo the redundancy.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 21:30:57 -04:00
Daniel Borkmann 0489df9a43 xdp: add flag to enforce driver mode
After commit b5cdae3291 ("net: Generic XDP") we automatically fall
back to a generic XDP variant if the driver does not support native
XDP. Allow for an option where the user can specify that always the
native XDP variant should be selected and in case it's not supported
by a driver, just bail out.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 21:30:57 -04:00
WANG Cong 83eaddab43 ipv6/dccp: do not inherit ipv6_mc_list from parent
Like commit 657831ffc3 ("dccp/tcp: do not inherit mc_list from parent")
we should clear ipv6_mc_list etc. for IPv6 sockets too.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 12:17:02 -04:00
Linus Torvalds c70422f760 Another RDMA update from Chuck Lever, and a bunch of miscellaneous
bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZE2UeAAoJECebzXlCjuG+St8P/0vG+ps9sY012E6Wh9gy4Ev4
 BtxG/c3CtcxrbNzW+cFhdEloBGtC0VvcrKNCozJTK4LdaPYErkyRBpjgXvIggT9I
 GWY4ftpH3eJ6uByN9Okgc3/1la2poDflJO/nYhdRed3YHOnXTtx/746tu1xAnVCV
 tFtDGrbJZTprt5c3zETtdtquCUSy2aMT5ZbrdU3yWBCwQMNSIufN3an8epfB++xx
 Ct+G0HTRffcWAdYuLT0N1HKqm8pkncdNMFpm7mVw0hMCRy552G3fuj8LtkhVTvKE
 1KN3zXY4jhaUYWD5Yt6AJcpLEro65b8swYk4e9FP2TNUpCmuRdXT9cb9vE8YztxC
 8s4N23RHaEx9I6pC3OU64a2HfhiQM/oOIvjlhTBsjojXsQcqZFD1vsoSYA8Byl0w
 m9EQWqPqge4m6yEYl7uAyL6xSthbrhcU1Ks5jvNXGcWzEQj7BATnynJANsfZ+y6r
 ZoVcsRNX49m1BG+p9br+9DFffPiNFUMqxbfr73L9HRep3OsPeFKazFG0bKd3hOqA
 E6L/AnBd9soSqTuTvbisWrGWbomhtd5G/fAa1uHrWTPHMXUWCmkguiau51FNfcHu
 xcJlBBVCvUmmd5u3wF6QeiyjPs4KEBzQzsOUsWKHRxDBp6s+5PX/lHuXRBlDP+fN
 TQq0KbvBtea1OyMaRtoV
 =Rtl/
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Another RDMA update from Chuck Lever, and a bunch of miscellaneous
  bugfixes"

* tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux: (26 commits)
  nfsd: Fix up the "supattr_exclcreat" attributes
  nfsd: encoders mustn't use unitialized values in error cases
  nfsd: fix undefined behavior in nfsd4_layout_verify
  lockd: fix lockd shutdown race
  NFSv4: Fix callback server shutdown
  SUNRPC: Refactor svc_set_num_threads()
  NFSv4.x/callback: Create the callback service through svc_create_pooled
  lockd: remove redundant check on block
  svcrdma: Clean out old XDR encoders
  svcrdma: Remove the req_map cache
  svcrdma: Remove unused RDMA Write completion handler
  svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
  svcrdma: Clean up RPC-over-RDMA backchannel reply processing
  svcrdma: Report Write/Reply chunk overruns
  svcrdma: Clean up RDMA_ERROR path
  svcrdma: Use rdma_rw API in RPC reply path
  svcrdma: Introduce local rdma_rw API helpers
  svcrdma: Clean up svc_rdma_get_inv_rkey()
  svcrdma: Add helper to save pages under I/O
  svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  ...
2017-05-10 13:29:23 -07:00
Linus Torvalds 73ccb023a2 NFS client updates for Linux 4.12
Highlights include:
 
 Stable bugfixes:
 - Fix use after free in write error path
 - Use GFP_NOIO for two allocations in writeback
 - Fix a hang in OPEN related to server reboot
 - Check the result of nfs4_pnfs_ds_connect
 - Fix an rcu lock leak
 
 Features:
 - Removal of the unmaintained and unused OSD pNFS layout
 - Cleanup and removal of lots of unnecessary dprintk()s
 - Cleanup and removal of some memory failure paths now that
   GFP_NOFS is guaranteed to never fail.
 - Remove the v3-only data server limitation on pNFS/flexfiles
 
 Bugfixes:
 - RPC/RDMA connection handling bugfixes
 - Copy offload: fixes to ensure the copied data is COMMITed to disk.
 - Readdir: switch back to using the ->iterate VFS interface
 - File locking fixes from Ben Coddington
 - Various use-after-free and deadlock issues in pNFS
 - Write path bugfixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIbBAABAgAGBQJZE0KiAAoJEGcL54qWCgDy/moP93wZ+cGnN5sC+nsirqj4eUiM
 BQKKonweNQIoYRwp5B9jLTxsUMIxRasV5W3BbqEm4PUtBYXfqQ7SfLv7RboKbd4M
 RJB9PS+sjx3Fxf65mhveKziwUFLvQCQ3+we0TpUga6+7SBiGlgPKBfisk7frC0nt
 BbYBuGaWXMPxO0BnR8adNwqiGINPDSzB+8sgjiT8zkZLm4lrew2eV7TDvwVOguD+
 S2vLPGhg1F9wu8aG731MgiSNaeCgsBP6I5D29fTTD7z1DCNMQXOoHcX8k4KwwIDB
 sHRR0tVBsg+1B7WdH4y41GQ03rn3o2DHeJB5cdYGaEu4lx7CecCzt0o0dfAkNizT
 5LxbQxIHPNYMeZmP2T0oD41zQyfjKqrdRSPnXi3dPD98NwaM1Lqv+Kzb/eXzupXp
 vJ7859PQCa3KjQ1IFhwdXTmh53J1c8SzEDpzz7WX0R0saRyxeIJsm30MmdPqKu7Z
 notjsXxrTmjIhC+0vFLey1kejFDh+b0gT6UIwoMdx39VL9AM6DVL7HsrU1kEwCdf
 f8otaLcm0WoUaseF+cMtfRNGEqCMxPywwz7mEKlGiVZgyAM8VfzH+s5j6/u6ncwS
 ASwRclwwPAZN97rzl0exZxuaRwFZd7oFT1zrviPWvv+0SUPuy258J6QpolUSavgi
 Qh7f3QR65K+QX9QbO1g=
 =7Nm2
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable bugfixes:
   - Fix use after free in write error path
   - Use GFP_NOIO for two allocations in writeback
   - Fix a hang in OPEN related to server reboot
   - Check the result of nfs4_pnfs_ds_connect
   - Fix an rcu lock leak

  Features:
   - Removal of the unmaintained and unused OSD pNFS layout
   - Cleanup and removal of lots of unnecessary dprintk()s
   - Cleanup and removal of some memory failure paths now that GFP_NOFS
     is guaranteed to never fail.
   - Remove the v3-only data server limitation on pNFS/flexfiles

  Bugfixes:
   - RPC/RDMA connection handling bugfixes
   - Copy offload: fixes to ensure the copied data is COMMITed to disk.
   - Readdir: switch back to using the ->iterate VFS interface
   - File locking fixes from Ben Coddington
   - Various use-after-free and deadlock issues in pNFS
   - Write path bugfixes"

* tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (89 commits)
  pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is enabled
  NFSv4.1: Work around a Linux server bug...
  NFS append COMMIT after synchronous COPY
  NFSv4: Fix exclusive create attributes encoding
  NFSv4: Fix an rcu lock leak
  nfs: use kmap/kunmap directly
  NFS: always treat the invocation of nfs_getattr as cache hit when noac is on
  Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew
  NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
  pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits
  pNFS: Fix a typo in pnfs_generic_alloc_ds_commits
  pNFS: Fix a deadlock when coalescing writes and returning the layout
  pNFS: Don't clear the layout return info if there are segments to return
  pNFS: Ensure we commit the layout if it has been invalidated
  pNFS: Don't send COMMITs to the DSes if the server invalidated our layout
  pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path
  pNFS: Ensure we check layout validity before marking it for return
  NFS4.1 handle interrupted slot reuse from ERR_DELAY
  NFSv4: check return value of xdr_inline_decode
  nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()
  ...
2017-05-10 13:03:38 -07:00
Linus Torvalds c44b594303 virtio: fixes, cleanups, performance
A bunch of changes to virtio, most affecting virtio net.
 ptr_ring batched zeroing - first of batching enhancements
 that seems ready.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZEceWAAoJECgfDbjSjVRpg8YIAIbB1UJZkrHh/fdCQjM2O53T
 ESS62W91LBT+weYH/N79RrfnGWzDOHrCQ8Or1nAKQZx2vU1aroqYXeJ9o0liRBYr
 hGZB1/bg8obA5XkKUfME2+XClakvuXABUJbky08iDL9nILlrvIVLoUgZ9ouL0vTj
 oUv4n6jtguNFV7tk/injGNRparEVdcefX0dbPxXomx5tSeD2fOE96/Q4q2eD3f7r
 NHD4DailEJC7ndJGa6b4M9g8shkXzgEnSw+OJHpcJcxCnAzkVT94vsU7OluiDvmG
 bfdGZNb0ohDYZLbJDR8aiMkoad8bIVLyGlhqnMBiZQEOF4oiWM9UJLvp5Lq9g7A=
 =Sb7L
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
 "Fixes, cleanups, performance

  A bunch of changes to virtio, most affecting virtio net. Also ptr_ring
  batched zeroing - first of batching enhancements that seems ready."

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  s390/virtio: change maintainership
  tools/virtio: fix spelling mistake: "wakeus" -> "wakeups"
  virtio_net: tidy a couple debug statements
  ptr_ring: support testing different batching sizes
  ringtest: support test specific parameters
  ptr_ring: batch ring zeroing
  virtio: virtio_driver doc
  virtio_net: don't reset twice on XDP on/off
  virtio_net: fix support for small rings
  virtio_net: reduce alignment for buffers
  virtio_net: rework mergeable buffer handling
  virtio_net: allow specifying context for rx
  virtio: allow extra context per descriptor
  tools/virtio: fix build breakage
  virtio: add context flag to find vqs
  virtio: wrap find_vqs
  ringtest: fix an assert statement
2017-05-10 11:33:08 -07:00
Linus Torvalds de4d195308 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes are:

   - Debloat RCU headers

   - Parallelize SRCU callback handling (plus overlapping patches)

   - Improve the performance of Tree SRCU on a CPU-hotplug stress test

   - Documentation updates

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  rcu: Open-code the rcu_cblist_n_lazy_cbs() function
  rcu: Open-code the rcu_cblist_n_cbs() function
  rcu: Open-code the rcu_cblist_empty() function
  rcu: Separately compile large rcu_segcblist functions
  srcu: Debloat the <linux/rcu_segcblist.h> header
  srcu: Adjust default auto-expediting holdoff
  srcu: Specify auto-expedite holdoff time
  srcu: Expedite first synchronize_srcu() when idle
  srcu: Expedited grace periods with reduced memory contention
  srcu: Make rcutorture writer stalls print SRCU GP state
  srcu: Exact tracking of srcu_data structures containing callbacks
  srcu: Make SRCU be built by default
  srcu: Fix Kconfig botch when SRCU not selected
  rcu: Make non-preemptive schedule be Tasks RCU quiescent state
  srcu: Expedite srcu_schedule_cbs_snp() callback invocation
  srcu: Parallelize callback handling
  kvm: Move srcu_struct fields to end of struct kvm
  rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
  rcu: Use true/false in assignment to bool
  rcu: Use bool value directly
  ...
2017-05-10 10:30:46 -07:00
Linus Torvalds 26c5eaa132 The two main items are support for disabling automatic rbd exclusive
lock transfers from myself and the long awaited -ENOSPC handling series
 from Jeff.  The former will allow rbd users to take advantage of
 exclusive lock's built-in blacklist/break-lock functionality while
 staying in control of who owns the lock.  With the latter in place, we
 will abort filesystem writes on -ENOSPC instead of having them block
 indefinitely.
 
 Beyond that we've got the usual pile of filesystem fixes from Zheng,
 some refcount_t conversion patches from Elena and a patch for an
 ancient open() flags handling bug from Alexander.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZEt/kAAoJEEp/3jgCEfOLpzAIAIld0N06DuHKG2F9mHEnLeGl
 Y60BZ3Ajo32i9qPT/u9ntI99ZMlkuHcNWg6WpCCh8umbwk2eiAKRP/KcfGcWmmp9
 EHj9COCmBR9TRM1pNS1lSMzljDnxf9sQmbIO9cwMQBUya5g19O0OpApzxF1YQhCR
 V9B/FYV5IXELC3b/NH45oeDAD9oy/WgwbhQ2feTBQJmzIVJx+Je9hdhR1PH1rI06
 ysyg3VujnUi/hoDhvPTBznNOxnHx/HQEecHH8b01MkbaCgxPH88jsUK/h7PYF3Gh
 DE/sCN69HXeu1D/al3zKoZdahsJ5GWkj9Q+vvBoQJm+ZPsndC+qpgSj761n9v38=
 =vamy
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The two main items are support for disabling automatic rbd exclusive
  lock transfers from myself and the long awaited -ENOSPC handling
  series from Jeff.

  The former will allow rbd users to take advantage of exclusive lock's
  built-in blacklist/break-lock functionality while staying in control
  of who owns the lock. With the latter in place, we will abort
  filesystem writes on -ENOSPC instead of having them block
  indefinitely.

  Beyond that we've got the usual pile of filesystem fixes from Zheng,
  some refcount_t conversion patches from Elena and a patch for an
  ancient open() flags handling bug from Alexander"

* tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits)
  ceph: fix memory leak in __ceph_setxattr()
  ceph: fix file open flags on ppc64
  ceph: choose readdir frag based on previous readdir reply
  rbd: exclusive map option
  rbd: return ResponseMessage result from rbd_handle_request_lock()
  rbd: kill rbd_is_lock_supported()
  rbd: support updating the lock cookie without releasing the lock
  rbd: store lock cookie
  rbd: ignore unlock errors
  rbd: fix error handling around rbd_init_disk()
  rbd: move rbd_unregister_watch() call into rbd_dev_image_release()
  rbd: move rbd_dev_destroy() call out of rbd_dev_image_release()
  ceph: when seeing write errors on an inode, switch to sync writes
  Revert "ceph: SetPageError() for writeback pages if writepages fails"
  ceph: handle epoch barriers in cap messages
  libceph: add an epoch_barrier field to struct ceph_osd_client
  libceph: abort already submitted but abortable requests when map or pool goes full
  libceph: allow requests to return immediately on full conditions if caller wishes
  libceph: remove req->r_replay_version
  ceph: make seeky readdir more efficient
  ...
2017-05-10 08:42:33 -07:00
Linus Torvalds 50fb55d88c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix multiqueue in stmmac driver on PCI, from Andy Shevchenko.

 2) cdc_ncm doesn't actually fully zero out the padding area is
    allocates on TX, from Jim Baxter.

 3) Don't leak map addresses in BPF verifier, from Daniel Borkmann.

 4) If we randomize TCP timestamps, we have to do it everywhere
    including SYN cookies. From Eric Dumazet.

 5) Fix "ethtool -S" crash in aquantia driver, from Pavel Belous.

 6) Fix allocation size for ntp filter bitmap in bnxt_en driver, from
    Dan Carpenter.

 7) Add missing memory allocation return value check to DSA loop driver,
    from Christophe Jaillet.

 8) Fix XDP leak on driver unload in qed driver, from Suddarsana Reddy
    Kalluru.

 9) Don't inherit MC list from parent inet connection sockets, another
    syzkaller spotted gem. Fix from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
  dccp/tcp: do not inherit mc_list from parent
  qede: Split PF/VF ndos.
  qed: Correct doorbell configuration for !4Kb pages
  qed: Tell QM the number of tasks
  qed: Fix VF removal sequence
  qede: Fix XDP memory leak on unload
  net/mlx4_core: Reduce harmless SRIOV error message to debug level
  net/mlx4_en: Avoid adding steering rules with invalid ring
  net/mlx4_en: Change the error print to debug print
  drivers: net: wimax: i2400m: i2400m-usb: Use time_after for time comparison
  DECnet: Use container_of() for embedded struct
  Revert "ipv4: restore rt->fi for reference counting"
  net: mdio-mux: bcm-iproc: call mdiobus_free() in error path
  net: ethernet: ti: cpsw: adjust cpsw fifos depth for fullduplex flow control
  ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf
  net: cdc_ncm: Fix TX zero padding
  stmmac: pci: split out common_default_data() helper
  stmmac: pci: RX queue routing configuration
  stmmac: pci: TX and RX queue priority configuration
  stmmac: pci: set default number of rx and tx queues
  ...
2017-05-09 15:42:31 -07:00
Eric Dumazet 657831ffc3 dccp/tcp: do not inherit mc_list from parent
syzkaller found a way to trigger double frees from ip_mc_drop_socket()

It turns out that leave a copy of parent mc_list at accept() time,
which is very bad.

Very similar to commit 8b485ce698 ("tcp: do not inherit
fastopen_req from parent")

Initial report from Pray3r, completed by Andrey one.
Thanks a lot to them !

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Pray3r <pray3r.z@gmail.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-09 15:17:49 -04:00
Kees Cook f92ceb01c2 DECnet: Use container_of() for embedded struct
Instead of a direct cross-type cast, use conatiner_of() to locate
the embedded structure, even in the face of future struct layout
randomization.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-09 09:39:49 -04:00
David S. Miller 32f1bc0f3d Revert "ipv4: restore rt->fi for reference counting"
This reverts commit 82486aa6f1.

As implemented, this causes dangling netdevice refs.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08 22:35:32 -04:00
Vlastimil Babka f108304872 treewide: convert PF_MEMALLOC manipulations to new helpers
We now have memalloc_noreclaim_{save,restore} helpers for robust setting
and clearing of PF_MEMALLOC.  Let's convert the code which was using the
generic tsk_restore_flags().  No functional change.

[vbabka@suse.cz: in net/core/sock.c the hunk is missing]
Link: http://lkml.kernel.org/r/20170405074700.29871-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: Chris Leech <cleech@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Wouter Verhelst <w@uter.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani 1134e09100 fs: ceph: CURRENT_TIME with ktime_get_real_ts()
CURRENT_TIME is not y2038 safe.  The macro will be deleted and all the
references to it will be replaced by ktime_get_* apis.

struct timespec is also not y2038 safe.  Retain timespec for timestamp
representation here as ceph uses it internally everywhere.  These
references will be changed to use struct timespec64 in a separate patch.

The current_fs_time() api is being changed to use vfs struct inode* as
an argument instead of struct super_block*.

Set the new mds client request r_stamp field using ktime_get_real_ts()
instead of using current_fs_time().

Also, since r_stamp is used as mtime on the server, use timespec_trunc()
to truncate the timestamp, using the right granularity from the
superblock.

This api will be transitioned to be y2038 safe along with vfs.

Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
M:	Ilya Dryomov <idryomov@gmail.com>
M:	"Yan, Zheng" <zyan@redhat.com>
M:	Sage Weil <sage@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Kees Cook 063246641d format-security: move static strings to const
While examining output from trial builds with -Wformat-security enabled,
many strings were found that should be defined as "const", or as a char
array instead of char pointer.  This makes some static analysis easier,
by producing fewer false positives.

As these are all trivial changes, it seemed best to put them all in a
single patch rather than chopping them up per maintainer.

Link: http://lkml.kernel.org/r/20170405214711.GA5711@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Jes Sorensen <jes@trained-monkey.org>	[runner.c]
Cc: Tony Lindgren <tony@atomide.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: David Airlie <airlied@linux.ie>
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Patrice Chotard <patrice.chotard@st.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Mugunthan V N <mugunthanvnm@ti.com>
Cc: Felipe Balbi <felipe.balbi@linux.intel.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Antonio Quartulli <a@unstable.cc>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Kejian Yan <yankejian@huawei.com>
Cc: Daode Huang <huangdaode@hisilicon.com>
Cc: Qianqian Xie <xieqianqian@huawei.com>
Cc: Philippe Reynes <tremyfr@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Christian Gromm <christian.gromm@microchip.com>
Cc: Andrey Shvetsov <andrey.shvetsov@k2l.de>
Cc: Jason Litzinger <jlitzingerdev@gmail.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:14 -07:00
Michal Hocko 19809c2da2 mm, vmalloc: use __GFP_HIGHMEM implicitly
__vmalloc* allows users to provide gfp flags for the underlying
allocation.  This API is quite popular

  $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
  77

The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space.  About half of users don't
use this flag, though.  This signals that we make the API unnecessarily
too complex.

This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
are simplified and drop the flag.

Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko da6bc57a8f net: use kvmalloc with __GFP_REPEAT rather than open coded variant
fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc with
vmalloc fallback.  Use the kvmalloc variant instead.  Keep the
__GFP_REPEAT flag based on explanation from Eric:

 "At the time, tests on the hardware I had in my labs showed that
  vmalloc() could deliver pages spread all over the memory and that was
  a small penalty (once memory is fragmented enough, not at boot time)"

The way how the code is constructed means, however, that we prefer to go
and hit the OOM killer before we fall back to the vmalloc for requests
<=32kB (with 4kB pages) in the current code.  This is rather disruptive
for something that can be achived with the fallback.  On the other hand
__GFP_REPEAT doesn't have any useful semantic for these requests.  So
the effect of this patch is that requests which fit into 32kB will fall
back to vmalloc easier now.

Link: http://lkml.kernel.org/r/20170306103327.2766-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko 752ade68cb treewide: use kv[mz]alloc* rather than opencoded variants
There are many code paths opencoding kvmalloc.  Let's use the helper
instead.  The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator.  E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation.  This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously.  There is no guarantee something like that happens
though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko 847f716f9e net/ipv6/ila/ila_xlat.c: simplify a strange allocation pattern
alloc_ila_locks seemed to c&p from alloc_bucket_locks allocation pattern
which is quite unusual.  The default allocation size is 320 *
sizeof(spinlock_t) which is sub page unless lockdep is enabled when the
performance benefit is really questionable and not worth the subtle code
IMHO.  Also note that the context when we call ila_init_net (modprobe or
a task creating a net namespace) has to be properly configured.

Let's just simplify the code and use kvmalloc helper which is a
transparent way to use kmalloc with vmalloc fallback.

Link: http://lkml.kernel.org/r/20170306103032.2540-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
WANG Cong 242d3a49a2 ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf
For each netns (except init_net), we initialize its null entry
in 3 places:

1) The template itself, as we use kmemdup()
2) Code around dst_init_metrics() in ip6_route_net_init()
3) ip6_route_dev_notify(), which is supposed to initialize it after
   loopback registers

Unfortunately the last one still happens in a wrong order because
we expect to initialize net->ipv6.ip6_null_entry->rt6i_idev to
net->loopback_dev's idev, thus we have to do that after we add
idev to loopback. However, this notifier has priority == 0 same as
ipv6_dev_notf, and ipv6_dev_notf is registered after
ip6_route_dev_notifier so it is called actually after
ip6_route_dev_notifier. This is similar to commit 2f460933f5
("ipv6: initialize route null entry in addrconf_init()") which
fixes init_net.

Fix it by picking a smaller priority for ip6_route_dev_notifier.
Also, we have to release the refcnt accordingly when unregistering
loopback_dev because device exit functions are called before subsys
exit functions.

Acked-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08 17:31:24 -04:00
David S. Miller 29cee56c0b A couple more fixes:
* don't try to authenticate during reconfiguration, which causes
    drivers to get confused
  * fix a kernel-doc warning for a recently merged change
  * fix MU-MIMO group configuration (relevant only for monitor mode)
  * more rate flags fix: remove stray RX_ENC_FLAG_40MHZ
  * fix IBSS probe response allocation size
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlkQhDgACgkQa3t4Rpy0
 AB07Pg/+MGBW/OFoxdsQtq7eGPzVuQXC4NCXjzBunueY/cTpExuzoOWvKTRA+p7o
 pxgqxhngO3l8u/FqhWE7jh+aGxydmq8BhQefrAMi6VgkH7oh4JwP6Mjxf4xGpxZk
 W15oNncNaxLiC4U+GaVUZ0oEsc0fCFuqsmAEGas25VOOQyr4NNJ9jivecOI70bHH
 b1wvCilwDAIeg7CKAtxja40/81ldnm9A7mSABGM6M13AJ1yiNnf9FEteSxAr7s4r
 xx6flFQQzlT+pzoVZeEg0u6yGWqucL/4V9OGcjJcoyLVnbey+1hlypLef4n+Cgol
 yP8yR5n1I3RWsJEsOftfVvjG54e/UAIR4xkGe+LHiWn7XjIK6EbCrmYt1uxknZU7
 LTkFj6b4EHgGPRL0BrIRA4FpQdLeslbwsMF96gWP5VQJE3T4gocuTv4McG2g9Isl
 suiB1zn4Y24UuEHdg+mDlIiEuOr5h4+XvIBOHAw1ZfbVKSKBDLFbqvYF/sHbgsDF
 uR6CvMuGeRM2wOxD8QXFLueNGS7Znrd2ETVx2hx35/qAR/X54nu5kEqcsvjgEamY
 vPUr0RO/+plltQgiBBtTrr6x6uGJg1AsNEyrlMng5hP/mT3yLziU2G8qBozU5e7c
 mDA/7Lz7wk3a6MKVTzt8vaCIcpC42oVhvwS/6kKQ/BZ4GbajdBM=
 =FcxE
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2017-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A couple more fixes:
 * don't try to authenticate during reconfiguration, which causes
   drivers to get confused
 * fix a kernel-doc warning for a recently merged change
 * fix MU-MIMO group configuration (relevant only for monitor mode)
 * more rate flags fix: remove stray RX_ENC_FLAG_40MHZ
 * fix IBSS probe response allocation size
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08 16:02:23 -04:00
Hangbin Liu 8ed508fd4b vti: check nla_put_* return value
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08 15:10:31 -04:00
Vlad Yasevich 8403debeea vlan: Keep NETIF_F_HW_CSUM similar to other software devices
Vlan devices, like all other software devices, enable
NETIF_F_HW_CSUM feature.  However, unlike all the othe other
software devices, vlans will switch to using IP|IPV6_CSUM
features, if the underlying devices uses them.  In these situations,
checksum offload features on the vlan device can't be controlled
via ethtool.

This patch makes vlans keep HW_CSUM feature if the underlying
device supports checksum offloading.  This makes vlan devices
behave like other software devices, and restores control to the
user.

A side-effect is that some offload settings (typically UFO)
may be enabled on the vlan device while being disabled on the HW.
However, the GSO code will correctly process the packets. This
actually results in slightly better raw throughput.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08 14:39:19 -04:00
Wei Wang 1b1fc3fdda tcp: make congestion control optionally skip slow start after idle
Congestion control modules that want full control over congestion
control behavior do not want the cwnd modifications controlled by
the sysctl_tcp_slow_start_after_idle code path.
So skip those code paths for CC modules that use the cong_control()
API.
As an example, those cwnd effects are not desired for the BBR congestion
control algorithm.

Fixes: c0402760f5 ("tcp: new CC hook to set sending rate with rate_sample in any CA state")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08 14:37:07 -04:00
WANG Cong 82486aa6f1 ipv4: restore rt->fi for reference counting
IPv4 dst could use fi->fib_metrics to store metrics but fib_info
itself is refcnt'ed, so without taking a refcnt fi and
fi->fib_metrics could be freed while dst metrics still points to
it. This triggers use-after-free as reported by Andrey twice.

This patch reverts commit 2860583fe8 ("ipv4: Kill rt->fi") to
restore this reference counting. It is a quick fix for -net and
-stable, for -net-next, as Eric suggested, we can consider doing
reference counting for metrics itself instead of relying on fib_info.

IPv6 is very different, it copies or steals the metrics from mx6_config
in fib6_commit_metrics() so probably doesn't need a refcnt.

Decnet has already done the refcnt'ing, see dn_fib_semantic_match().

Fixes: 2860583fe8 ("ipv4: Kill rt->fi")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08 14:35:03 -04:00
Julian Anastasov 3c5ab3f395 ipvs: SNAT packet replies only for NATed connections
We do not check if packet from real server is for NAT
connection before performing SNAT. This causes problems
for setups that use DR/TUN and allow local clients to
access the real server directly, for example:

- local client in director creates IPVS-DR/TUN connection
CIP->VIP and the request packets are routed to RIP.
Talks are finished but IPVS connection is not expired yet.

- second local client creates non-IPVS connection CIP->RIP
with same reply tuple RIP->CIP and when replies are received
on LOCAL_IN we wrongly assign them for the first client
connection because RIP->CIP matches the reply direction.
As result, IPVS SNATs replies for non-IPVS connections.

The problem is more visible to local UDP clients but in rare
cases it can happen also for TCP or remote clients when the
real server sends the reply traffic via the director.

So, better to be more precise for the reply traffic.
As replies are not expected for DR/TUN connections, better
to not touch them.

Reported-by: Nick Moriarty <nick.moriarty@york.ac.uk>
Tested-by: Nick Moriarty <nick.moriarty@york.ac.uk>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2017-05-08 11:38:35 +02:00
Johannes Berg f1f3e9e2a5 mac80211: fix IBSS presp allocation size
When VHT IBSS support was added, the size of the extra elements
wasn't considered in ieee80211_ibss_build_presp(), which makes
it possible that it would overrun the allocated buffer. Fix it
by allocating the necessary space.

Fixes: abcff6ef01 ("mac80211: add VHT support for IBSS")
Reported-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-05-08 11:25:04 +02:00
Johannes Berg 4954601f82 nl80211: correctly validate MU-MIMO groups
Since groups 0 and 63 are invalid, we should check for those bits.
Note that the 802.11 spec specifies the *bit* order, but the CPU
doesn't care about bit order since it can't address bits, so it's
always treating BIT(0) as the lowest bit within a byte.

Reported-by: Jan Fuchs <jan.fuchs@lancom.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-05-08 11:24:34 +02:00
Luca Coelho f8860ce836 mac80211: bail out from prep_connection() if a reconfig is ongoing
If ieee80211_hw_restart() is called during authentication, the
authentication process will continue, causing the driver to be called
in a wrong state.  This ultimately causes an oops in the iwlwifi
driver (at least).

This fixes bugzilla 195299 partly.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195299
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-05-08 11:23:50 +02:00
Ilan Tayari 2c1497bbc8 xfrm: Fix NETDEV_DOWN with IPSec offload
Upon NETDEV_DOWN event, all xfrm_state objects which are bound to
the device are flushed.

The condition for this is wrong, though, testing dev->hw_features
instead of dev->features. If a device has non-user-modifiable
NETIF_F_HW_ESP, then its xfrm_state objects are not flushed,
causing a crash later on after the device is deleted.

Check dev->features instead of dev->hw_features.

Fixes: d77e38e612 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-05-08 09:41:09 +02:00
Steffen Klassert d90c902449 af_key: Fix slab-out-of-bounds in pfkey_compile_policy.
The sadb_x_sec_len is stored in the unit 'byte divided by eight'.
So we have to multiply this value by eight before we can do
size checks. Otherwise we may get a slab-out-of-bounds when
we memcpy the user sec_ctx.

Fixes: df71837d50 ("[LSM-IPSec]: Security association restriction.")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-05-08 08:03:01 +02:00
Linus Torvalds e579dde654 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This is a set of small fixes that were mostly stumbled over during
  more significant development. This proc fix and the fix to
  posix-timers are the most significant of the lot.

  There is a lot of good development going on but unfortunately it
  didn't quite make the merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Fix unbalanced hard link numbers
  signal: Make kill_proc_info static
  rlimit: Properly call security_task_setrlimit
  signal: Remove unused definition of sig_user_definied
  ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
  ipc: Remove unused declaration of recompute_msgmni
  posix-timers: Correct sanity check in posix_cpu_nsleep
  sysctl: Remove dead register_sysctl_root
2017-05-05 11:08:43 -07:00
Eric Dumazet 84b114b984 tcp: randomize timestamps on syncookies
Whole point of randomization was to hide server uptime, but an attacker
can simply start a syn flood and TCP generates 'old style' timestamps,
directly revealing server jiffies value.

Also, TSval sent by the server to a particular remote address vary
depending on syncookies being sent or not, potentially triggering PAWS
drops for innocent clients.

Lets implement proper randomization, including for SYNcookies.

Also we do not need to export sysctl_tcp_timestamps, since it is not
used from a module.

In v2, I added Florian feedback and contribution, adding tsoff to
tcp_get_cookie_sock().

v3 removed one unused variable in tcp_v4_connect() as Florian spotted.

Fixes: 95a22caee3 ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-05 12:00:11 -04:00
Tobias Klauser 9051247dcf bridge: netlink: account for IFLA_BRPORT_{B, M}CAST_FLOOD size and policy
The attribute sizes for IFLA_BRPORT_MCAST_FLOOD and
IFLA_BRPORT_BCAST_FLOOD weren't accounted for in br_port_info_size()
when they were added. Do so now and also add the corresponding policy
entries:

Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Cc: Mike Manning <mmanning@brocade.com>
Fixes: b6cb5ac833 ("net: bridge: add per-port multicast flood flag")
Fixes: 99f906e9ad ("bridge: add per-port broadcast flood flag")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-05 11:21:03 -04:00
Linus Torvalds 4ac4d58488 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) The wireless rate info fix from Johannes Berg.

 2) When a RAW socket is in hdrincl mode, we need to make sure that the
    user provided at least a minimally sized ipv4/ipv6 header. Fix from
    Alexander Potapenko.

 3) We must emit IFLA_PHYS_PORT_NAME netlink attributes using
    nla_put_string() so that it is NULL terminated.

 4) Fix a bug in TCP fastopen handling, wherein child sockets
    erroneously inherit the fastopen_req from the parent, and later can
    end up derefencing freed memory or doing a double free. From Eric
    Dumazet.

 5) Don't clear out netdev stats at close time in tg3 driver, from
    YueHaibing.

 6) Fix refcount leak in xt_CT, from Gao Feng.

 7) In nft_set_bitmap() don't leak dummy elements, from Liping Zhang.

 8) Fix deadlock due to taking the expectation lock twice, also from
    Liping Zhang.

 9) Make xt_socket work again with ipv6, from Peter Tirsek.

10) Don't allow IPV6 to be used with IPVS if ipv6.disable=1, from Paolo
    Abeni.

11) Make the BPF loader more flexible wrt. changes to the bpf MAP entry
    layout. From Jesper Dangaard Brouer.

12) Fix ethtool reported device name in aquantia driver, from Pavel
    Belous.

13) Fix build failures due to the compile time size test not working in
    netfilter conntrack. From Geert Uytterhoeven.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  cfg80211: make RATE_INFO_BW_20 the default
  ipv6: initialize route null entry in addrconf_init()
  qede: Fix possible misconfiguration of advertised autoneg value.
  qed: Fix overriding of supported autoneg value.
  qed*: Fix possible overflow for status block id field.
  rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string
  netvsc: make sure napi enabled before vmbus_open
  aquantia: Fix driver name reported by ethtool
  ipv4, ipv6: ensure raw socket message is big enough to hold an IP header
  net/sched: remove redundant null check on head
  tcp: do not inherit fastopen_req from parent
  forcedeth: remove unnecessary carrier status check
  ibmvnic: Move queue restarting in ibmvnic_tx_complete
  ibmvnic: Record SKB RX queue during poll
  ibmvnic: Continue skb processing after skb completion error
  ibmvnic: Check for driver reset first in ibmvnic_xmit
  ibmvnic: Wait for any pending scrqs entries at driver close
  ibmvnic: Clean up tx pools when closing
  ibmvnic: Whitespace correction in release_rx_pools
  ibmvnic: Delete napi's when releasing driver resources
  ...
2017-05-04 12:26:43 -07:00
Linus Torvalds a96480723c xen: fixes and featrues for 4.12
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABAgAGBQJZChTBAAoJELDendYovxMvkXEIAJDpK5UKMsL1Ihgc0DL0OujQ
 UGxLfWJueSA1X7i8BgL/8vfgKxSEB9SUiM+ooHOKXS6oDhyk2RP4MuCe5+lhUbbv
 ZMK5KxHMlVUOD9EjYif8DhhiwRowBbWYEwr8XgY12s0Ya0a9TQLVC+noGsuzqNiH
 1UyzeeWlBae4nulUMMim6urPNq5AEPVeQKNX3S8rlnDp74IKVZuoISMM62b2KRSr
 +R8FVBshXR/HO53YNY0+AfmmUa8T1+dyjL50Eo/QnsG0i+3igOqNrzSKSc6T+nBt
 Zl3KDUE5W3/OlxuR+CIdZZ1KKtjzoAiR3cvVlHs2z7MIio87bJcYJforAqe6Evo=
 =k6in
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.12b-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:
 "Xen fixes and featrues for 4.12. The main changes are:

   - enable building the kernel with Xen support but without enabling
     paravirtualized mode (Vitaly Kuznetsov)

   - add a new 9pfs xen frontend driver (Stefano Stabellini)

   - simplify Xen's cpuid handling by making use of cpu capabilities
     (Juergen Gross)

   - add/modify some headers for new Xen paravirtualized devices
     (Oleksandr Andrushchenko)

   - EFI reset_system support under Xen (Julien Grall)

   - and the usual cleanups and corrections"

* tag 'for-linus-4.12b-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (57 commits)
  xen: Move xen_have_vector_callback definition to enlighten.c
  xen: Implement EFI reset_system callback
  arm/xen: Consolidate calls to shutdown hypercall in a single helper
  xen: Export xen_reboot
  xen/x86: Call xen_smp_intr_init_pv() on BSP
  xen: Revert commits da72ff5bfc and 72a9b18629
  xen/pvh: Do not fill kernel's e820 map in init_pvh_bootparams()
  xen/scsifront: use offset_in_page() macro
  xen/arm,arm64: rename __generic_dma_ops to xen_get_dma_ops
  xen/arm,arm64: fix xen_dma_ops after 815dd18 "Consolidate get_dma_ops..."
  xen/9pfs: select CONFIG_XEN_XENBUS_FRONTEND
  x86/cpu: remove hypervisor specific set_cpu_features
  vmware: set cpu capabilities during platform initialization
  x86/xen: use capabilities instead of fake cpuid values for xsave
  x86/xen: use capabilities instead of fake cpuid values for x2apic
  x86/xen: use capabilities instead of fake cpuid values for mwait
  x86/xen: use capabilities instead of fake cpuid values for acpi
  x86/xen: use capabilities instead of fake cpuid values for acc
  x86/xen: use capabilities instead of fake cpuid values for mtrr
  x86/xen: use capabilities instead of fake cpuid values for aperf
  ...
2017-05-04 11:37:09 -07:00
WANG Cong 2f460933f5 ipv6: initialize route null entry in addrconf_init()
Andrey reported a crash on init_net.ipv6.ip6_null_entry->rt6i_idev
since it is always NULL.

This is clearly wrong, we have code to initialize it to loopback_dev,
unfortunately the order is still not correct.

loopback_dev is registered very early during boot, we lose a chance
to re-initialize it in notifier. addrconf_init() is called after
ip6_route_init(), which means we have no chance to correct it.

Fix it by moving this initialization explicitly after
ipv6_add_dev(init_net.loopback_dev) in addrconf_init().

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04 12:51:24 -04:00
Michal Schmidt 77ef033b68 rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string
IFLA_PHYS_PORT_NAME is a string attribute, so terminate it with \0.
Otherwise libnl3 fails to validate netlink messages with this attribute.
"ip -detail a" assumes too that the attribute is NUL-terminated when
printing it. It often was, due to padding.

I noticed this as libvirtd failing to start on a system with sfc driver
after upgrading it to Linux 4.11, i.e. when sfc added support for
phys_port_name.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04 11:23:59 -04:00
Alexander Potapenko 86f4c90a1c ipv4, ipv6: ensure raw socket message is big enough to hold an IP header
raw_send_hdrinc() and rawv6_send_hdrinc() expect that the buffer copied
from the userspace contains the IPv4/IPv6 header, so if too few bytes are
copied, parts of the header may remain uninitialized.

This bug has been detected with KMSAN.

For the record, the KMSAN report:

==================================================================
BUG: KMSAN: use of unitialized memory in nf_ct_frag6_gather+0xf5a/0x44a0
inter: 0
CPU: 0 PID: 1036 Comm: probe Not tainted 4.11.0-rc5+ #2455
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x143/0x1b0 lib/dump_stack.c:52
 kmsan_report+0x16b/0x1e0 mm/kmsan/kmsan.c:1078
 __kmsan_warning_32+0x5c/0xa0 mm/kmsan/kmsan_instr.c:510
 nf_ct_frag6_gather+0xf5a/0x44a0 net/ipv6/netfilter/nf_conntrack_reasm.c:577
 ipv6_defrag+0x1d9/0x280 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
 nf_hook_entry_hookfn ./include/linux/netfilter.h:102
 nf_hook_slow+0x13f/0x3c0 net/netfilter/core.c:310
 nf_hook ./include/linux/netfilter.h:212
 NF_HOOK ./include/linux/netfilter.h:255
 rawv6_send_hdrinc net/ipv6/raw.c:673
 rawv6_sendmsg+0x2fcb/0x41a0 net/ipv6/raw.c:919
 inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633
 sock_sendmsg net/socket.c:643
 SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696
 SyS_sendto+0xbc/0xe0 net/socket.c:1664
 do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285
 entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:246
RIP: 0033:0x436e03
RSP: 002b:00007ffce48baf38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000004002b0 RCX: 0000000000436e03
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007ffce48baf90 R08: 00007ffce48baf50 R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000401790 R14: 0000000000401820 R15: 0000000000000000
origin: 00000000d9400053
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:362
 kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:257
 kmsan_poison_shadow+0x6d/0xc0 mm/kmsan/kmsan.c:270
 slab_alloc_node mm/slub.c:2735
 __kmalloc_node_track_caller+0x1f4/0x390 mm/slub.c:4341
 __kmalloc_reserve net/core/skbuff.c:138
 __alloc_skb+0x2cd/0x740 net/core/skbuff.c:231
 alloc_skb ./include/linux/skbuff.h:933
 alloc_skb_with_frags+0x209/0xbc0 net/core/skbuff.c:4678
 sock_alloc_send_pskb+0x9ff/0xe00 net/core/sock.c:1903
 sock_alloc_send_skb+0xe4/0x100 net/core/sock.c:1920
 rawv6_send_hdrinc net/ipv6/raw.c:638
 rawv6_sendmsg+0x2918/0x41a0 net/ipv6/raw.c:919
 inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633
 sock_sendmsg net/socket.c:643
 SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696
 SyS_sendto+0xbc/0xe0 net/socket.c:1664
 do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285
 return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:246
==================================================================

, triggered by the following syscalls:
  socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3
  sendto(3, NULL, 0, 0, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "ff00::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EPERM

A similar report is triggered in net/ipv4/raw.c if we use a PF_INET socket
instead of a PF_INET6 one.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04 11:02:46 -04:00
Colin Ian King 985538eee0 net/sched: remove redundant null check on head
head is previously null checked and so the 2nd null check on head
is redundant and therefore can be removed.

Detected by CoverityScan, CID#1399505 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04 11:01:30 -04:00
Eric Dumazet 8b485ce698 tcp: do not inherit fastopen_req from parent
Under fuzzer stress, it is possible that a child gets a non NULL
fastopen_req pointer from its parent at accept() time, when/if parent
morphs from listener to active session.

We need to make sure this can not happen, by clearing the field after
socket cloning.

BUG: Double free or freeing an invalid pointer
Unexpected shadow byte: 0xFB
CPU: 3 PID: 20933 Comm: syz-executor3 Not tainted 4.11.0+ #306
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x292/0x395 lib/dump_stack.c:52
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
 kasan_report_double_free+0x5c/0x70 mm/kasan/report.c:185
 kasan_slab_free+0x9d/0xc0 mm/kasan/kasan.c:580
 slab_free_hook mm/slub.c:1357 [inline]
 slab_free_freelist_hook mm/slub.c:1379 [inline]
 slab_free mm/slub.c:2961 [inline]
 kfree+0xe8/0x2b0 mm/slub.c:3882
 tcp_free_fastopen_req net/ipv4/tcp.c:1077 [inline]
 tcp_disconnect+0xc15/0x13e0 net/ipv4/tcp.c:2328
 inet_child_forget+0xb8/0x600 net/ipv4/inet_connection_sock.c:898
 inet_csk_reqsk_queue_add+0x1e7/0x250
net/ipv4/inet_connection_sock.c:928
 tcp_get_cookie_sock+0x21a/0x510 net/ipv4/syncookies.c:217
 cookie_v4_check+0x1a19/0x28b0 net/ipv4/syncookies.c:384
 tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1384 [inline]
 tcp_v4_do_rcv+0x731/0x940 net/ipv4/tcp_ipv4.c:1421
 tcp_v4_rcv+0x2dc0/0x31c0 net/ipv4/tcp_ipv4.c:1715
 ip_local_deliver_finish+0x4cc/0xc20 net/ipv4/ip_input.c:216
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ip_local_deliver+0x1ce/0x700 net/ipv4/ip_input.c:257
 dst_input include/net/dst.h:492 [inline]
 ip_rcv_finish+0xb1d/0x20b0 net/ipv4/ip_input.c:396
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ip_rcv+0xd8c/0x19c0 net/ipv4/ip_input.c:487
 __netif_receive_skb_core+0x1ad1/0x3400 net/core/dev.c:4210
 __netif_receive_skb+0x2a/0x1a0 net/core/dev.c:4248
 process_backlog+0xe5/0x6c0 net/core/dev.c:4868
 napi_poll net/core/dev.c:5270 [inline]
 net_rx_action+0xe70/0x18e0 net/core/dev.c:5335
 __do_softirq+0x2fb/0xb99 kernel/softirq.c:284
 do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:899
 </IRQ>
 do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328
 do_softirq kernel/softirq.c:176 [inline]
 __local_bh_enable_ip+0x1cf/0x1e0 kernel/softirq.c:181
 local_bh_enable include/linux/bottom_half.h:31 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:931 [inline]
 ip_finish_output2+0x9ab/0x15e0 net/ipv4/ip_output.c:230
 ip_finish_output+0xa35/0xdf0 net/ipv4/ip_output.c:316
 NF_HOOK_COND include/linux/netfilter.h:246 [inline]
 ip_output+0x1f6/0x7b0 net/ipv4/ip_output.c:404
 dst_output include/net/dst.h:486 [inline]
 ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
 ip_queue_xmit+0x9a8/0x1a10 net/ipv4/ip_output.c:503
 tcp_transmit_skb+0x1ade/0x3470 net/ipv4/tcp_output.c:1057
 tcp_write_xmit+0x79e/0x55b0 net/ipv4/tcp_output.c:2265
 __tcp_push_pending_frames+0xfa/0x3a0 net/ipv4/tcp_output.c:2450
 tcp_push+0x4ee/0x780 net/ipv4/tcp.c:683
 tcp_sendmsg+0x128d/0x39b0 net/ipv4/tcp.c:1342
 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x446059
RSP: 002b:00007faa6761fb58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 0000000000446059
RDX: 0000000000000001 RSI: 0000000020ba3fcd RDI: 0000000000000017
RBP: 00000000006e40a0 R08: 0000000020ba4ff0 R09: 0000000000000010
R10: 0000000020000000 R11: 0000000000000282 R12: 0000000000708150
R13: 0000000000000000 R14: 00007faa676209c0 R15: 00007faa67620700
Object at ffff88003b5bbcb8, in cache kmalloc-64 size: 64
Allocated:
PID = 20909
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
 kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2745
 kmalloc include/linux/slab.h:490 [inline]
 kzalloc include/linux/slab.h:663 [inline]
 tcp_sendmsg_fastopen net/ipv4/tcp.c:1094 [inline]
 tcp_sendmsg+0x221a/0x39b0 net/ipv4/tcp.c:1139
 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1f/0xbe
Freed:
PID = 20909
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
 slab_free_hook mm/slub.c:1357 [inline]
 slab_free_freelist_hook mm/slub.c:1379 [inline]
 slab_free mm/slub.c:2961 [inline]
 kfree+0xe8/0x2b0 mm/slub.c:3882
 tcp_free_fastopen_req net/ipv4/tcp.c:1077 [inline]
 tcp_disconnect+0xc15/0x13e0 net/ipv4/tcp.c:2328
 __inet_stream_connect+0x20c/0xf90 net/ipv4/af_inet.c:593
 tcp_sendmsg_fastopen net/ipv4/tcp.c:1111 [inline]
 tcp_sendmsg+0x23a8/0x39b0 net/ipv4/tcp.c:1139
 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 7db92362d2 ("tcp: fix potential double free issue for fastopen_req")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04 11:00:04 -04:00
Ilya Dryomov 14bb211d32 rbd: support updating the lock cookie without releasing the lock
As we no longer release the lock before potentially raising BLACKLISTED
in rbd_reregister_watch(), the "either locked or blacklisted" assert in
rbd_queue_workfn() needs to go: we can be both locked and blacklisted
at that point now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 09:19:23 +02:00
Jeff Layton 58eb7932ae libceph: add an epoch_barrier field to struct ceph_osd_client
Cephfs can get cap update requests that contain a new epoch barrier in
them. When that happens we want to pause all OSD traffic until the right
map epoch arrives.

Add an epoch_barrier field to ceph_osd_client that is protected by the
osdc->lock rwsem. When the barrier is set, and the current OSD map
epoch is below that, pause the request target when submitting the
request or when revisiting it. Add a way for upper layers (cephfs)
to update the epoch_barrier as well.

If we get a new map, compare the new epoch against the barrier before
kicking requests and request another map if the map epoch is still lower
than the one we want.

If we get a map with a full pool, or at quota condition, then set the
barrier to the current epoch value.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton fc36d0a42c libceph: abort already submitted but abortable requests when map or pool goes full
When a Ceph volume hits capacity, a flag is set in the OSD map to
indicate that, and a new map is sprayed around the cluster. With cephfs
we want it to shut down any abortable requests that are in progress with
an -ENOSPC error as they'd just hang otherwise.

Add a new ceph_osdc_abort_on_full helper function to handle this. It
will first check whether there is an out-of-space condition in the
cluster and then walk the tree and abort any request that has
r_abort_on_full set with a -ENOSPC error. Call this new function
directly whenever we get a new OSD map.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton a1f4020aab libceph: allow requests to return immediately on full conditions if caller wishes
Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.

If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition then allow it to set a
flag to request that behavior.

Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.

A later patch will deal with requests that were submitted before the new
map showing the full condition came in.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton aa26d662b9 libceph: remove req->r_replay_version
Nothing uses this anymore with the removal of the ack vs. commit code.
Remove the field and just encode zeroes into place in the request
encoding.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Elena Reshetova 0e1a5ee657 libceph: convert ceph_pagelist.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova 02113a0f14 libceph: convert ceph_osd.o_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova 06dfa96399 libceph: convert ceph_snap_context.nref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov d6a3408a77 libceph: supported_features module parameter
Add a readonly, exported to sysfs module parameter so that userspace
can generate meaningful error messages.  It's a bit funky, but there is
no other libceph-specific place.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov 74da4a0f57 libceph, ceph: always advertise all supported features
No reason to hide CephFS-specific features in the rbd case.  Recent
feature bits mix RADOS and CephFS-specific stuff together anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Sabrina Dubroca 9b3eb54106 xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
When CONFIG_XFRM_SUB_POLICY=y, xfrm_dst stores a copy of the flowi for
that dst. Unfortunately, the code that allocates and fills this copy
doesn't care about what type of flowi (flowi, flowi4, flowi6) gets
passed. In multiple code paths (from raw_sendmsg, from TCP when
replying to a FIN, in vxlan, geneve, and gre), the flowi that gets
passed to xfrm is actually an on-stack flowi4, so we end up reading
stuff from the stack past the end of the flowi4 struct.

Since xfrm_dst->origin isn't used anywhere following commit
ca116922af ("xfrm: Eliminate "fl" and "pol" args to
xfrm_bundle_ok()."), just get rid of it.  xfrm_dst->partner isn't used
either, so get rid of that too.

Fixes: 9d6ec93801 ("ipv4: Use flowi4 in public route lookup interfaces.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-05-04 07:30:59 +02:00
Steffen Klassert 0e78a87306 esp4: Fix udpencap for local TCP packets.
Locally generated TCP packets are usually cloned, so we
do skb_cow_data() on this packets. After that we need to
reload the pointer to the esp header. On udpencap this
header has an offset to skb_transport_header, so take this
offset into account.

Fixes: 67d349ed60 ("net/esp4: Fix invalid esph pointer crash")
Fixes: fca11ebde3 ("esp4: Reorganize esp_output")
Reported-by: Don Bowman <db@donbowman.ca>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-05-04 07:27:26 +02:00
Linus Torvalds 1684096b1e Updates for 4.12 kernel merge window
- idr usage and locking changes
 - build fix for hns
 - ipoib debug path record file fix
 - hfi1 updates
 - core RDMA netdev addition
 - Intel VNIC driver addition
 - Enhanced accelerators for IPoIB addition
 - Debug cleanups in cxgb3/4
 - Trivial cleanups from SF Markus Elfring
 - Misc rxe fixes from Mellanox
 - Misc ipoib fixes from Mellanox
 - Lots of mlx4/mlx5 changes from Mellanox
 - Misc fixes across the RDMA subsystem
 - ODP paging fixes and improvements
 - qedr updates
 - hfi1 updates
 - OPA port info patches
 - OPA AH patches
 - OPA SA Query patches
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZCfBsAAoJELgmozMOVy/d9GsP/je5/IyEwQOFVxhLM+BooDWy
 wfH/GWLoT4iSxviWtBzukZzrioxjfyFitZzkTWYxHMj3EIb63i52pDUTpes/soGl
 c3ob0SYv5mPB9b1mBZaIyyTWBWrXfm2pNSfyYryhI1cYxNX5ZLlXG51Xd3YxdB3D
 A8avUsCtH17zSb6Mimm04cT47pn5UIkVkcPKZDCir10hj1JiwLVwrWyC7abxLENp
 jHFw4uKQHOV3IN6jevM/tXfUenjALXwBHHKv+lJsBVijDUPTEmDsBiDXsvO++dmN
 Ph5ElY3KPfUmj4wIWIrY4L56j5Kr13Wxc+U8+MWNC6frbcHYoMCaSz3yaU15NLAd
 UYY5blzZsuNXqhgmudeV89qJpXYleW7KCgJQNiBmLkcQL38+ObdLTP0EmsC02K+W
 YpJbwecjNQtcb3KTJGnKCyMc3+Rs0u6Osz6YKuad4l8cNaxUI8NVujB2ru/wBczg
 fqXEunXjr6tEVM39zqwolImicsSSEzBKfpaFvB3D2Re5O22Eos6DM+DveUnzXAFR
 Hof5NhPURr/1aqNog2ymgGbjlg3tL4JAAG1PRBhvSFYywVMjV/LLBPQOgqaQzIU5
 J72jbSikRJYLCJaLFAeM7nNsTQgAMH58G0vhnrFoAjC7MglYaedcvouLjOs1jrpW
 d5f12NtIBIpC6DvQCNvH
 =pgEL
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "More exchaustive description of primary updates in this release:

   - Lots of driver fixes and misc fixes across the board.

   - I had to base on a net-next tree because the IPoIB Accelorator
     patches needed it.

     Unfortunately, it was known to Mellanox that there would need to be
     an IPoIB accelorator patch to the net tree (which left some
     functions turned off by an #ifdef construct to avoid warnings about
     defined but unused functions), then one to the RDMA tree, then a
     fixup that went back and re-enabled the functions in the net tree
     and enabled their use in the rdma tree

     Also, a sparse fix was sent to the net tree after I did my pull,
     and the fixup patch conflicts quite directly with that sparse fix,
     so I'm going to submit the fixup patch towards the end of the merge
     window by itself and based upon your master branch at the time.

   - Two separate rounds of hfi1 fixes, one that got dropped from last
     release because it came in just a day or two before the end of the
     merge window and then the one from this release cycle.

     Of note is that I now have a third series that just landed from
     Intel yesterday. It is not included in this pull request, but I may
     submit it by the end of the week. I'll talk to Intel about
     improving the timing of thier submissions for my workflow.

   - Changes to our idr usage in the RDMA subsystem that will tie into
     our cgroup management and also into the upcoming changes for the
     RDMA kernel<->userspace API.

   - Addition of support for a netdev to be tied to an RDMA device at
     the core level

   - Addition of the VNIC driver from Intel.

     While IPoIB provides IP over InfiniBand (and *only* IP, no lower
     layer protocol headers are allowed or supported), the VNIC driver
     presents a virtual Ethernet device with support for things like
     varying Ethertypes, VLANs, priorities and other features of
     Ethernet.

     The virtual devices are centrally managed by the OPA fabric
     manager, making this (for the time being) a strictly OPA specific
     feature.

   - Improvements to the On-Demand Paging support in the RDMA subsystem.

   - Addition of three significant OPA changes.

     While we added OPA support some time ago (via the hfi1 driver), the
     RDMA subsystem has so far glossed over the areas where OPA and
     InfiniBand differ.

     With this release we are starting to add support for the OPA
     extensions into the RDMA core in the following area: Extended port
     information for OPA is now supported, extended Address Handle
     attributes for OPA are now supported, and extended SA Queries to
     get OPA specific subnet information is now supported.

  Concise summary from the tag:
   - idr usage and locking changes
   - build fix for hns
   - ipoib debug path record file fix
   - hfi1 updates
   - core RDMA netdev addition
   - Intel VNIC driver addition
   - Enhanced accelerators for IPoIB addition
   - Debug cleanups in cxgb3/4
   - Trivial cleanups from SF Markus Elfring
   - Misc rxe fixes from Mellanox
   - Misc ipoib fixes from Mellanox
   - Lots of mlx4/mlx5 changes from Mellanox
   - Misc fixes across the RDMA subsystem
   - ODP paging fixes and improvements
   - qedr updates
   - hfi1 updates
   - OPA port info patches
   - OPA AH patches
   - OPA SA Query patches"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (191 commits)
  infiniband: avoid dereferencing uninitialized dst on error path
  IB/SA: Add OPA addr header
  IB/mlx5: Add port_xmit_wait to counter registers read
  IB/ocrdma: fix out of bounds access to local buffer
  IB/mlx4: Fix incorrect order of formal and actual parameters
  IB/mlx4: Change flush logic so it adheres to the variable name
  mlx5: Fix mlx5_ib_map_mr_sg mr length
  IB/rxe: Don't clamp residual length to mtu
  IB/SA: Add support to query OPA path records
  IB/SA: Add OPA path record type
  IB/SA: Split struct sa_path_rec based on IB and ROCE specific fields
  IB/SA: Introduce path record specific types
  IB/SA: Rename ib_sa_path_rec to sa_path_rec
  IB/CM: Add braces when using sizeof
  IB/core: Define 'opa' rdma_ah_attr type
  IB/core: Define 'ib' and 'roce' rdma_ah_attr types
  IB/core: Use rdma_ah_attr accessor functions
  IB/core: Add accessor functions for rdma_ah_attr fields
  IB/PVRDMA: Rename ib_ah_attr related functions
  IB/mthca: Rename to_ib_ah_attr to to_rdma_ah_attr
  ...
2017-05-03 12:45:55 -07:00
Linus Torvalds 46f0537b1e Merge branch 'stable-4.12' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
 "Fourteen audit patches for v4.12 that span the full range of fixes,
  new features, and internal cleanups.

  We have a patches to move to 64-bit timestamps, convert refcounts from
  atomic_t to refcount_t, track PIDs using the pid struct instead of
  pid_t, convert our own private audit buffer cache to a standard
  kmem_cache, log kernel module names when they are unloaded, and
  normalize the NETFILTER_PKT to make the userspace folks happier.

  From a fixes perspective, the most important is likely the auditd
  connection tracking RCU fix; it was a rather brain dead bug that I'll
  take the blame for, but thankfully it didn't seem to affect many
  people (only one report).

  I think the patch subject lines and commit descriptions do a pretty
  good job of explaining the details and why the changes are important
  so I'll point you there instead of duplicating it here; as usual, if
  you have any questions you know where to find us.

  We also manage to take out more code than we put in this time, that
  always makes me happy :)"

* 'stable-4.12' of git://git.infradead.org/users/pcmoore/audit:
  audit: fix the RCU locking for the auditd_connection structure
  audit: use kmem_cache to manage the audit_buffer cache
  audit: Use timespec64 to represent audit timestamps
  audit: store the auditd PID as a pid struct instead of pid_t
  audit: kernel generated netlink traffic should have a portid of 0
  audit: combine audit_receive() and audit_receive_skb()
  audit: convert audit_watch.count from atomic_t to refcount_t
  audit: convert audit_tree.count from atomic_t to refcount_t
  audit: normalize NETFILTER_PKT
  netfilter: use consistent ipv4 network offset in xt_AUDIT
  audit: log module name on delete_module
  audit: remove unnecessary semicolon in audit_watch_handle_event()
  audit: remove unnecessary semicolon in audit_mark_handle_event()
  audit: remove unnecessary semicolon in audit_field_valid()
2017-05-03 09:21:59 -07:00
David S. Miller 4d89ac2dd5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS/OVS fixes for net

The following patchset contains a rather large batch of Netfilter, IPVS
and OVS fixes for your net tree. This includes fixes for ctnetlink, the
userspace conntrack helper infrastructure, conntrack OVS support,
ebtables DNAT target, several leaks in error path among other. More
specifically, they are:

1) Fix reference count leak in the CT target error path, from Gao Feng.

2) Remove conntrack entry clashing with a matching expectation, patch
   from Jarno Rajahalme.

3) Fix bogus EEXIST when registering two different userspace helpers,
   from Liping Zhang.

4) Don't leak dummy elements in the new bitmap set type in nf_tables,
   from Liping Zhang.

5) Get rid of module autoload from conntrack update path in ctnetlink,
   we don't need autoload at this late stage and it is happening with
   rcu read lock held which is not good. From Liping Zhang.

6) Fix deadlock due to double-acquire of the expect_lock from conntrack
   update path, this fixes a bug that was introduced when the central
   spinlock got removed. Again from Liping Zhang.

7) Safe ct->status update from ctnetlink path, from Liping. The expect_lock
   protection that was selected when the central spinlock was removed was
   not really protecting anything at all.

8) Protect sequence adjustment under ct->lock.

9) Missing socket match with IPv6, from Peter Tirsek.

10) Adjust skb->pkt_type of DNAT'ed frames from ebtables, from
    Linus Luessing.

11) Don't give up on evaluating the expression on new entries added via
    dynset expression in nf_tables, from Liping Zhang.

12) Use skb_checksum() when mangling icmpv6 in IPv6 NAT as this deals
    with non-linear skbuffs.

13) Don't allow IPv6 service in IPVS if no IPv6 support is available,
    from Paolo Abeni.

14) Missing mutex release in error path of xt_find_table_lock(), from
    Dan Carpenter.

15) Update maintainers files, Netfilter section. Add Florian to the
    file, refer to nftables.org and change project status from Supported
    to Maintained.

16) Bail out on mismatching extensions in element updates in nf_tables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-03 10:11:26 -04:00
Geert Uytterhoeven ab71632c45 netfilter: conntrack: Force inlining of build check to prevent build failure
If gcc (e.g. 4.1.2) decides not to inline total_extension_size(), the
build will fail with:

    net/built-in.o: In function `nf_conntrack_init_start':
    (.text+0x9baf6): undefined reference to `__compiletime_assert_1893'

or

    ERROR: "__compiletime_assert_1893" [net/netfilter/nf_conntrack.ko] undefined!

Fix this by forcing inlining of total_extension_size().

Fixes: b3a5db109e ("netfilter: conntrack: use u8 for extension sizes again")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-03 09:51:26 -04:00
David Ahern 6d717134a1 net: ipv6: Do not duplicate DAD on link up
Andrey reported a warning triggered by the rcu code:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 5911 at lib/debugobjects.c:289
debug_print_object+0x175/0x210
ODEBUG: activate active (active state 1) object type: rcu_head hint:
        (null)
Modules linked in:
CPU: 1 PID: 5911 Comm: a.out Not tainted 4.11.0-rc8+ #271
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x192/0x22d lib/dump_stack.c:52
 __warn+0x19f/0x1e0 kernel/panic.c:549
 warn_slowpath_fmt+0xe0/0x120 kernel/panic.c:564
 debug_print_object+0x175/0x210 lib/debugobjects.c:286
 debug_object_activate+0x574/0x7e0 lib/debugobjects.c:442
 debug_rcu_head_queue kernel/rcu/rcu.h:75
 __call_rcu.constprop.76+0xff/0x9c0 kernel/rcu/tree.c:3229
 call_rcu_sched+0x12/0x20 kernel/rcu/tree.c:3288
 rt6_rcu_free net/ipv6/ip6_fib.c:158
 rt6_release+0x1ea/0x290 net/ipv6/ip6_fib.c:188
 fib6_del_route net/ipv6/ip6_fib.c:1461
 fib6_del+0xa42/0xdc0 net/ipv6/ip6_fib.c:1500
 __ip6_del_rt+0x100/0x160 net/ipv6/route.c:2174
 ip6_del_rt+0x140/0x1b0 net/ipv6/route.c:2187
 __ipv6_ifa_notify+0x269/0x780 net/ipv6/addrconf.c:5520
 addrconf_ifdown+0xe60/0x1a20 net/ipv6/addrconf.c:3672
...

Andrey's reproducer program runs in a very tight loop, calling
'unshare -n' and then spawning 2 sets of 14 threads running random ioctl
calls. The relevant networking sequence:

1. New network namespace created via unshare -n
- ip6tnl0 device is created in down state

2. address added to ip6tnl0
- equivalent to ip -6 addr add dev ip6tnl0 fd00::bb/1
- DAD is started on the address and when it completes the host
  route is inserted into the FIB

3. ip6tnl0 is brought up
- the new fixup_permanent_addr function restarts DAD on the address

4. exit namespace
- teardown / cleanup sequence starts
- once in a blue moon, lo teardown appears to happen BEFORE teardown
  of ip6tunl0
  + down on 'lo' removes the host route from the FIB since the dst->dev
    for the route is loobback
  + host route added to rcu callback list
    * rcu callback has not run yet, so rt is NOT on the gc list so it has
      NOT been marked obsolete

5. in parallel to 4. worker_thread runs addrconf_dad_completed
- DAD on the address on ip6tnl0 completes
- calls ipv6_ifa_notify which inserts the host route

All of that happens very quickly. The result is that a host route that
has been deleted from the IPv6 FIB and added to the RCU list is re-inserted
into the FIB.

The exit namespace eventually gets to cleaning up ip6tnl0 which removes the
host route from the FIB again, calls the rcu function for cleanup -- and
triggers the double rcu trace.

The root cause is duplicate DAD on the address -- steps 2 and 3. Arguably,
DAD should not be started in step 2. The interface is in the down state,
so it can not really send out requests for the address which makes starting
DAD pointless.

Since the second DAD was introduced by a recent change, seems appropriate
to use it for the Fixes tag and have the fixup function only start DAD for
addresses in the PREDAD state which occurs in addrconf_ifdown if the
address is retained.

Big thanks to Andrey for isolating a reliable reproducer for this problem.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-03 09:45:56 -04:00
Pablo Neira Ayuso 9744a6fcef netfilter: nf_tables: check if same extensions are set when adding elements
If no NLM_F_EXCL is set and the element already exists in the set, make
sure that both elements have the same extensions.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-03 10:58:00 +02:00
Linus Torvalds 8d65b08deb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Millar:
 "Here are some highlights from the 2065 networking commits that
  happened this development cycle:

   1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

   2) Add a generic XDP driver, so that anyone can test XDP even if they
      lack a networking device whose driver has explicit XDP support
      (me).

   3) Sparc64 now has an eBPF JIT too (me)

   4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
      Starovoitov)

   5) Make netfitler network namespace teardown less expensive (Florian
      Westphal)

   6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

   7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

   8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

   9) Multiqueue support in stmmac driver (Joao Pinto)

  10) Remove TCP timewait recycling, it never really could possibly work
      well in the real world and timestamp randomization really zaps any
      hint of usability this feature had (Soheil Hassas Yeganeh)

  11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
      Aleksandrov)

  12) Add socket busy poll support to epoll (Sridhar Samudrala)

  13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
      and several others)

  14) IPSEC hw offload infrastructure (Steffen Klassert)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
  tipc: refactor function tipc_sk_recv_stream()
  tipc: refactor function tipc_sk_recvmsg()
  net: thunderx: Optimize page recycling for XDP
  net: thunderx: Support for XDP header adjustment
  net: thunderx: Add support for XDP_TX
  net: thunderx: Add support for XDP_DROP
  net: thunderx: Add basic XDP support
  net: thunderx: Cleanup receive buffer allocation
  net: thunderx: Optimize CQE_TX handling
  net: thunderx: Optimize RBDR descriptor handling
  net: thunderx: Support for page recycling
  ipx: call ipxitf_put() in ioctl error path
  net: sched: add helpers to handle extended actions
  qed*: Fix issues in the ptp filter config implementation.
  qede: Fix concurrency issue in PTP Tx path processing.
  stmmac: Add support for SIMATIC IOT2000 platform
  net: hns: fix ethtool_get_strings overflow in hns driver
  tcp: fix wraparound issue in tcp_lp
  bpf, arm64: fix jit branch offset related to ldimm64
  bpf, arm64: implement jiting of BPF_XADD
  ...
2017-05-02 16:40:27 -07:00
Linus Torvalds 5a0387a8a8 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "Here is the crypto update for 4.12:

  API:
   - Add batch registration for acomp/scomp
   - Change acomp testing to non-unique compressed result
   - Extend algorithm name limit to 128 bytes
   - Require setkey before accept(2) in algif_aead

  Algorithms:
   - Add support for deflate rfc1950 (zlib)

  Drivers:
   - Add accelerated crct10dif for powerpc
   - Add crc32 in stm32
   - Add sha384/sha512 in ccp
   - Add 3des/gcm(aes) for v5 devices in ccp
   - Add Queue Interface (QI) backend support in caam
   - Add new Exynos RNG driver
   - Add ThunderX ZIP driver
   - Add driver for hardware random generator on MT7623 SoC"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (101 commits)
  crypto: stm32 - Fix OF module alias information
  crypto: algif_aead - Require setkey before accept(2)
  crypto: scomp - add support for deflate rfc1950 (zlib)
  crypto: scomp - allow registration of multiple scomps
  crypto: ccp - Change ISR handler method for a v5 CCP
  crypto: ccp - Change ISR handler method for a v3 CCP
  crypto: crypto4xx - rename ce_ring_contol to ce_ring_control
  crypto: testmgr - Allow ecb(cipher_null) in FIPS mode
  Revert "crypto: arm64/sha - Add constant operand modifier to ASM_EXPORT"
  crypto: ccp - Disable interrupts early on unload
  crypto: ccp - Use only the relevant interrupt bits
  hwrng: mtk - Add driver for hardware random generator on MT7623 SoC
  dt-bindings: hwrng: Add Mediatek hardware random generator bindings
  crypto: crct10dif-vpmsum - Fix missing preempt_disable()
  crypto: testmgr - replace compression known answer test
  crypto: acomp - allow registration of multiple acomps
  hwrng: n2 - Use devm_kcalloc() in n2rng_probe()
  crypto: chcr - Fix error handling related to 'chcr_alloc_shash'
  padata: get_next is never NULL
  crypto: exynos - Add new Exynos RNG driver
  ...
2017-05-02 15:53:46 -07:00
Michael S. Tsirkin 9b2bbdb227 virtio: wrap find_vqs
We are going to add more parameters to find_vqs, let's wrap the call so
we don't need to tweak all drivers every time.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-05-02 23:41:42 +03:00
Jon Paul Maloy ec8a09fbbe tipc: refactor function tipc_sk_recv_stream()
We try to make this function more readable by improving variable names
and comments, using more stack variables, and doing some smaller changes
to the logics. We also rename the function to make it consistent with
naming conventions used elsewhere in the code.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:56:54 -04:00
Jon Paul Maloy e9f8b10101 tipc: refactor function tipc_sk_recvmsg()
We try to make this function more readable by improving variable names
and comments, plus some minor changes to the logics.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:56:54 -04:00
Dan Carpenter ee0d8d8482 ipx: call ipxitf_put() in ioctl error path
We should call ipxitf_put() if the copy_to_user() fails.

Reported-by: 李强 <liqiang6-s@360.cn>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:34:53 -04:00
Jiri Pirko 9da3242e6a net: sched: add helpers to handle extended actions
Jump is now the only one using value action opcode. This is going to
change soon. So introduce helpers to work with this. Convert TC_ACT_JUMP.

This also fixes the TC_ACT_JUMP check, which is incorrectly done as a
bit check, not a value check.

Fixes: e0ee84ded7 ("net sched actions: Complete the JUMPX opcode")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:33:54 -04:00
Eric Dumazet a9f11f963a tcp: fix wraparound issue in tcp_lp
Be careful when comparing tcp_time_stamp to some u32 quantity,
otherwise result can be surprising.

Fixes: 7c106d7e78 ("[TCP]: TCP Low Priority congestion control")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:07:02 -04:00
Linus Torvalds da7b66ffb2 Merge branch 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice updates from Al Viro:
 "These actually missed the last cycle; the branch itself is from last
  December"

* 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make nr_pages calculation in default_file_splice_read() a bit less ugly
  splice/tee/vmsplice: validate flags
  splice_pipe_desc: kill ->flags
  remove spd_release_page()
2017-05-02 11:38:06 -07:00
Linus Torvalds 5b13475a5e Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "Cleanups that sat in -next + -stable fodder that has just missed 4.11.

  There's more iov_iter work in my local tree, but I'd prefer to push
  the stuff that had been in -next first"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  iov_iter: don't revert iov buffer if csum error
  generic_file_read_iter(): make use of iov_iter_revert()
  generic_file_direct_write(): make use of iov_iter_revert()
  orangefs: use iov_iter_revert()
  sctp: switch to copy_from_iter_full()
  net/9p: switch to copy_from_iter_full()
  switch memcpy_from_msg() to copy_from_iter_full()
  rds: make use of iov_iter_revert()
2017-05-02 11:18:50 -07:00
David Miller 586f852597 bpf: Align packet data properly in program testing framework.
Make sure we apply NET_IP_ALIGN when reserving headroom for SKB
and XDP test runs, just like a real driver would.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2017-05-02 11:46:28 -04:00
David Miller 78e5227237 bpf: Do not dereference user pointer in bpf_test_finish().
Instead, pass the kattr in which has a kernel side copy of this
data structure from userspace already.

Fix based upon a suggestion from Alexei Starovoitov.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2017-05-02 11:46:28 -04:00
Richard Guy Briggs 2173c519d5 audit: normalize NETFILTER_PKT
Eliminate flipping in and out of message fields, dropping fields in the
process.

Sample raw message format IPv4 UDP:
type=NETFILTER_PKT msg=audit(1487874761.386:228):  mark=0xae8a2732 saddr=127.0.0.1 daddr=127.0.0.1 proto=17^]
Sample raw message format IPv6 ICMP6:
type=NETFILTER_PKT msg=audit(1487874761.381:227):  mark=0x223894b7 saddr=::1 daddr=::1 proto=58^]

Issue: https://github.com/linux-audit/audit-kernel/issues/11
Test case: https://github.com/linux-audit/audit-testsuite/issues/43

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-05-02 10:16:04 -04:00
Richard Guy Briggs 0cb88b6ff0 netfilter: use consistent ipv4 network offset in xt_AUDIT
Even though the skb->data pointer has been moved from the link layer
header to the network layer header, use the same method to calculate the
offset in ipv4 and ipv6 routines.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: munged subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-05-02 10:16:04 -04:00
Jakub Kicinski b5d60989c6 xdp: fix parameter kdoc for extack
Fix kdoc parameter spelling from extact to extack.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 09:29:35 -04:00
Arnd Bergmann 4a80601622 xen/9pfs: select CONFIG_XEN_XENBUS_FRONTEND
All Xen frontends need to select this symbol to avoid a link error:

net/built-in.o: In function `p9_trans_xen_init':
:(.text+0x149e9c): undefined reference to `__xenbus_register_frontend'

Fixes: d4b40a02f837 ("xen/9pfs: build 9pfs Xen transport driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2017-05-02 11:14:36 +02:00
Stefano Stabellini 31d47266c6 xen/9pfs: initialize len to 0 to detect xenbus_read errors
In order to use "len" to check for xenbus_read errors properly, we need
to initialize len to 0 before passing it to xenbus_read.

CC: dan.carpenter@oracle.com
CC: jgross@suse.com
CC: boris.ostrovsky@oracle.com
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net
Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2017-05-02 11:13:23 +02:00
Stefano Stabellini 7f25483a88 xen/9pfs: build 9pfs Xen transport driver
This patch adds a Kconfig option and Makefile support for building the
9pfs Xen driver.

CC: groug@kaod.org
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net

Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2017-05-02 11:11:49 +02:00
Stefano Stabellini f66c72bea1 xen/9pfs: receive responses
Upon receiving a notification from the backend, schedule the
p9_xen_response work_struct. p9_xen_response checks if any responses are
available, if so, it reads them one by one, calling p9_client_cb to send
them up to the 9p layer (p9_client_cb completes the request). Handle the
ring following the Xen 9pfs specification.

CC: groug@kaod.org
CC: jgross@suse.com
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net

Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2017-05-02 11:11:43 +02:00
Stefano Stabellini f023f18ddf xen/9pfs: send requests to the backend
Implement struct p9_trans_module create and close functions by looking
at the available Xen 9pfs frontend-backend connections. We don't expect
many frontend-backend connections, thus walking a list is OK.

Send requests to the backend by copying each request to one of the
available rings (each frontend-backend connection comes with multiple
rings). Handle the ring and notifications following the 9pfs
specification. If there are not enough free bytes on the ring for the
request, wait on the wait_queue: the backend will send a notification
after consuming more requests.

CC: groug@kaod.org
CC: jgross@suse.com
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net

Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2017-05-02 11:11:37 +02:00
Stefano Stabellini 71ebd71921 xen/9pfs: connect to the backend
Implement functions to handle the xenbus handshake. Upon connection,
allocate the rings according to the protocol specification.

Initialize a work_struct and a wait_queue. The work_struct will be used
to schedule work upon receiving an event channel notification from the
backend. The wait_queue will be used to wait when the ring is full and
we need to send a new request.

CC: groug@kaod.org
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net

Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2017-05-02 11:11:28 +02:00
Stefano Stabellini 868eb12273 xen/9pfs: introduce Xen 9pfs transport driver
Introduce the Xen 9pfs transport driver: add struct xenbus_driver to
register as a xenbus driver and add struct p9_trans_module to register
as v9fs driver.

All functions are empty stubs for now.

CC: groug@kaod.org
CC: jgross@suse.com
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net

Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2017-05-02 11:11:17 +02:00
Linus Torvalds 3527d3e951 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - another round of rq-clock handling debugging, robustization and
     fixes

   - PELT accounting improvements

   - CPU hotplug related ->cpus_allowed affinity handling fixes all
     around the tree

   - ... plus misc fixes, cleanups and updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  sched/x86: Update reschedule warning text
  crypto: N2 - Replace racy task affinity logic
  cpufreq/sparc-us2e: Replace racy task affinity logic
  cpufreq/sparc-us3: Replace racy task affinity logic
  cpufreq/sh: Replace racy task affinity logic
  cpufreq/ia64: Replace racy task affinity logic
  ACPI/processor: Replace racy task affinity logic
  ACPI/processor: Fix error handling in __acpi_processor_start()
  sparc/sysfs: Replace racy task affinity logic
  powerpc/smp: Replace open coded task affinity logic
  ia64/sn/hwperf: Replace racy task affinity logic
  ia64/salinfo: Replace racy task affinity logic
  workqueue: Provide work_on_cpu_safe()
  ia64/topology: Remove cpus_allowed manipulation
  sched/fair: Move the PELT constants into a generated header
  sched/fair: Increase PELT accuracy for small tasks
  sched/fair: Fix comments
  sched/Documentation: Add 'sched-pelt' tool
  sched/fair: Fix corner case in __accumulate_sum()
  sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags()
  ...
2017-05-01 19:12:53 -07:00
Linus Torvalds 5db6db0d40 Merge branch 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess unification updates from Al Viro:
 "This is the uaccess unification pile. It's _not_ the end of uaccess
  work, but the next batch of that will go into the next cycle. This one
  mostly takes copy_from_user() and friends out of arch/* and gets the
  zero-padding behaviour in sync for all architectures.

  Dealing with the nocache/writethrough mess is for the next cycle;
  fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
  sold on access_ok() in there, BTW; just not in this pile), same for
  reducing __copy_... callsites, strn*... stuff, etc. - there will be a
  pile about as large as this one in the next merge window.

  This one sat in -next for weeks. -3KLoC"

* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
  HAVE_ARCH_HARDENED_USERCOPY is unconditional now
  CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
  m32r: switch to RAW_COPY_USER
  hexagon: switch to RAW_COPY_USER
  microblaze: switch to RAW_COPY_USER
  get rid of padding, switch to RAW_COPY_USER
  ia64: get rid of copy_in_user()
  ia64: sanitize __access_ok()
  ia64: get rid of 'segment' argument of __do_{get,put}_user()
  ia64: get rid of 'segment' argument of __{get,put}_user_check()
  ia64: add extable.h
  powerpc: get rid of zeroing, switch to RAW_COPY_USER
  esas2r: don't open-code memdup_user()
  alpha: fix stack smashing in old_adjtimex(2)
  don't open-code kernel_setsockopt()
  mips: switch to RAW_COPY_USER
  mips: get rid of tail-zeroing in primitives
  mips: make copy_from_user() zero tail explicitly
  mips: clean and reorder the forest of macros...
  mips: consolidate __invoke_... wrappers
  ...
2017-05-01 14:41:04 -07:00
David S. Miller 5b8481fa42 ipv6: Need to export ipv6_push_frag_opts for tunneling now.
Since that change also made the nfrag function not necessary
for exports, remove it.

Fixes: 89a23c8b52 ("ip6_tunnel: Fix missing tunnel encapsulation limit option")
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 15:10:20 -04:00
Ilan Tayari 152afb9b45 xfrm: Indicate xfrm_state offload errors
Current code silently ignores driver errors when configuring
IPSec offload xfrm_state, and falls back to host-based crypto.

Fail the xfrm_state creation if the driver has an error, because
the NIC offloading was explicitly requested by the user program.

This will communicate back to the user that there was an error.

Fixes: d77e38e612 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 14:59:39 -04:00
Ilan Tayari 67d349ed60 net/esp4: Fix invalid esph pointer crash
Both esp_output and esp_xmit take a pointer to the ESP header
and place it in esp_info struct prior to calling esp_output_head.

Inside esp_output_head, the call to esp_output_udp_encap
makes sure to update the pointer if it gets invalid.
However, if esp_output_head itself calls skb_cow_data, the
pointer is not updated and stays invalid, causing a crash
after esp_output_head returns.

Update the pointer if it becomes invalid in esp_output_head

Fixes: fca11ebde3 ("esp4: Reorganize esp_output")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 14:58:50 -04:00
Craig Gallek 89a23c8b52 ip6_tunnel: Fix missing tunnel encapsulation limit option
The IPv6 tunneling code tries to insert IPV6_TLV_TNL_ENCAP_LIMIT and
IPV6_TLV_PADN options when an encapsulation limit is defined (the
default is a limit of 4).  An MTU adjustment is done to account for
these options as well.  However, the options are never present in the
generated packets.

The issue appears to be a subtlety between IPV6_DSTOPTS and
IPV6_RTHDRDSTOPTS defined in RFC 3542.  When the IPIP tunnel driver was
written, the encap limit options were included as IPV6_RTHDRDSTOPTS in
dst0opt of struct ipv6_txoptions.  Later, ipv6_push_nfrags_opts was
(correctly) updated to require IPV6_RTHDR options when IPV6_RTHDRDSTOPTS
are to be used.  This caused the options to no longer be included in v6
encapsulated packets.

The fix is to use IPV6_DSTOPTS (in dst1opt of struct ipv6_txoptions)
instead.  IPV6_DSTOPTS do not have the additional IPV6_RTHDR requirement.

Fixes: 1df64a8569c7: ("[IPV6]: Add ip6ip6 tunnel driver.")
Fixes: 333fad5364d6: ("[IPV6]: Support several new sockopt / ancillary data in Advanced API (RFC3542)")
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 14:52:45 -04:00
Ding Tianhong a6a5993243 iov_iter: don't revert iov buffer if csum error
The patch 3278682123 (make skb_copy_datagram_msg() et.al. preserve
->msg_iter on error) will revert the iov buffer if copy to iter
failed, but it didn't copy any datagram if the skb_checksum_complete
error, so no need to revert any data at this place.

v2: Sabrina notice that return -EFAULT when checksum error is not correct
    here, it would confuse the caller about the return value, so fix it.

Fixes: 3278682123 ("make skb_copy_datagram_msg() et.al. preserve->msg_iter on error")
Cc: stable@vger.kernel.org # v4.11
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-01 14:49:53 -04:00
David S. Miller f9ed236c2a Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2017-04-30

Here's one last batch of Bluetooth patches in the bluetooth-next tree
targeting the 4.12 kernel.

 - Remove custom ECDH implementation and use new KPP API instead
 - Add protocol checks to hci_ldisc
 - Add module license to HCI UART Nokia H4+ driver
 - Minor fix for 32bit user space - 64 bit kernel combination

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 14:34:46 -04:00
Dasaratharaman Chandramouli 44c58487d5 IB/core: Define 'ib' and 'roce' rdma_ah_attr types
rdma_ah_attr can now be either ib or roce allowing
core components to use one type or the other and also
to define attributes unique to a specific type. struct
ib_ah is also initialized with the type when its first
created. This ensures that calls such as modify_ah
dont modify the type of the address handle attribute.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-05-01 14:32:43 -04:00
Dasaratharaman Chandramouli d8966fcd4c IB/core: Use rdma_ah_attr accessor functions
Modify core and driver components to use accessor functions
introduced to access individual fields of rdma_ah_attr

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-05-01 14:32:43 -04:00
Linus Torvalds 694752922b Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:

 - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
   was initially a fork of CFQ, but subsequently changed to implement
   fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
   to be used on desktop type single drives, providing good fairness.
   From Paolo.

 - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
   using a scalable token based algorithm that throttles IO based on
   live completion IO stats, similary to blk-wbt. From Omar.

 - A series from Jan, moving users to separately allocated backing
   devices. This continues the work of separating backing device life
   times, solving various problems with hot removal.

 - A series of updates for lightnvm, mostly from Javier. Includes a
   'pblk' target that exposes an open channel SSD as a physical block
   device.

 - A series of fixes and improvements for nbd from Josef.

 - A series from Omar, removing queue sharing between devices on mostly
   legacy drivers. This helps us clean up other bits, if we know that a
   queue only has a single device backing. This has been overdue for
   more than a decade.

 - Fixes for the blk-stats, and improvements to unify the stats and user
   windows. This both improves blk-wbt, and enables other users to
   register a need to receive IO stats for a device. From Omar.

 - blk-throttle improvements from Shaohua. This provides a scalable
   framework for implementing scalable priotization - particularly for
   blk-mq, but applicable to any type of block device. The interface is
   marked experimental for now.

 - Bucketized IO stats for IO polling from Stephen Bates. This improves
   efficiency of polled workloads in the presence of mixed block size
   IO.

 - A few fixes for opal, from Scott.

 - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
   From a variety of folks, mostly Sagi and James Smart.

 - A series from Bart, improving our exposed info and capabilities from
   the blk-mq debugfs support.

 - A series from Christoph, cleaning up how handle WRITE_ZEROES.

 - A series from Christoph, cleaning up the block layer handling of how
   we track errors in a request. On top of being a nice cleanup, it also
   shrinks the size of struct request a bit.

 - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
   never used by platforms, and the latter has outlived it's usefulness.

 - Various little bug fixes and cleanups from a wide variety of folks.

* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
  block: hide badblocks attribute by default
  blk-mq: unify hctx delay_work and run_work
  block: add kblock_mod_delayed_work_on()
  blk-mq: unify hctx delayed_run_work and run_work
  nbd: fix use after free on module unload
  MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
  blk-mq-sched: alloate reserved tags out of normal pool
  mtip32xx: use runtime tag to initialize command header
  scsi: Implement blk_mq_ops.show_rq()
  blk-mq: Add blk_mq_ops.show_rq()
  blk-mq: Show operation, cmd_flags and rq_flags names
  blk-mq: Make blk_flags_show() callers append a newline character
  blk-mq: Move the "state" debugfs attribute one level down
  blk-mq: Unregister debugfs attributes earlier
  blk-mq: Only unregister hctxs for which registration succeeded
  blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
  blk-mq: Let blk_mq_debugfs_register() look up the queue name
  blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
  ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
  ide-pm: always pass 0 error to __blk_end_request_all
  ..
2017-05-01 10:39:57 -07:00
Benjamin LaHaise 1a7fca63cd flower: check unused bits in MPLS fields
Since several of the the netlink attributes used to configure the flower
classifier's MPLS TC, BOS and Label fields have additional bits which are
unused, check those bits to ensure that they are actually 0 as suggested
by Jamal.

Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com>
Cc: David Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: Jakub Kicinski <kubakici@wp.pl>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 11:12:21 -04:00
David S. Miller a01aa920b8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. A large bunch of code cleanups, simplify the conntrack extension
codebase, get rid of the fake conntrack object, speed up netns by
selective synchronize_net() calls. More specifically, they are:

1) Check for ct->status bit instead of using nfct_nat() from IPVS and
   Netfilter codebase, patch from Florian Westphal.

2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.

3) Simplify FTP IPVS helper module registration path, from Arushi Singhal.

4) Introduce nft_is_base_chain() helper function.

5) Enforce expectation limit from userspace conntrack helper,
   from Gao Feng.

6) Add nf_ct_remove_expect() helper function, from Gao Feng.

7) NAT mangle helper function return boolean, from Gao Feng.

8) ctnetlink_alloc_expect() should only work for conntrack with
   helpers, from Gao Feng.

9) Add nfnl_msg_type() helper function to nfnetlink to build the
   netlink message type.

10) Get rid of unnecessary cast on void, from simran singhal.

11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
    also from simran singhal.

12) Use list_prev_entry() from nf_tables, from simran signhal.

13) Remove unnecessary & on pointer function in the Netfilter and IPVS
    code.

14) Remove obsolete comment on set of rules per CPU in ip6_tables,
    no longer true. From Arushi Singhal.

15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.

16) Remove unnecessary nested rcu_read_lock() in
    __nf_nat_decode_session(). Code running from hooks are already
    guaranteed to run under RCU read side.

17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.

18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
    also from Aaron.

19) Get rid of unsed __ip_set_get_netlink(), from Aaron Conole.

20) Don't propagate NF_DROP error to userspace via ctnetlink in
    __nf_nat_alloc_null_binding() function, from Gao Feng.

21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
    from Gao Feng.

22) Kill the fake untracked conntrack objects, use ctinfo instead to
    annotate a conntrack object is untracked, from Florian Westphal.

23) Remove nf_ct_is_untracked(), now obsolete since we have no
    conntrack template anymore, from Florian.

24) Add event mask support to nft_ct, also from Florian.

25) Move nf_conn_help structure to
    include/net/netfilter/nf_conntrack_helper.h.

26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
    Thus, we don't deal with variable conntrack extensions anymore.
    Make sure userspace conntrack helper doesn't go over that size.
    Remove variable size ct extension infrastructure now this code
    got no more clients. From Florian Westphal.

27) Restore offset and length of nf_ct_ext structure to 8 bytes now
    that wraparound is not possible any longer, also from Florian.

28) Allow to get rid of unassured flows under stress in conntrack,
    this applies to DCCP, SCTP and TCP protocols, from Florian.

29) Shrink size of nf_conntrack_ecache structure, from Florian.

30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
    from Gao Feng.

31) Register SYNPROXY hooks on demand, from Florian Westphal.

32) Use pernet hook whenever possible, instead of global hook
    registration, from Florian Westphal.

33) Pass hook structure to ebt_register_table() to consolidate some
    infrastructure code, from Florian Westphal.

34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
    SYNPROXY code, to make sure device stats are not fooled, patch
    from Gao Feng.

35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we
    don't need anymore if we just select a fixed size instead of
    expensive runtime time calculation of this. From Florian.

36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
    from Florian.

37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
    Florian.

38) Attach NAT extension on-demand from masquerade and pptp helper
    path, from Florian.

39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.

40) Speed up netns by selective calls of synchronize_net(), from
    Florian Westphal.

41) Silence stack size warning gcc in 32-bit arch in snmp helper,
    from Florian.

42) Inconditionally call nf_ct_ext_destroy(), even if we have no
    extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
    Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:47:53 -04:00
Jakub Kicinski ddf9f97076 xdp: propagate extended ack to XDP setup
Drivers usually have a number of restrictions for running XDP
- most common being buffer sizes, LRO and number of rings.
Even though some drivers try to be helpful and print error
messages experience shows that users don't often consult
kernel logs on netlink errors.  Try to use the new extended
ack mechanism to carry the message back to user space.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:35:47 -04:00
Liping Zhang 8eeef23504 netfilter: nf_ct_ext: invoke destroy even when ext is not attached
For NF_NAT_MANIP_SRC, we will insert the ct to the nat_bysource_table,
then remove it from the nat_bysource_table via nat_extend->destroy.

But now, the nat extension is attached on demand, so if the nat extension
is not attached, we will not be notified when the ct is destroyed, i.e.
we may fail to remove ct from the nat_bysource_table.

So just keep it simple, even if the extension is not attached, we will
still invoke the related ext->destroy. And this will also preserve the
flexibility for the future extension.

Fixes: 9a08ecfe74 ("netfilter: don't attach a nat extension by default")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:48:49 +02:00
Pablo Neira Ayuso d1908ca8dc Merge tag 'ipvs3-for-v4.12' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Third Round of IPVS Updates for v4.12

please consider these enhancements to IPVS for v4.12.
If it is too late for v4.12 then please consider them for v4.13.

* Remove unused function
* Correct comparison of unsigned value
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:46:50 +02:00
Florian Westphal 0e72f55f35 netfilter: snmp: avoid stack size warning
net/ipv4/netfilter/nf_nat_snmp_basic.c:1158:1: warning: the frame size
of 1160 bytes is larger than 1024 bytes

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:43:58 +02:00
Florian Westphal 039b40ee58 netfilter: nf_queue: only call synchronize_net twice if nf_queue is active
nf_unregister_net_hook(s) can avoid a second call to synchronize_net,
provided there is no nfqueue active in that net namespace (which is
the common case).

This also gets rid of the extra arg to nf_queue_nf_hook_drop(), normally
this gets called during netns cleanup so no packets should be queued.

For the rare case of base chain being unregistered or module removal
while nfqueue is in use the extra hiccup due to the packet drops isn't
a big deal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:19:12 +02:00
Florian Westphal c83fa19603 netfilter: nf_log: don't call synchronize_rcu in nf_log_unset
nf_log_unregister() (which is what gets called in the logger backends
module exit paths) does a (required, module is removed) synchronize_rcu().

But nf_log_unset() is only called from pernet exit handlers. It doesn't
free any memory so there appears to be no need to call synchronize_rcu.

v2: Liping Zhang points out that nf_log_unregister() needs to be called
after pernet unregister, else rmmod would become unsafe.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:19:07 +02:00
Florian Westphal 933bd83ed6 netfilter: batch synchronize_net calls during hook unregister
synchronize_net is expensive and slows down netns cleanup a lot.

We have two APIs to unregister a hook:
nf_unregister_net_hook (which calls synchronize_net())
and
nf_unregister_net_hooks (calls nf_unregister_net_hook in a loop)

Make nf_unregister_net_hook a wapper around new helper
__nf_unregister_net_hook, which unlinks the hook but does not free it.

Then, we can call that helper in nf_unregister_net_hooks and then
call synchronize_net() only once.

Andrey Konovalov reports this change improves syzkaller fuzzing speed at
least twice.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:18:54 +02:00
Arkadi Sharshevsky 58073b32b0 net: bridge: Fix improper taking over HW learned FDB
Commit 7e26bf45e4 ("net: bridge: allow SW learn to take over HW fdb
entries") added the ability to "take over an entry which was previously
learned via HW when it shows up from a SW port".

However, if an entry was learned via HW and then a control packet
(e.g., ARP request) was trapped to the CPU, the bridge driver will
update the entry and remove the externally learned flag, although the
entry is still present in HW. Instead, only clear the externally learned
flag in case of roaming.

Fixes: 7e26bf45e4 ("net: bridge: allow SW learn to take over HW fdb entries")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Arkadi Sharashevsky <arkadis@mellanox.com>
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:46:32 -04:00
WANG Cong ba3f571d5d ipv4: get rid of ip_ra_lock
After commit 1215e51eda ("ipv4: fix a deadlock in ip_ra_control")
we always take RTNL lock for ip_ra_control() which is the only place
we update the list ip_ra_chain, so the ip_ra_lock is no longer needed.

As Eric points out, BH does not need to disable either, RCU readers
don't care.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:44:04 -04:00
Dan Carpenter 39f3709599 lwtunnel: fix error path in lwtunnel_fill_encap()
We recently added a check to see if nla_nest_start() fails.  There are
two issues with that.  First, if it fails then I don't think we should
call nla_nest_cancel().  Second, it's slightly convoluted but the
current code returns success but we should return -EMSGSIZE instead.

Fixes: a50fe0ffd7 ("lwtunnel: check return value of nla_nest_start")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:41:29 -04:00
David Howells b5082df801 net: Initialise init_net.count to 1
Initialise init_net.count to 1 for its pointer from init_nsproxy lest
someone tries to do a get_net() and a put_net() in a process in which
current->ns_proxy->net_ns points to the initial network namespace.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:32:16 -04:00
David S. Miller 4750c7be5b linux-can-next-for-4.12-20170427
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEE4bay/IylYqM/npjQHv7KIOw4HPYFAlkBofATHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRAe/sog7Dgc9qY7CACnNQLrwm8fIV72DD9+pZ1QsNwvLHCr
 8ZiHSd7m/VI/FESje4Uc3D9vKi7K6MGX9/lWi70QY+2IPrTfRTrYEO5I4U3+1czc
 zQ/NnQGsKBoAJdPLoXp0LkQVAXRSXUoKiaiU0VdktaAzGewixz9YxGHAcCqcu6oM
 1+MftSR0iCP4mkcOyVMcnipcI9L8TzEOz2f3XB/RtMiRGcEsUekAHfdpH6NSHbAq
 mG5pOwENqNn4i1ZwGg16u3p+U4YfWAF9Y+/wuev6Ob4Kla2tYnuZNEAqf9ZubY93
 NBsT3pGjcbzbbihlKvu64NUlGiloQ8AGAYvttinrfx1tnsDgvP2m6nQP
 =mHEY
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-4.12-20170427' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2017-04-25

this is a pull request of 1 patch for net-next/master.

This patch by Oliver Hartkopp fixes the build of the broad cast manager
with CONFIG_PROC_FS disabled.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:25:59 -04:00
Davide Caratti d68be71ea1 tcp: fix access to sk->sk_state in tcp_poll()
avoid direct access to sk->sk_state when tcp_poll() is called on a socket
using active TCP fastopen with deferred connect. Use local variable
'state', which stores the result of sk_state_load(), like it was done in
commit 00fd38d938 ("tcp: ensure proper barriers in lockless contexts").

Fixes: 19f6d3f3c8 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:24:16 -04:00
Eric Dumazet d1f496fd8f bpf: restore skb->sk before pskb_trim() call
While testing a fix [1] in ___pskb_trim(), addressing the WARN_ON_ONCE()
in skb_try_coalesce() reported by Andrey, I found that we had an skb
with skb->sk set but no skb->destructor.

This invalidated heuristic found in commit 158f323b98 ("net: adjust
skb->truesize in pskb_expand_head()") and in cited patch.

Considering the BUG_ON(skb->sk) we have in skb_orphan(), we should
restrain the temporary setting to a minimal section.

[1] https://patchwork.ozlabs.org/patch/755570/
    net: adjust skb->truesize in ___pskb_trim()

Fixes: 8f917bba00 ("bpf: pass sk to helper functions")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:23:16 -04:00
Marcel Holtmann 71653eb64b Bluetooth: Add selftest for ECDH key generation
Since the ECDH key generation takes a different path, it needs to be
tested as well. For this generate the public debug key from the private
debug key and compare both.

This also moves the seeding of the private key into the SMP calling code
to allow for easier re-use of the ECDH key generation helper.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2017-04-30 16:52:43 +03:00
Marcel Holtmann f958315358 Bluetooth: zero kpp input for key generation
When generating new ECDH keys with kpp, the shared secret input needs to
be set to NULL. Fix this by including kpp_request_set_input call.

Fixes: 58771c1c ("Bluetooth: convert smp and selftest to crypto kpp
API")
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2017-04-30 16:52:39 +03:00
Szymon Janc ab89f0bdd6 Bluetooth: Fix user channel for 32bit userspace on 64bit kernel
Running 32bit userspace on 64bit kernel results in MSG_CMSG_COMPAT being
defined as 0x80000000. This results in sendmsg failure if used from 32bit
userspace running on 64bit kernel. Fix this by accounting for MSG_CMSG_COMPAT
in flags check in hci_sock_sendmsg.

Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marko Kiiskila <marko@runtime.io>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
2017-04-30 12:22:14 +02:00
Salvatore Benedetto 763d9a302a Bluetooth: allocate data for kpp on heap
Bluetooth would crash when computing ECDH keys with kpp
if VMAP_STACK is enabled. Fix by allocating data passed
to kpp on heap.

Fixes: 58771c1c ("Bluetooth: convert smp and selftest to crypto kpp
API")
Signed-off-by: Salvatore Benedetto <salvatore.benedetto@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-30 12:22:05 +02:00
Linus Torvalds 0e9117882d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Just a couple more stragglers, I really hope this is it.

  1) Don't let frags slip down into the GRO segmentation handlers, from
     Steffen Klassert.

  2) Truesize under-estimation triggers warnings in TCP over loopback
     with socket filters, 2 part fix from Eric Dumazet.

  3) Fix undesirable reset of bonding MTU to ETH_HLEN on slave removal,
     from Paolo Abeni.

  4) If we flush the XFRM policy after garbage collection, it doesn't
     work because stray entries can be created afterwards. Fix from Xin
     Long.

  5) Hung socket connection fixes in TIPC from Parthasarathy Bhuvaragan.

  6) Fix GRO regression with IPSEC when netfilter is disabled, from
     Sabrina Dubroca.

  7) Fix cpsw driver Kconfig dependency regression, from Arnd Bergmann"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: hso: register netdev later to avoid a race condition
  net: adjust skb->truesize in ___pskb_trim()
  tcp: do not underestimate skb->truesize in tcp_trim_head()
  bonding: avoid defaulting hard_header_len to ETH_HLEN on slave removal
  ipv4: Don't pass IP fragments to upper layer GRO handlers.
  cpsw/netcp: refine cpts dependency
  tipc: close the connection if protocol messages contain errors
  tipc: improve error validations for sockets in CONNECTING state
  tipc: Fix missing connection request handling
  xfrm: fix GRO for !CONFIG_NETFILTER
  xfrm: do the garbage collection after flushing policy
2017-04-28 14:13:16 -07:00
Eric Dumazet c21b48cc1b net: adjust skb->truesize in ___pskb_trim()
Andrey found a way to trigger the WARN_ON_ONCE(delta < len) in
skb_try_coalesce() using syzkaller and a filter attached to a TCP
socket.

As we did recently in commit 158f323b98 ("net: adjust skb->truesize in
pskb_expand_head()") we can adjust skb->truesize from ___pskb_trim(),
via a call to skb_condense().

If all frags were freed, then skb->truesize can be recomputed.

This call can be done if skb is not yet owned, or destructor is
sock_edemux().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 16:06:47 -04:00
Eric Dumazet 7162fb242c tcp: do not underestimate skb->truesize in tcp_trim_head()
Andrey found a way to trigger the WARN_ON_ONCE(delta < len) in
skb_try_coalesce() using syzkaller and a filter attached to a TCP
socket over loopback interface.

I believe one issue with looped skbs is that tcp_trim_head() can end up
producing skb with under estimated truesize.

It hardly matters for normal conditions, since packets sent over
loopback are never truncated.

Bytes trimmed from skb->head should not change skb truesize, since
skb->head is not reallocated.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 16:05:22 -04:00
Steffen Klassert 9b83e03198 ipv4: Don't pass IP fragments to upper layer GRO handlers.
Upper layer GRO handlers can not handle IP fragments, so
exit GRO processing in this case.

This fixes ESP GRO because the packet must be reassembled
before we can decapsulate, otherwise we get authentication
failures.

It also aligns IPv4 to IPv6 where packets with fragmentation
headers are not passed to upper layer GRO handlers.

Fixes: 7785bba299 ("esp: Add a software GRO codepath")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 16:00:38 -04:00
David S. Miller cd5487fb94 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2017-04-28

Just one patch to fix a misplaced spin_unlock_bh in an error path.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 15:43:24 -04:00
David S. Miller 5577e67956 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2017-04-28

1) Do garbage collecting after a policy flush to remove old
   bundles immediately. From Xin Long.

2) Fix GRO if netfilter is not defined.
   From Sabrina Dubroca.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 15:42:11 -04:00
David S. Miller cec3819198 Another set of patches for -next:
* API support for concurrent scheduled scan requests
  * API changes for roaming reporting
  * BSS max idle support in mac80211
  * API changes for TX status reporting in mac80211
  * API changes for RX rate reporting in mac80211
  * rewrite monitor logic to prepare for BPF filters
  * bugfix for rare devices without 2.4 GHz support
  * a bugfix for recent DFS changes
  * some further cleanups
 
 The API changes are actually at a nice time, since it's
 typically quiet just before the merge window, and trees
 can be synchronized easily during it.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlkDggoACgkQa3t4Rpy0
 AB2kpQ/+PUTNPKklYvV2TVRyT71oGoE4WNzPe4v5cgpbBkk48qCOD1ncAHCo6q2s
 L90gh9yOXnjht3pobvjd0PlYHy5faq4VyDohBdyId3YlR78FYyk41KSMkp4BAKn0
 aeTI3QeA0FOsXECiagqG6pwYpMJM1nQFhFbL1AvVIf1MxBGYNgh+iVLzk2tyTGXB
 OYdJBHTjgKW3nuIRYgtUoLQaWNUlUhK+5wpypb3wkgn41DEq4sL4ay5VgNKSd0AE
 5AkCrhLbiZlR4xrxivqOuS3nNPPIDOq5imuRvMMbQDChZ/4p60l0f1VIo6MR6UAQ
 N5Cn2ReD0bV+GFVaWmpDnmdQJIoLLYHWdlX362XdVbQCFOWOfsaD9zM2j5wXMeNV
 YBCMYa7Lt52ewjz4BABsAtH4/ZFReKCDmkFOMyakA2LnWzUxxqjWourcAM7XV2Kc
 RZIcM36Xx6yhsSReOzzd/HV9CUjqF8xuKrMKd/vWxBwWiaWdywtWqhEksxi0aOvq
 LUnCqgvcIxCh9dd8ygcfNdGbpZVqPVxmPdixRQbNL50M7gXtqUwZgnHHUdExgs8E
 8sP+ua8H9RlXVGItuBFURShaV4hToJrxKw0xVjtVkpVXkqibgOIjxRv2mh+nkKXr
 tq8VWxnrKOH4nAVIyZJonoXp0Hi0vYLyt7bAwAj01CoGyZZdqUw=
 =dYTw
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2017-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Another set of patches for -next:
 * API support for concurrent scheduled scan requests
 * API changes for roaming reporting
 * BSS max idle support in mac80211
 * API changes for TX status reporting in mac80211
 * API changes for RX rate reporting in mac80211
 * rewrite monitor logic to prepare for BPF filters
 * bugfix for rare devices without 2.4 GHz support
 * a bugfix for recent DFS changes
 * some further cleanups

The API changes are actually at a nice time, since it's
typically quiet just before the merge window, and trees
can be synchronized easily during it.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 14:41:15 -04:00
Parthasarathy Bhuvaragan c1be775628 tipc: close the connection if protocol messages contain errors
When a socket is shutting down, we notify the peer node about the
connection termination by reusing an incoming message if possible.
If the last received message was a connection acknowledgment
message, we reverse this message and set the error code to
TIPC_ERR_NO_PORT and send it to peer.

In tipc_sk_proto_rcv(), we never check for message errors while
processing the connection acknowledgment or probe messages. Thus
this message performs the usual flow control accounting and leaves
the session hanging.

In this commit, we terminate the connection when we receive such
error messages.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 12:20:42 -04:00
Parthasarathy Bhuvaragan 4e0df4951e tipc: improve error validations for sockets in CONNECTING state
Until now, the checks for sockets in CONNECTING state was based on
the assumption that the incoming message was always from the
peer's accepted data socket.

However an application using a non-blocking socket sends an implicit
connect, this socket which is in CONNECTING state can receive error
messages from the peer's listening socket. As we discard these
messages, the application socket hangs as there due to inactivity.
In addition to this, there are other places where we process errors
but do not notify the user.

In this commit, we process such incoming error messages and notify
our users about them using sk_state_change().

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 12:20:42 -04:00
Parthasarathy Bhuvaragan 42b531de17 tipc: Fix missing connection request handling
In filter_connect, we use waitqueue_active() to check for any
connections to wakeup. But waitqueue_active() is missing memory
barriers while accessing the critical sections, leading to
inconsistent results.

In this commit, we replace this with an SMP safe wq_has_sleeper()
using the generic socket callback sk_data_ready().

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 12:20:42 -04:00
Pablo Neira Ayuso 1a41dbce0d Merge tag 'ipvs-fixes-for-v4.11' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs
Simon Horman says:

====================
IPVS Fixes for v4.11

I would also like it considered for stable.

* Explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled
  to avoid oops caused by IPVS accesing IPv6 routing code in such
  circumstances.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-28 16:22:11 +02:00
Dan Carpenter 7dde07e9c5 netfilter: x_tables: unlock on error in xt_find_table_lock()
According to my static checker we should unlock here before the return.
That seems reasonable to me as well.

Fixes" b9e69e1273 ("netfilter: xtables: don't hook tables by default")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-28 15:49:48 +02:00
Arend Van Spriel b34939b983 cfg80211: add request id to cfg80211_sched_scan_*() api
Have proper request id filled in the SCHED_SCAN_RESULTS and
SCHED_SCAN_STOPPED notifications toward user-space by having the
driver provide it through the api.

Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Franky Lin <franky.lin@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 14:51:43 +02:00
Avraham Stern e38a017bf0 mac80211: Add support for BSS max idle period element
Parse the BSS max idle period element and set the BSS configuration
accordingly so the driver can use this information to configure the
max idle period and to use protected management frames for keep alive
when required.

The BSS max idle period element is defined in IEEE802.11-2016,
section 9.4.2.79

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 12:28:45 +02:00
Avraham Stern 29ce6ecbb8 cfg80211: unify cfg80211_roamed() and cfg80211_roamed_bss()
cfg80211_roamed() and cfg80211_roamed_bss() take the same arguments
except that cfg80211_roamed() requires the BSSID and
cfg80211_roamed_bss() requires the bss entry.

Unify the two functions by using a struct for driver initiated
roaming information so that either the BSSID or the bss entry can be
passed as an argument to the unified function.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
[modified the ath6k, brcm80211, rndis and wlan-ng drivers accordingly]
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[modify brcmfmac to remove the useless cast, spotted by Arend]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 12:28:44 +02:00
Mohammed Shafi Shajakhan 21a8e9dd52 mac80211: Fix possible sband related NULL pointer de-reference
Existing API 'ieee80211_get_sdata_band' returns default 2 GHz band even
if the channel context configuration is NULL. This crashes for chipsets
which support 5 Ghz alone when it tries to access members of 'sband'.
Channel context configuration can be NULL in multivif case and when
channel switch is in progress (or) when it fails. Fix this by replacing
the API 'ieee80211_get_sdata_band' with  'ieee80211_get_sband' which
returns a NULL pointer for sband when the channel configuration is NULL.

An example scenario is as below:

In multivif mode (AP + STA) with drivers like ath10k, when we do a
channel switch in the AP vif (which has a number of clients connected)
and a STA vif which is connected to some other AP, when the channel
switch in AP vif fails, while the STA vifs tries to connect to the
other AP, there is a window where the channel context is NULL/invalid
and this results in a crash  while the clients connected to the AP vif
tries to reconnect and this race is very similar to the one investigated
by Michal in https://patchwork.kernel.org/patch/3788161/ and this does
happens with hardware that supports 5Ghz alone after long hours of
testing with continuous channel switch on the AP vif

ieee80211 phy0: channel context reservation cannot be finalized because
some interfaces aren't switching
wlan0: failed to finalize CSA, disconnecting
wlan0-1: deauthenticating from 8c:fd:f0:01:54:9c by local choice
	(Reason: 3=DEAUTH_LEAVING)

	WARNING: CPU: 1 PID: 19032 at net/mac80211/ieee80211_i.h:1013 sta_info_alloc+0x374/0x3fc [mac80211]
	[<bf77272c>] (sta_info_alloc [mac80211])
	[<bf78776c>] (ieee80211_add_station [mac80211]))
	[<bf73cc50>] (nl80211_new_station [cfg80211])

	Unable to handle kernel NULL pointer dereference at virtual
	address 00000014
	pgd = d5f4c000
	Internal error: Oops: 17 [#1] PREEMPT SMP ARM
	PC is at sta_info_alloc+0x380/0x3fc [mac80211]
	LR is at sta_info_alloc+0x37c/0x3fc [mac80211]
	[<bf772738>] (sta_info_alloc [mac80211])
	[<bf78776c>] (ieee80211_add_station [mac80211])
	[<bf73cc50>] (nl80211_new_station [cfg80211]))

Cc: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 12:28:44 +02:00
Paolo Abeni 1442f6f7c1 ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled
When creating a new ipvs service, ipv6 addresses are always accepted
if CONFIG_IP_VS_IPV6 is enabled. On dest creation the address family
is not explicitly checked.

This allows the user-space to configure ipvs services even if the
system is booted with ipv6.disable=1. On specific configuration, ipvs
can try to call ipv6 routing code at setup time, causing the kernel to
oops due to fib6_rules_ops being NULL.

This change addresses the issue adding a check for the ipv6
module being enabled while validating ipv6 service operations and
adding the same validation for dest operations.

According to git history, this issue is apparently present since
the introduction of ipv6 support, and the oops can be triggered
since commit 09571c7ae3 ("IPVS: Add function to determine
if IPv6 address is local")

Fixes: 09571c7ae3 ("IPVS: Add function to determine if IPv6 address is local")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2017-04-28 12:04:35 +02:00
Aaron Conole fb90e8dedb ipvs: change comparison on sync_refresh_period
The sync_refresh_period variable is unsigned, so it can never be < 0.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
2017-04-28 12:00:10 +02:00
Aaron Conole 65ba101ebc ipvs: remove unused function ip_vs_set_state_timeout
There are no in-tree callers of this function and it isn't exported.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
2017-04-28 12:00:10 +02:00
Felix Fietkau 5fe49a9d11 mac80211: add ieee80211_tx_status_ext
This allows the driver to pass in struct ieee80211_tx_status directly.
Make ieee80211_tx_status_noskb a wrapper around it.

As with ieee80211_tx_status_noskb, there is no _ni variant of this call,
because it probably won't be needed.

Even if the driver won't provide any extra status info other than what's
in struct ieee80211_tx_info already, it can optimize status reporting
this way by passing in the station pointer.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
[use C99 initializers]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 11:08:21 +02:00
Felix Fietkau eefebd3164 mac80211: move ieee80211_tx_status_noskb below ieee80211_tx_status
Makes further cleanups more readable

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 10:57:44 +02:00
Felix Fietkau 18fb84d986 mac80211: make rate control tx status API more extensible
Rename .tx_status_noskb to .tx_status_ext and pass a new on-stack
struct ieee80211_tx_status instead of struct ieee80211_tx_info.

This struct can be used to pass extra information, e.g. for dynamic tx
power control

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 10:57:33 +02:00
Johannes Berg dcba665b1f mac80211: use bitfield macros for encoded rate
Instead of hand-coding the bit manipulations, use the bitfield
macros to generate the code for the encoded bitrate.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 10:41:58 +02:00
Johannes Berg 8613c94815 mac80211: rename ieee80211_rx_status::vht_nss to just nss
This field will need to be used again for HE, so rename it now.

Again, mostly done with this spatch:

@@
expression status;
@@
-status->vht_nss
+status->nss
@@
expression status;
@@
-status.vht_nss
+status.nss

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 10:41:53 +02:00
Johannes Berg da6a4352e7 mac80211: separate encoding/bandwidth from flags
We currently use a lot of flags that are mutually incompatible,
separate this out into actual encoding and bandwidth enum values.

Much of this again done with spatch, with manual post-editing,
mostly to add the switch statements and get rid of the conversions.

@@
expression status;
@@
-status->enc_flags |= RX_ENC_FLAG_80MHZ
+status->bw = RATE_INFO_BW_80
@@
expression status;
@@
-status->enc_flags |= RX_ENC_FLAG_40MHZ
+status->bw = RATE_INFO_BW_40
@@
expression status;
@@
-status->enc_flags |= RX_ENC_FLAG_20MHZ
+status->bw = RATE_INFO_BW_20
@@
expression status;
@@
-status->enc_flags |= RX_ENC_FLAG_160MHZ
+status->bw = RATE_INFO_BW_160
@@
expression status;
@@
-status->enc_flags |= RX_ENC_FLAG_5MHZ
+status->bw = RATE_INFO_BW_5
@@
expression status;
@@
-status->enc_flags |= RX_ENC_FLAG_10MHZ
+status->bw = RATE_INFO_BW_10

@@
expression status;
@@
-status->enc_flags |= RX_ENC_FLAG_VHT
+status->encoding = RX_ENC_VHT
@@
expression status;
@@
-status->enc_flags |= RX_ENC_FLAG_HT
+status->encoding = RX_ENC_HT
@@
expression status;
@@
-status.enc_flags |= RX_ENC_FLAG_VHT
+status.encoding = RX_ENC_VHT
@@
expression status;
@@
-status.enc_flags |= RX_ENC_FLAG_HT
+status.encoding = RX_ENC_HT

@@
expression status;
@@
-(status->enc_flags & RX_ENC_FLAG_HT)
+(status->encoding == RX_ENC_HT)
@@
expression status;
@@
-(status->enc_flags & RX_ENC_FLAG_VHT)
+(status->encoding == RX_ENC_VHT)

@@
expression status;
@@
-(status->enc_flags & RX_ENC_FLAG_5MHZ)
+(status->bw == RATE_INFO_BW_5)
@@
expression status;
@@
-(status->enc_flags & RX_ENC_FLAG_10MHZ)
+(status->bw == RATE_INFO_BW_10)
@@
expression status;
@@
-(status->enc_flags & RX_ENC_FLAG_40MHZ)
+(status->bw == RATE_INFO_BW_40)
@@
expression status;
@@
-(status->enc_flags & RX_ENC_FLAG_80MHZ)
+(status->bw == RATE_INFO_BW_80)
@@
expression status;
@@
-(status->enc_flags & RX_ENC_FLAG_160MHZ)
+(status->bw == RATE_INFO_BW_160)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 10:41:45 +02:00
Johannes Berg 7fdd69c5af mac80211: clean up rate encoding bits in RX status
In preparation for adding support for HE rates, clean up
the driver report encoding for rate/bandwidth reporting
on RX frames.

Much of this patch was done with the following spatch:

@@
expression status;
@@
-status->flag & (RX_FLAG_HT | RX_FLAG_VHT)
+status->enc_flags & (RX_ENC_FLAG_HT | RX_ENC_FLAG_VHT)

@@
assignment operator op;
expression status;
@@
-status->flag op RX_FLAG_SHORTPRE
+status->enc_flags op RX_ENC_FLAG_SHORTPRE
@@
expression status;
@@
-status->flag & RX_FLAG_SHORTPRE
+status->enc_flags & RX_ENC_FLAG_SHORTPRE

@@
assignment operator op;
expression status;
@@
-status->flag op RX_FLAG_HT
+status->enc_flags op RX_ENC_FLAG_HT
@@
expression status;
@@
-status->flag & RX_FLAG_HT
+status->enc_flags & RX_ENC_FLAG_HT

@@
assignment operator op;
expression status;
@@
-status->flag op RX_FLAG_40MHZ
+status->enc_flags op RX_ENC_FLAG_40MHZ
@@
expression status;
@@
-status->flag & RX_FLAG_40MHZ
+status->enc_flags & RX_ENC_FLAG_40MHZ

@@
assignment operator op;
expression status;
@@
-status->flag op RX_FLAG_SHORT_GI
+status->enc_flags op RX_ENC_FLAG_SHORT_GI
@@
expression status;
@@
-status->flag & RX_FLAG_SHORT_GI
+status->enc_flags & RX_ENC_FLAG_SHORT_GI

@@
assignment operator op;
expression status;
@@
-status->flag op RX_FLAG_HT_GF
+status->enc_flags op RX_ENC_FLAG_HT_GF
@@
expression status;
@@
-status->flag & RX_FLAG_HT_GF
+status->enc_flags & RX_ENC_FLAG_HT_GF

@@
assignment operator op;
expression status;
@@
-status->flag op RX_FLAG_VHT
+status->enc_flags op RX_ENC_FLAG_VHT
@@
expression status;
@@
-status->flag & RX_FLAG_VHT
+status->enc_flags & RX_ENC_FLAG_VHT

@@
assignment operator op;
expression status;
@@
-status->flag op RX_FLAG_STBC_MASK
+status->enc_flags op RX_ENC_FLAG_STBC_MASK
@@
expression status;
@@
-status->flag & RX_FLAG_STBC_MASK
+status->enc_flags & RX_ENC_FLAG_STBC_MASK

@@
assignment operator op;
expression status;
@@
-status->flag op RX_FLAG_LDPC
+status->enc_flags op RX_ENC_FLAG_LDPC
@@
expression status;
@@
-status->flag & RX_FLAG_LDPC
+status->enc_flags & RX_ENC_FLAG_LDPC

@@
assignment operator op;
expression status;
@@
-status->flag op RX_FLAG_10MHZ
+status->enc_flags op RX_ENC_FLAG_10MHZ
@@
expression status;
@@
-status->flag & RX_FLAG_10MHZ
+status->enc_flags & RX_ENC_FLAG_10MHZ

@@
assignment operator op;
expression status;
@@
-status->flag op RX_FLAG_5MHZ
+status->enc_flags op RX_ENC_FLAG_5MHZ
@@
expression status;
@@
-status->flag & RX_FLAG_5MHZ
+status->enc_flags & RX_ENC_FLAG_5MHZ

@@
assignment operator op;
expression status;
@@
-status->vht_flag op RX_VHT_FLAG_80MHZ
+status->enc_flags op RX_ENC_FLAG_80MHZ
@@
expression status;
@@
-status->vht_flag & RX_VHT_FLAG_80MHZ
+status->enc_flags & RX_ENC_FLAG_80MHZ

@@
assignment operator op;
expression status;
@@
-status->vht_flag op RX_VHT_FLAG_160MHZ
+status->enc_flags op RX_ENC_FLAG_160MHZ
@@
expression status;
@@
-status->vht_flag & RX_VHT_FLAG_160MHZ
+status->enc_flags & RX_ENC_FLAG_160MHZ

@@
assignment operator op;
expression status;
@@
-status->vht_flag op RX_VHT_FLAG_BF
+status->enc_flags op RX_ENC_FLAG_BF
@@
expression status;
@@
-status->vht_flag & RX_VHT_FLAG_BF
+status->enc_flags & RX_ENC_FLAG_BF

@@
assignment operator op;
expression status, STBC;
@@
-status->flag op STBC << RX_FLAG_STBC_SHIFT
+status->enc_flags op STBC << RX_ENC_FLAG_STBC_SHIFT

@@
assignment operator op;
expression status;
@@
-status.flag op RX_FLAG_SHORTPRE
+status.enc_flags op RX_ENC_FLAG_SHORTPRE
@@
expression status;
@@
-status.flag & RX_FLAG_SHORTPRE
+status.enc_flags & RX_ENC_FLAG_SHORTPRE

@@
assignment operator op;
expression status;
@@
-status.flag op RX_FLAG_HT
+status.enc_flags op RX_ENC_FLAG_HT
@@
expression status;
@@
-status.flag & RX_FLAG_HT
+status.enc_flags & RX_ENC_FLAG_HT

@@
assignment operator op;
expression status;
@@
-status.flag op RX_FLAG_40MHZ
+status.enc_flags op RX_ENC_FLAG_40MHZ
@@
expression status;
@@
-status.flag & RX_FLAG_40MHZ
+status.enc_flags & RX_ENC_FLAG_40MHZ

@@
assignment operator op;
expression status;
@@
-status.flag op RX_FLAG_SHORT_GI
+status.enc_flags op RX_ENC_FLAG_SHORT_GI
@@
expression status;
@@
-status.flag & RX_FLAG_SHORT_GI
+status.enc_flags & RX_ENC_FLAG_SHORT_GI

@@
assignment operator op;
expression status;
@@
-status.flag op RX_FLAG_HT_GF
+status.enc_flags op RX_ENC_FLAG_HT_GF
@@
expression status;
@@
-status.flag & RX_FLAG_HT_GF
+status.enc_flags & RX_ENC_FLAG_HT_GF

@@
assignment operator op;
expression status;
@@
-status.flag op RX_FLAG_VHT
+status.enc_flags op RX_ENC_FLAG_VHT
@@
expression status;
@@
-status.flag & RX_FLAG_VHT
+status.enc_flags & RX_ENC_FLAG_VHT

@@
assignment operator op;
expression status;
@@
-status.flag op RX_FLAG_STBC_MASK
+status.enc_flags op RX_ENC_FLAG_STBC_MASK
@@
expression status;
@@
-status.flag & RX_FLAG_STBC_MASK
+status.enc_flags & RX_ENC_FLAG_STBC_MASK

@@
assignment operator op;
expression status;
@@
-status.flag op RX_FLAG_LDPC
+status.enc_flags op RX_ENC_FLAG_LDPC
@@
expression status;
@@
-status.flag & RX_FLAG_LDPC
+status.enc_flags & RX_ENC_FLAG_LDPC

@@
assignment operator op;
expression status;
@@
-status.flag op RX_FLAG_10MHZ
+status.enc_flags op RX_ENC_FLAG_10MHZ
@@
expression status;
@@
-status.flag & RX_FLAG_10MHZ
+status.enc_flags & RX_ENC_FLAG_10MHZ

@@
assignment operator op;
expression status;
@@
-status.flag op RX_FLAG_5MHZ
+status.enc_flags op RX_ENC_FLAG_5MHZ
@@
expression status;
@@
-status.flag & RX_FLAG_5MHZ
+status.enc_flags & RX_ENC_FLAG_5MHZ

@@
assignment operator op;
expression status;
@@
-status.vht_flag op RX_VHT_FLAG_80MHZ
+status.enc_flags op RX_ENC_FLAG_80MHZ
@@
expression status;
@@
-status.vht_flag & RX_VHT_FLAG_80MHZ
+status.enc_flags & RX_ENC_FLAG_80MHZ

@@
assignment operator op;
expression status;
@@
-status.vht_flag op RX_VHT_FLAG_160MHZ
+status.enc_flags op RX_ENC_FLAG_160MHZ
@@
expression status;
@@
-status.vht_flag & RX_VHT_FLAG_160MHZ
+status.enc_flags & RX_ENC_FLAG_160MHZ

@@
assignment operator op;
expression status;
@@
-status.vht_flag op RX_VHT_FLAG_BF
+status.enc_flags op RX_ENC_FLAG_BF
@@
expression status;
@@
-status.vht_flag & RX_VHT_FLAG_BF
+status.enc_flags & RX_ENC_FLAG_BF

@@
assignment operator op;
expression status, STBC;
@@
-status.flag op STBC << RX_FLAG_STBC_SHIFT
+status.enc_flags op STBC << RX_ENC_FLAG_STBC_SHIFT

@@
@@
-RX_FLAG_STBC_SHIFT
+RX_ENC_FLAG_STBC_SHIFT

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-28 10:41:38 +02:00
Trond Myklebust ed6473ddc7 NFSv4: Fix callback server shutdown
We want to use kthread_stop() in order to ensure the threads are
shut down before we tear down the nfs_callback_info in nfs_callback_down.

Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: bb6aeba736 ("NFSv4.x: Switch to using svc_set_num_threads()...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-27 18:00:16 -04:00
Trond Myklebust 9e0d87680d SUNRPC: Refactor svc_set_num_threads()
Refactor to separate out the functions of starting and stopping threads
so that they can be used in other helpers.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-27 17:59:58 -04:00
Wei Yongjun adeb45cbb5 fib_rules: fix error return code
Fix to return error code -EINVAL from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 622ec2c9d5 ("net: core: add UID to flows, rules, and routes")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:35:57 -04:00
Mike Manning 99f906e9ad bridge: add per-port broadcast flood flag
Support for l2 multicast flood control was added in commit b6cb5ac833
("net: bridge: add per-port multicast flood flag"). It allows broadcast
as it was introduced specifically for unknown multicast flood control.
But as broadcast is a special case of multicast, this may also need to
be disabled. For this purpose, introduce a flag to disable the flooding
of received l2 broadcasts. This approach is backwards compatible and
provides flexibility in filtering for the desired packet types.

Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Mike Manning <mmanning@brocade.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:34:29 -04:00
Gao Feng 06b4fc520d net: fib: Decrease one unnecessary rt cache flush in fib_disable_ip
The func fib_flush already flushes the rt cache if necessary, so it
is not necessary to invoke rt_cache_flush again in fib_disable_ip.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:33:30 -04:00
Guillaume Nault 1514dc857f l2tp: remove useless device duplication test in l2tp_eth_create()
There's no need to verify that cfg->ifname is unique at this point.
register_netdev() will return -EEXIST if asked to create a device with
a name that's alrealy in use.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:32:13 -04:00
Zhang Shengju 0575c86b5d net: remove unnecessary carrier status check
Since netif_carrier_on() will do nothing if device's carrier is already
on, so it's unnecessary to do carrier status check.

It's the same for netif_carrier_off().

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 16:31:05 -04:00
Linus Torvalds f56fc7bdaa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:

 - fix orangefs handling of faults on write() - I'd missed that one back
   when orangefs was going through review.

 - readdir counterpart of "9p: cope with bogus responses from server in
   p9_client_{read,write}" - server might be lying or broken, and we'd
   better not overrun the kmalloc'ed buffer we are copying the results
   into.

 - NFS O_DIRECT read/write can leave iov_iter advanced by too much;
   that's what had been causing iov_iter_pipe() warnings davej had been
   seeing.

 - statx_timestamp.tv_nsec type fix (s32 -> u32). That one really should
   go in before 4.11.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  uapi: change the type of struct statx_timestamp.tv_nsec to unsigned
  fix nfs O_DIRECT advancing iov_iter too much
  p9_client_readdir() fix
  orangefs_bufmap_copy_from_iovec(): fix EFAULT handling
2017-04-27 11:09:37 -07:00
Eric Dumazet 4b726e81da tcp: tcp_rack_reo_timeout() must update tp->tcp_mstamp
I wrongly assumed tp->tcp_mstamp was up to date at the time
tcp_rack_reo_timeout() was called.

It is not true, since we only update tcp->tcp_mstamp when receiving
a packet (as initially done in commit 69e996c58a ("tcp: add
tp->tcp_mstamp field")

tcp_rack_reo_timeout() being called by a timer and not an incoming
packet, we need to refresh tp->tcp_mstamp

Fixes: 7c1c730859 ("tcp: do not pass timestamp to tcp_rack_detect_loss()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27 11:46:15 -04:00
Sabrina Dubroca cfcf99f987 xfrm: fix GRO for !CONFIG_NETFILTER
In xfrm_input() when called from GRO, async == 0, and we end up
skipping the processing in xfrm4_transport_finish(). GRO path will
always skip the NF_HOOK, so we don't need the special-case for
!NETFILTER during GRO processing.

Fixes: 7785bba299 ("esp: Add a software GRO codepath")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-27 12:20:19 +02:00
Oliver Hartkopp c2701b370e can: fix CAN BCM build with CONFIG_PROC_FS disabled
The introduced namespace support moved the BCM variables for procfs into a
per-net data structure. This leads to a build failure with disabled procfs:

on x86_64:

when CONFIG_PROC_FS is not enabled:

../net/can/bcm.c:1541:14: error: 'struct netns_can' has no member named 'bcmproc_dir'
../net/can/bcm.c:1601:14: error: 'struct netns_can' has no member named 'bcmproc_dir'
../net/can/bcm.c:1696:11: error: 'struct netns_can' has no member named 'bcmproc_dir'
../net/can/bcm.c:1707:15: error: 'struct netns_can' has no member named 'bcmproc_dir'

http://marc.info/?l=linux-can&m=149321842526524&w=2

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-27 09:34:13 +02:00
David S. Miller b1513c3531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 22:39:08 -04:00
Luca Coelho e5f2e0671e mac80211: make multicast variable a bool in ieee80211_accept_frame()
The multicast variable in the ieee80211_accept_frame() function is
treated as a boolean, but defined as int.  Fix that.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-26 23:17:44 +02:00
Johannes Berg 4a19906823 mac80211: disentangle iflist_mtx and chanctx_mtx
At least on iwlwifi, sometimes lockdep complains that we can
lock
 chanctx_mtx -> mvm.mutex -> iflist_mtx
 (due to iterate_interfaces)
and
 iflist_mtx -> chanctx_mtx

Remove the latter dependency in mac80211 by using the RTNL
that we already hold in one case, and can relatively easily
achieve in the other case.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-26 23:17:44 +02:00
Emmanuel Grumbach cf147085fd mac80211: don't parse encrypted management frames in ieee80211_frame_acked
ieee80211_frame_acked is called when a frame is acked by
the peer. In case this is a management frame, we check
if this an SMPS frame, in which case we can update our
antenna configuration.

When we parse the management frame we look at the category
in case it is an action frame. That byte sits after the IV
in case the frame was encrypted. This means that if the
frame was encrypted, we basically look at the IV instead
of looking at the category. It is then theorically
possible that we think that an SMPS action frame was acked
where really we had another frame that was encrypted.

Since the only management frame whose ack needs to be
tracked is the SMPS action frame, and that frame is not
a robust management frame, it will never be encrypted.
The easiest way to fix this problem is then to not look
at frames that were encrypted.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-26 23:17:43 +02:00
Arend Van Spriel 3a3ecf1d59 cfg80211: add request id parameter to .sched_scan_stop() signature
For multiple scheduled scan support the driver needs to know which
scheduled scan request is being stopped. Pass the request id in the
.sched_scan_stop() callback.

Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Franky Lin <franky.lin@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-26 23:17:40 +02:00
Arend Van Spriel 3007e3529c nl80211: add support for BSSIDs in scheduled scan matchsets
This patch allows for the scheduled scan request to specify matchsets
for specific BSSIDs.

Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Franky Lin <franky.lin@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
[docs, netlink policy fix]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-26 23:17:39 +02:00
Arend Van Spriel ca986ad9bc nl80211: allow multiple active scheduled scan requests
This patch implements the idea to have multiple scheduled scan requests
running concurrently. It mainly illustrates how to deal with the incoming
request from user-space in terms of backward compatibility. In order to
use multiple scheduled scans user-space needs to provide a flag attribute
NL80211_ATTR_SCHED_SCAN_MULTI to indicate support. If not the request is
treated as a legacy scan.

Drivers currently supporting scheduled scan are now indicating they support
a single scheduled scan request. This obsoletes WIPHY_FLAG_SUPPORTS_SCHED_SCAN.

Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Franky Lin <franky.lin@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
[clean up netlink destroy path to avoid allocations, code cleanups]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-26 23:17:38 +02:00
Johannes Berg ab81007a7b cfg80211: simplify netlink socket owner interface deletion
There's no need to allocate a portid structure and then, for
each of those, walk the interfaces - we can just add a flag
to each interface and walk those directly. Due to padding in
the struct, we can even do it without any memory cost, and
it even simplifies the code.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-26 23:17:35 +02:00
Jamie Bainbridge 105f5528b9 ipv6: check raw payload size correctly in ioctl
In situations where an skb is paged, the transport header pointer and
tail pointer can be the same because the skb contents are in frags.

This results in ioctl(SIOCINQ/FIONREAD) incorrectly returning a
length of 0 when the length to receive is actually greater than zero.

skb->len is already correctly set in ip6_input_finish() with
pskb_pull(), so use skb->len as it always returns the correct result
for both linear and paged data.

Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:59:35 -04:00
Wei Wang c120144407 tcp: memset ca_priv data to 0 properly
Always zero out ca_priv data in tcp_assign_congestion_control() so that
ca_priv data is cleared out during socket creation.
Also always zero out ca_priv data in tcp_reinit_congestion_control() so
that when cc algorithm is changed, ca_priv data is cleared out as well.
We should still zero out ca_priv data even in TCP_CLOSE state because
user could call connect() on AF_UNSPEC to disconnect the socket and
leave it in TCP_CLOSE state and later call setsockopt() to switch cc
algorithm on this socket.

Fixes: 2b0a8c9ee ("tcp: add CDG congestion control")
Reported-by: Andrey Konovalov  <andreyknvl@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:58:32 -04:00
WANG Cong 199ab00f3c ipv6: check skb->protocol before lookup for nexthop
Andrey reported a out-of-bound access in ip6_tnl_xmit(), this
is because we use an ipv4 dst in ip6_tnl_xmit() and cast an IPv4
neigh key as an IPv6 address:

        neigh = dst_neigh_lookup(skb_dst(skb),
                                 &ipv6_hdr(skb)->daddr);
        if (!neigh)
                goto tx_err_link_failure;

        addr6 = (struct in6_addr *)&neigh->primary_key; // <=== HERE
        addr_type = ipv6_addr_type(addr6);

        if (addr_type == IPV6_ADDR_ANY)
                addr6 = &ipv6_hdr(skb)->daddr;

        memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr));

Also the network header of the skb at this point should be still IPv4
for 4in6 tunnels, we shold not just use it as IPv6 header.

This patch fixes it by checking if skb->protocol is ETH_P_IPV6: if it
is, we are safe to do the nexthop lookup using skb_dst() and
ipv6_hdr(skb)->daddr; if not (aka IPv4), we have no clue about which
dest address we can pick here, we have to rely on callers to fill it
from tunnel config, so just fall to ip6_route_output() to make the
decision.

Fixes: ea3dc9601b ("ip6_tunnel: Add support for wildcard tunnel endpoints.")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:51:26 -04:00
Myungho Jung 9899886d5e net: core: Prevent from dereferencing null pointer when releasing SKB
Added NULL check to make __dev_kfree_skb_irq consistent with kfree
family of functions.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=195289

Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:47:14 -04:00
Eric Dumazet 645f4c6f2e tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps
Some devices or distributions use HZ=100 or HZ=250

TCP receive buffer autotuning has poor behavior caused by this choice.
Since autotuning happens after 4 ms or 10 ms, short distance flows
get their receive buffer tuned to a very high value, but after an initial
period where it was frozen to (too small) initial value.

With tp->tcp_mstamp introduction, we can switch to high resolution
timestamps almost for free (at the expense of 8 additional bytes per
TCP structure)

Note that some TCP stacks use usec TCP timestamps where this
patch makes even more sense : Many TCP flows have < 500 usec RTT.
Hopefully this finer TS option can be standardized soon.

Tested:
 HZ=100 kernel
 ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &

 Peer without patch :
 lpaa24:~# ss -tmi dst lpaa23
 ...
 skmem:(r0,rb8388608,...)
 rcv_rtt:10 rcv_space:3210000 minrtt:0.017

 Peer with the patch :
 lpaa23:~# ss -tmi dst lpaa24
 ...
 skmem:(r0,rb428800,...)
 rcv_rtt:0.069 rcv_space:30000 minrtt:0.017

We can see saner RCVBUF, and more precise rcv_rtt information.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:39 -04:00
Eric Dumazet a6db50b81e tcp: remove ack_time from struct tcp_sacktag_state
It is no longer needed, everything uses tp->tcp_mstamp instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:38 -04:00
Eric Dumazet 7e0ca8a4c1 tcp: use tp->tcp_mstamp in tcp_clean_rtx_queue()
Following patch will remove ack_time from struct tcp_sacktag_state

Same info is now found in tp->tcp_mstamp

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:38 -04:00
Eric Dumazet d2329f102d tcp: do not pass timestamp to tcp_rack_advance()
No longer needed, since tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:38 -04:00
Eric Dumazet 88d5c65098 tcp: do not pass timestamp to tcp_rate_gen()
No longer needed, since tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:38 -04:00
Eric Dumazet 1317a9d69f tcp: do not pass timestamp to tcp_fastretrans_alert()
Not used anymore now tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:38 -04:00
Eric Dumazet efab8f8582 tcp: do not pass timestamp to tcp_rack_identify_loss()
Not used anymore now tp->tcp_mstamp holds the information.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:37 -04:00
Eric Dumazet 128eda86be tcp: do not pass timestamp to tcp_rack_mark_lost()
This is no longer used, since tcp_rack_detect_loss() takes
the timestamp from tp->tcp_mstamp

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:37 -04:00
Eric Dumazet 7c1c730859 tcp: do not pass timestamp to tcp_rack_detect_loss()
We can use tp->tcp_mstamp as it contains a recent timestamp.

This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:37 -04:00
Eric Dumazet 69e996c58a tcp: add tp->tcp_mstamp field
We want to use precise timestamps in TCP stack, but we do not
want to call possibly expensive kernel time services too often.

tp->tcp_mstamp is guaranteed to be updated once per incoming packet.

We will use it in the following patches, removing specific
skb_mstamp_get() calls, and removing ack_time from
struct tcp_sacktag_state.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 14:44:36 -04:00
Al Viro eea86b637a Merge branches 'uaccess.alpha', 'uaccess.arc', 'uaccess.arm', 'uaccess.arm64', 'uaccess.avr32', 'uaccess.bfin', 'uaccess.c6x', 'uaccess.cris', 'uaccess.frv', 'uaccess.h8300', 'uaccess.hexagon', 'uaccess.ia64', 'uaccess.m32r', 'uaccess.m68k', 'uaccess.metag', 'uaccess.microblaze', 'uaccess.mips', 'uaccess.mn10300', 'uaccess.nios2', 'uaccess.openrisc', 'uaccess.parisc', 'uaccess.powerpc', 'uaccess.s390', 'uaccess.score', 'uaccess.sh', 'uaccess.sparc', 'uaccess.tile', 'uaccess.um', 'uaccess.unicore32', 'uaccess.x86' and 'uaccess.xtensa' into work.uaccess 2017-04-26 12:06:59 -04:00
Xin Long 35db069121 xfrm: do the garbage collection after flushing policy
Now xfrm garbage collection can be triggered by 'ip xfrm policy del'.
These is no reason not to do it after flushing policies, especially
considering that 'garbage collection deferred' is only triggered
when it reaches gc_thresh.

It's no good that the policy is gone but the xdst still hold there.
The worse thing is that xdst->route/orig_dst is also hold and can
not be released even if the orig_dst is already expired.

This patch is to do the garbage collection if there is any policy
removed in xfrm_policy_flush.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-26 10:34:32 +02:00
Florian Westphal 9a08ecfe74 netfilter: don't attach a nat extension by default
nowadays the NAT extension only stores the interface index
(used to purge connections that got masqueraded when interface goes down)
and pptp nat information.

Previous patches moved nf_ct_nat_ext_add to those places that need it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal 2fe7c321ab netfilter: pptp: attach nat extension when needed
make sure nat extension gets added if the master conntrack is subject to
NAT.  This will be required once the nat core stops adding it by default.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal ff459018d7 netfilter: masquerade: attach nat extension if not present
Currently the nat extension is always attached as soon as nat module is
loaded.  However, most NAT uses do not need the nat extension anymore.

Prepare to remove the add-nat-by-default by making those places that need
it attach it if its not present yet.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal 22d4536d2c netfilter: conntrack: handle initial extension alloc via krealloc
krealloc(NULL, ..) is same as kmalloc(), so we can avoid special-casing
the initial allocation after the prealloc removal (we had to use
->alloc_len as the initial allocation size).

This also means we do not zero the preallocated memory anymore; only
offsets[].  Existing code makes sure the new (used) extension space gets
zeroed out.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal 23f671a1b5 netfilter: conntrack: mark extension structs as const
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal 54044b1f02 netfilter: conntrack: remove prealloc support
It was used by the nat extension, but since commit
7c96643519 ("netfilter: move nat hlist_head to nf_conn") its only needed
for connections that use MASQUERADE target or a nat helper.

Also it seems a lot easier to preallocate a fixed size instead.

With default settings, conntrack first adds ecache extension (sysctl
defaults to 1), so we get 40(ct extension header) + 24 (ecache) == 64 byte
on x86_64 for initial allocation.

Followup patches can constify the extension structs and avoid
the initial zeroing of the entire extension area.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Gao Feng 495dcb56d0 netfilter: SYNPROXY: Return NF_STOLEN instead of NF_DROP during handshaking
Current SYNPROXY codes return NF_DROP during normal TCP handshaking,
it is not friendly to caller. Because the nf_hook_slow would treat
the NF_DROP as an error, and return -EPERM.
As a result, it may cause the top caller think it meets one error.

For example, the following codes are from cfv_rx_poll()
	err = netif_receive_skb(skb);
	if (unlikely(err)) {
		++cfv->ndev->stats.rx_dropped;
	} else {
		++cfv->ndev->stats.rx_packets;
		cfv->ndev->stats.rx_bytes += skb_len;
	}
When SYNPROXY returns NF_DROP, then netif_receive_skb returns -EPERM.
As a result, the cfv driver would treat it as an error, and increase
the rx_dropped counter.

So use NF_STOLEN instead of NF_DROP now because there is no error
happened indeed, and free the skb directly.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal aee12a0a37 ebtables: remove nf_hook_register usage
Similar to ip_register_table, pass nf_hook_ops to ebt_register_table().
This allows to handle hook registration also via pernet_ops and allows
us to avoid use of legacy register_hook api.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:21 +02:00
Florian Westphal 1a0ed0ad48 netfilter: decnet: only register hooks in init namespace
looks like decnet isn't namespacified in first place, so restrict hook
registration to the initial namespace.

Prepares for eventual removal of legacy nf_register_hook() api.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:21 +02:00
Florian Westphal efe4160618 ipvs: convert to use pernet nf_hook api
nf_(un)register_hooks has to maintain an internal hook list to add/remove
those hooks from net namespaces as they are added/deleted.

ipvs already uses pernet_ops, so we can switch to the (more recent)
pernet hook api instead.

Compile tested only.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:21 +02:00
Florian Westphal 1fefe14725 netfilter: synproxy: only register hooks when needed
Defer registration of the synproxy hooks until the first SYNPROXY rule is
added.  Also means we only register hooks in namespaces that need it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:21 +02:00
Trond Myklebust 35a2442189 NFS: NFS over RDMA Client Side Changes
New Features:
 - Break RDMA connections after a connection timeout
 - Support for unloading the underlying device driver
 
 Bugfixes and cleanups:
 - Mark the receive workqueue as "read-mostly"
 - Silence warnings caused by ENOBUFS
 - Update a comment in xdr_init_decode_pages()
 - Remove rpcrdma_buffer->rb_pool.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlj/tSAACgkQ18tUv7Cl
 QOtC/Q/9ElH3UEK4uF7yc14B6LhwcX0n9Ka47CqPnil/+4aN1HK8Oa/cHd3NxeJP
 B8MOoRLVI9VrY3bLIpbzl49z+RVssgnuMDAQFqCPHja5YxzxVKnLApFUw5WyuDyl
 FXp/qZUkHyZymp3VIuY7c/iRK3KIJm7J3Ca5JxRc+x4Vu2jYBRWUt3+8cC+pPTM+
 MSGxSsjsfE0isIOKt7Z5nucx5sdlbCZXoZzrcOL2tZ4IP+5rgYk+3W+H6yg0hzCv
 P+Bv6Ce0Ye0ebcLpi+rMJlYjy4e5YbWuMqhnVrhR4NEtAJMH4NSvg5rT3iv/CKDf
 vJnSvoOSERowlDmvK3h8BAQ9u3V81u3C21xRDdiCgrIfNBvmzthNyYV9fuGZkR8Q
 BCNDKji7r+uWxhFwX+X4D1izBRTEv7PoHLQCF8WDPqU2M8dLjF+mpM7yzSRzt7pF
 8u9WGEtIr0l+YtOYRYS8UcQuDv5GVl5z5hoS/MZeifuWoJOAcfxeCtSENTK1Ftt9
 4ysM298umF7rMHRUrSnI3d3OKIwTkdMwVmZscAvNLHR2VvuwqrwGM4B0NmADK1Za
 y3/tzAgL3jKl3dTI1Ny9djBgpxSnnAwLa+I92LbiessP9woGgOYKmfruDHFzS/yp
 +4JRbjqaXB6iFtpjDre3H3CCEwrRvBZJ1TOQwc84z6xqMs8kmL0=
 =F9zg
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-4.12-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFS over RDMA Client Side Changes

New Features:
- Break RDMA connections after a connection timeout
- Support for unloading the underlying device driver

Bugfixes and cleanups:
- Mark the receive workqueue as "read-mostly"
- Silence warnings caused by ENOBUFS
- Update a comment in xdr_init_decode_pages()
- Remove rpcrdma_buffer->rb_pool.
2017-04-25 18:42:48 -04:00
Chuck Lever dadf3e435d svcrdma: Clean out old XDR encoders
Clean up: These have been replaced and are no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:56 -04:00
Chuck Lever 2cf32924c6 svcrdma: Remove the req_map cache
req_maps are no longer used by the send path and can thus be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever 68cc4636bb svcrdma: Remove unused RDMA Write completion handler
Clean up. All RDMA Write completions are now handled by
svc_rdma_wc_write_ctx.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever ded8d19641 svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
The sge array in struct svc_rdma_op_ctxt is no longer used for
sending RDMA Write WRs. It need only accommodate the construction of
Send and Receive WRs. The maximum inline size is the largest payload
it needs to handle now.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever f5821c76b2 svcrdma: Clean up RPC-over-RDMA backchannel reply processing
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever 4757d90b15 svcrdma: Report Write/Reply chunk overruns
Observed at Connectathon 2017.

If a client has underestimated the size of a Write or Reply chunk,
the Linux server writes as much payload data as it can, then it
recognizes there was a problem and closes the connection without
sending the transport header.

This creates a couple of problems:

<> The client never receives indication of the server-side failure,
   so it continues to retransmit the bad RPC. Forward progress on
   the transport is blocked.

<> The reply payload pages are not moved out of the svc_rqst, thus
   they can be released by the RPC server before the RDMA Writes
   have completed.

The new rdma_rw-ized helpers return a distinct error code when a
Write/Reply chunk overrun occurs, so it's now easy for the caller
(svc_rdma_sendto) to recognize this case.

Instead of dropping the connection, post an RDMA_ERROR message. The
client now sees an RDMA_ERROR and can properly terminate the RPC
transaction.

As part of the new logic, set up the same delayed release for these
payload pages as would have occurred in the normal case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever 6b19cc5ca2 svcrdma: Clean up RDMA_ERROR path
Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can
be refactored to reduce code duplication and remove C structure-
based XDR encoding. It is also relocated to the source file that
contains its only caller.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever 9a6a180b78 svcrdma: Use rdma_rw API in RPC reply path
The current svcrdma sendto code path posts one RDMA Write WR at a
time. Each of these Writes typically carries a small number of pages
(for instance, up to 30 pages for mlx4 devices). That means a 1MB
NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
and one for the Send WR carrying the actual RPC Reply message.

Instead, use the new rdma_rw API. The details of Write WR chain
construction and memory registration are taken care of in the RDMA
core. svcrdma can focus on the details of the RPC-over-RDMA
protocol. This gives three main benefits:

1. All Write WRs for one RDMA segment are posted in a single chain.
As few as one ib_post_send() for each Write chunk.

2. The Write path can now use FRWR to register the Write buffers.
If the device's maximum page list depth is large, this means a
single Write WR is needed for each RPC's Write chunk data.

3. The new code introduces support for RPCs that carry both a Write
list and a Reply chunk. This combination can be used for an NFSv4
READ where the data payload is large, and thus is removed from the
Payload Stream, but the Payload Stream is still larger than the
inline threshold.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever f13193f50b svcrdma: Introduce local rdma_rw API helpers
The plan is to replace the local bespoke code that constructs and
posts RDMA Read and Write Work Requests with calls to the rdma_rw
API. This shares code with other RDMA-enabled ULPs that manages the
gory details of buffer registration and posting Work Requests.

Some design notes:

 o The structure of RPC-over-RDMA transport headers is flexible,
   allowing multiple segments per Reply with arbitrary alignment,
   each with a unique R_key. Write and Send WRs continue to be
   built and posted in separate code paths. However, one whole
   chunk (with one or more RDMA segments apiece) gets exactly
   one ib_post_send and one work completion.

 o svc_xprt reference counting is modified, since a chain of
   rdma_rw_ctx structs generates one completion, no matter how
   many Write WRs are posted.

 o The current code builds the transport header as it is construct-
   ing Write WRs. I've replaced that with marshaling of transport
   header data items in a separate step. This is because the exact
   structure of client-provided segments may not align with the
   components of the server's reply xdr_buf, or the pages in the
   page list. Thus parts of each client-provided segment may be
   written at different points in the send path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:55 -04:00
Chuck Lever c238c4c034 svcrdma: Clean up svc_rdma_get_inv_rkey()
Replace C structure-based XDR decoding with more portable code that
instead uses pointer arithmetic.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
Chuck Lever c55ab0707b svcrdma: Add helper to save pages under I/O
Clean up: extract the logic to save pages under I/O into a helper to
add a big documenting comment without adding clutter in the send
path.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
Chuck Lever b623589dba svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
The Send Queue depth is temporarily reduced to 1 SQE per credit. The
new rdma_rw API does an internal computation, during QP creation, to
increase the depth of the Send Queue to handle RDMA Read and Write
operations.

This change has to come before the NFSD code paths are updated to
use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
the size of the SQ too much, resulting in memory allocation failures
during QP creation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
Chuck Lever 6e6092ca30 svcrdma: Add svc_rdma_map_reply_hdr()
Introduce a helper to DMA-map a reply's transport header before
sending it. This will in part replace the map vector cache.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
Chuck Lever 17f5f7f506 svcrdma: Move send_wr to svc_rdma_op_ctxt
Clean up: Move the ib_send_wr off the stack, and move common code
to post a Send Work Request into a helper.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
Chuck Lever 2be1fce95e xprtrdma: Remove rpcrdma_buffer::rb_pool
Since commit 1e465fd4ff ("xprtrdma: Replace send and receive
arrays"), this field is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:35 -04:00
Chuck Lever 7ecce75fc3 sunrpc: Fix xdr_init_decode_pages() documenting comment
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:34 -04:00
Chuck Lever 0031e47c76 xprtrdma: Squelch ENOBUFS warnings
When ro_map is out of buffers, that's not a permanent error, so
don't report a problem.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:33 -04:00
Chuck Lever 7d7fa9b550 xprtrdma: Annotate receive workqueue
Micro-optimize the receive workqueue by marking it's anchor "read-
mostly."

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:31 -04:00
Chuck Lever 56a6bd154d xprtrdma: Revert commit d0f36c46de
Device removal is now adequately supported. Pinning the underlying
device driver to prevent removal while an NFS mount is active is no
longer necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:30 -04:00
Chuck Lever a9b0e381ca xprtrdma: Restore transport after device removal
After a device removal, enable the transport connect worker to
restore normal operation if there is another device with
connectivity to the server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:28 -04:00
Chuck Lever 1890896b4e xprtrdma: Refactor rpcrdma_ep_connect
I'm about to add another arm to

    if (ep->rep_connected != 0)

It will be cleaner to use a switch statement here. We'll be looking
for a couple of specific errnos, or "anything else," basically to
sort out the difference between a normal reconnect and recovery from
device removal.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:27 -04:00
Chuck Lever bebd031866 xprtrdma: Support unplugging an HCA from under an NFS mount
The device driver for the underlying physical device associated
with an RPC-over-RDMA transport can be removed while RPC-over-RDMA
transports are still in use (ie, while NFS filesystems are still
mounted and active). The IB core performs a connection event upcall
to request that consumers free all RDMA resources associated with
a transport.

There may be pending RPCs when this occurs. Care must be taken to
release associated resources without leaving references that can
trigger a subsequent crash if a signal or soft timeout occurs. We
rely on the caller of the transport's ->close method to ensure that
the previous RPC task has invoked xprt_release but the transport
remains write-locked.

A DEVICE_REMOVE upcall forces a disconnect then sleeps. When ->close
is invoked, it destroys the transport's H/W resources, then wakes
the upcall, which completes and allows the core driver unload to
continue.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=266
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:24 -04:00
Chuck Lever 91a10c5297 xprtrdma: Use same device when mapping or syncing DMA buffers
When the underlying device driver is reloaded, ia->ri_device will be
replaced. All cached copies of that device pointer have to be
updated as well.

Commit 54cbd6b0c6 ("xprtrdma: Delay DMA mapping Send and Receive
buffers") added the rg_device field to each regbuf. As part of
handling a device removal, rpcrdma_dma_unmap_regbuf is invoked on
all regbufs for a transport.

Simply calling rpcrdma_dma_map_regbuf for each Receive buffer after
the driver has been reloaded should reinitialize rg_device correctly
for every case except rpcrdma_wc_receive, which still uses
rpcrdma_rep::rr_device.

Ensure the same device that was used to map a Receive buffer is also
used to sync it in rpcrdma_wc_receive by using rg_device there
instead of rr_device.

This is the only use of rr_device, so it can be removed.

The use of regbufs in the send path is also updated, for
completeness.

Fixes: 54cbd6b0c6 ("xprtrdma: Delay DMA mapping Send and ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:22 -04:00
Chuck Lever fff09594ed xprtrdma: Refactor rpcrdma_ia_open()
In order to unload a device driver and reload it, xprtrdma will need
to close a transport's interface adapter, and then call
rpcrdma_ia_open again, possibly finding a different interface
adapter.

Make rpcrdma_ia_open safe to call on the same transport multiple
times.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:20 -04:00
Chuck Lever 33849792cb xprtrdma: Detect unreachable NFS/RDMA servers more reliably
Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.

When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.

However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.

In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.

To address this, forcibly disconnect a transport when an RPC times
out. This prevents moribund connections from stopping the
detection of failover or other configuration changes on the server.

Note that even if the connection is still good, retransmitting
any RPC will trigger a disconnect thanks to this logic in
xprt_rdma_send_request:

	/* Must suppress retransmit to maintain credits */
	if (req->rl_connect_cookie == xprt->connect_cookie)
		goto drop_connection;
	req->rl_connect_cookie = xprt->connect_cookie;

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:19 -04:00
Chuck Lever e2a4f4fbef sunrpc: Export xprt_force_disconnect()
xprt_force_disconnect() is already invoked from the socket
transport. I want to invoke xprt_force_disconnect() from the
RPC-over-RDMA transport, which is a separate module from sunrpc.ko.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:16 -04:00
Chuck Lever 9378b274e1 xprtrdma: Cancel refresh worker during buffer shutdown
Trying to create MRs while the transport is being torn down can
cause a crash.

Fixes: e2ac236c0b ("xprtrdma: Allocate MRs on demand")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 16:12:14 -04:00
Johannes Berg 127f60bfa9 mac80211: rewrite monitor mode delivery logic
The monitor mode delivery logic makes it hard to add any
kind of filtering in an efficient way, because the monitor
SKB is created first and then passed to all interfaces.

Rewrite the logic to create the monitor SKB the first time
it's actually needed, and then keep delivering it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-25 21:43:37 +02:00
Vasanthakumar Thiagarajan cd50ac0f31 cfg80211: Fix dfs state propagation for non-DFS center channel
When part of a bigger bandwidth (160 MHz) channel falls in DFS
channel range it is possible that the  center frequency may not
necessarily be a radar channel. Remove the sanity check on channel
flag for IEEE80211_CHAN_RADAR in regulatory_propagate_dfs_state(),
this should fix the dfs state propagation for non-DFS center freq
which has DFS channels in it's bandwidth, should also fix unnecessary
WARN_ON() spam in regulatory_propagate_dfs_state().

Fixes: 8976672736 ("cfg80211: Share Channel DFS state across wiphys of same DFS domain")
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Vasanthakumar Thiagarajan <vthiagar@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-25 21:42:52 +02:00
Alexander Potapenko fd2c83b357 net/packet: check length in getsockopt() called with PACKET_HDRLEN
In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
|val| remains uninitialized and the syscall may behave differently
depending on its value, and even copy garbage to userspace on certain
architectures. To fix this we now return -EINVAL if optlen is too small.

This bug has been detected with KMSAN.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 14:05:52 -04:00
David Ahern 8048ced9be net: ipv6: regenerate host route if moved to gc list
Taking down the loopback device wreaks havoc on IPv6 routing. By
extension, taking down a VRF device wreaks havoc on its table.

Dmitry and Andrey both reported heap out-of-bounds reports in the IPv6
FIB code while running syzkaller fuzzer. The root cause is a dead dst
that is on the garbage list gets reinserted into the IPv6 FIB. While on
the gc (or perhaps when it gets added to the gc list) the dst->next is
set to an IPv4 dst. A subsequent walk of the ipv6 tables causes the
out-of-bounds access.

Andrey's reproducer was the key to getting to the bottom of this.

With IPv6, host routes for an address have the dst->dev set to the
loopback device. When the 'lo' device is taken down, rt6_ifdown initiates
a walk of the fib evicting routes with the 'lo' device which means all
host routes are removed. That process moves the dst which is attached to
an inet6_ifaddr to the gc list and marks it as dead.

The recent change to keep global IPv6 addresses added a new function,
fixup_permanent_addr, that is called on admin up. That function restarts
dad for an inet6_ifaddr and when it completes the host route attached
to it is inserted into the fib. Since the route was marked dead and
moved to the gc list, re-inserting the route causes the reported
out-of-bounds accesses. If the device with the address is taken down
or the address is removed, the WARN_ON in fib6_del is triggered.

All of those faults are fixed by regenerating the host route if the
existing one has been moved to the gc list, something that can be
determined by checking if the rt6i_ref counter is 0.

Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 14:04:44 -04:00
Xin Long b1b9d36602 bridge: move bridge multicast cleanup to ndo_uninit
During removing a bridge device, if the bridge is still up, a new mdb entry
still can be added in br_multicast_add_group() after all mdb entries are
removed in br_multicast_dev_del(). Like the path:

  mld_ifc_timer_expire ->
    mld_sendpack -> ...
      br_multicast_rcv ->
        br_multicast_add_group

The new mp's timer will be set up. If the timer expires after the bridge
is freed, it may cause use-after-free panic in br_multicast_group_expired.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
IP: [<ffffffffa07ed2c8>] br_multicast_group_expired+0x28/0xb0 [bridge]
Call Trace:
 <IRQ>
 [<ffffffff81094536>] call_timer_fn+0x36/0x110
 [<ffffffffa07ed2a0>] ? br_mdb_free+0x30/0x30 [bridge]
 [<ffffffff81096967>] run_timer_softirq+0x237/0x340
 [<ffffffff8108dcbf>] __do_softirq+0xef/0x280
 [<ffffffff8169889c>] call_softirq+0x1c/0x30
 [<ffffffff8102c275>] do_softirq+0x65/0xa0
 [<ffffffff8108e055>] irq_exit+0x115/0x120
 [<ffffffff81699515>] smp_apic_timer_interrupt+0x45/0x60
 [<ffffffff81697a5d>] apic_timer_interrupt+0x6d/0x80

Nikolay also found it would cause a memory leak - the mdb hash is
reallocated and not freed due to the mdb rehash.

unreferenced object 0xffff8800540ba800 (size 2048):
  backtrace:
    [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
    [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
    [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
    [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
    [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
    [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
    [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
    [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
    [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
    [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
    [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
    [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
    [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
    [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
    [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
    [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]

This could happen when ip link remove a bridge or destroy a netns with a
bridge device inside.

With Nikolay's suggestion, this patch is to clean up bridge multicast in
ndo_uninit after bridge dev is shutdown, instead of br_dev_delete, so
that netif_running check in br_multicast_add_group can avoid this issue.

v1->v2:
  - fix this issue by moving br_multicast_dev_del to ndo_uninit, instead
    of calling dev_close in br_dev_delete.

(NOTE: Depends upon b6fe0440c6 ("bridge: implement missing ndo_uninit()"))

Fixes: e10177abf8 ("bridge: multicast: fix handling of temp and perm entries")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 14:02:39 -04:00
Sabrina Dubroca ec9c4215fe ipv6: fix source routing
Commit a149e7c7ce ("ipv6: sr: add support for SRH injection through
setsockopt") introduced handling of IPV6_SRCRT_TYPE_4, but at the same
time restricted it to only IPV6_SRCRT_TYPE_0 and
IPV6_SRCRT_TYPE_4. Previously, ipv6_push_exthdr() and fl6_update_dst()
would also handle other values (ie STRICT and TYPE_2).

Restore previous source routing behavior, by handling IPV6_SRCRT_STRICT
and IPV6_SRCRT_TYPE_2 the same way as IPV6_SRCRT_TYPE_0 in
ipv6_push_exthdr() and fl6_update_dst().

Fixes: a149e7c7ce ("ipv6: sr: add support for SRH injection through setsockopt")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 13:59:24 -04:00
David S. Miller b5cdae3291 net: Generic XDP
This provides a generic SKB based non-optimized XDP path which is used
if either the driver lacks a specific XDP implementation, or the user
requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.

It is arguable that perhaps I should have required something like
this as part of the initial XDP feature merge.

I believe this is critical for two reasons:

1) Accessibility.  More people can play with XDP with less
   dependencies.  Yes I know we have XDP support in virtio_net, but
   that just creates another depedency for learning how to use this
   facility.

   I wrote this to make life easier for the XDP newbies.

2) As a model for what the expected semantics are.  If there is a pure
   generic core implementation, it serves as a semantic example for
   driver folks adding XDP support.

One thing I have not tried to address here is the issue of
XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
whatever even if the XDP program doesn't try to push headers at all.
I think we really need the verifier to somehow propagate whether
certain XDP helpers are used or not.

v5:
 - Handle both negative and positive offset after running prog
 - Fix mac length in XDP_TX case (Alexei)
 - Use rcu_dereference_protected() in free_netdev (kbuild test robot)

v4:
 - Fix MAC header adjustmnet before calling prog (David Ahern)
 - Disable LRO when generic XDP is installed (Michael Chan)
 - Bypass qdisc et al. on XDP_TX and record the event (Alexei)
 - Do not perform generic XDP on reinjected packets (DaveM)

v3:
 - Make sure XDP program sees packet at MAC header, push back MAC
   header if we do XDP_TX.  (Alexei)
 - Elide GRO when generic XDP is in use.  (Alexei)
 - Add XDP_FLAG_SKB_MODE flag which the user can use to request generic
   XDP even if the driver has an XDP implementation.  (Alexei)
 - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS
   attribute.  (Daniel)

v2:
 - Add some "fall through" comments in switch statements based
   upon feedback from Andrew Lunn
 - Use RCU for generic xdp_prog, thanks to Johannes Berg.

Tested-by: Andy Gospodarek <andy@greyhouse.net>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 13:33:49 -04:00
Parthasarathy Bhuvaragan 05ff837897 tipc: fix socket flow control accounting error at tipc_recv_stream
Until now in tipc_recv_stream(), we update the received
unacknowledged bytes based on a stack variable and not based on the
actual message size.
If the user buffer passed at tipc_recv_stream() is smaller than the
received skb, the size variable in stack differs from the actual
message size in the skb. This leads to a flow control accounting
error causing permanent congestion.

In this commit, we fix this accounting error by always using the
size of the incoming message.

Fixes: 10724cc7bb ("tipc: redesign connection-level flow control")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:45:38 -04:00
Parthasarathy Bhuvaragan 3364d61c92 tipc: fix socket flow control accounting error at tipc_send_stream
Until now in tipc_send_stream(), we return -1 when the socket
encounters link congestion even if the socket had successfully
sent partial data. This is incorrect as the application resends
the same the partial data leading to data corruption at
receiver's end.

In this commit, we return the partially sent bytes as the return
value at link congestion.

Fixes: 10724cc7bb ("tipc: redesign connection-level flow control")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:45:37 -04:00
Paolo Abeni b7d6df5751 ipv6: move stub initialization after ipv6 setup completion
The ipv6 stub pointer is currently initialized before the ipv6
routing subsystem: a 3rd party can access and use such stub
before the routing data is ready.
Moreover, such pointer is not cleared in case of initialization
error, possibly leading to dangling pointers usage.

This change addresses the above moving the stub initialization
at the end of ipv6 init code.

Fixes: 5f81bd2e5d ("ipv6: export a stub for IPv6 symbols used by vxlan")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:43:16 -04:00
Guillaume Nault a485c2b877 l2tp: define "l2tpeth" device type
Export type of l2tpeth interfaces to userspace
(/sys/class/net/<iface>/uevent).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:41:56 -04:00
Guillaume Nault c39855febc l2tp: set name_assign_type for devices created by l2tp_eth.c
Export naming scheme used when creating l2tpeth interfaces
(/sys/class/net/<iface>/name_assign_type). This let userspace know if
the device's name has been generated automatically or defined manually.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:41:56 -04:00
Jamal Hadi Salim e0ee84ded7 net sched actions: Complete the JUMPX opcode
per discussion at netconf/netdev:
When we have an action that is capable of branching (example a policer),
we can achieve a continuation of the action graph by programming a
"continue" where we find an exact replica of the same filter rule with a lower
priority and the remainder of the action graph. When you have 100s of thousands
of filters which require such a feature it gets very inefficient to do two
lookups.

This patch completes a leftover feature of action codes. Its time has come.

Example below where a user labels packets with a different skbmark on ingress
of a port depending on whether they have/not exceeded the configured rate.
This mark is then used to make further decisions on some egress port.

 #rate control, very low so we can easily see the effect
sudo $TC actions add action police rate 1kbit burst 90k \
conform-exceed pipe/jump 2 index 10
 # skbedit index 11 will be used if the user conforms
sudo $TC actions add action skbedit mark 11 ok index 11
 # skbedit index 12 will be used if the user does not conform
sudo $TC actions add action skbedit mark 12 ok index 12

 #lets bind the user ..
sudo $TC filter add dev $ETH parent ffff: protocol ip prio 8 u32 \
match ip dst 127.0.0.8/32 flowid 1:10 \
action police index 10 \
action skbedit index 11 \
action skbedit index 12

 #run a ping -f and see what happens..
 #
jhs@foobar:~$ sudo $TC -s filter ls dev $ETH parent ffff: protocol ip
filter pref 8 u32
filter pref 8 u32 fh 800: ht divisor 1
filter pref 8 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 2800 success 1005)
  match 7f000008/ffffffff at 16 (success 1005 )
	action order 1:  police 0xa rate 1Kbit burst 23440b mtu 2Kb action pipe/jump 2 overhead 0b
	ref 2 bind 1 installed 207 sec used 122 sec
	Action statistics:
	Sent 84420 bytes 1005 pkt (dropped 0, overlimits 721 requeues 0)
	backlog 0b 0p requeues 0

	action order 2:  skbedit mark 11 pass
	 index 11 ref 2 bind 1 installed 204 sec used 122 sec
 	Action statistics:
	Sent 60564 bytes 721 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

	action order 3:  skbedit mark 12 pass
	 index 12 ref 2 bind 1 installed 201 sec used 122 sec
 	Action statistics:
	Sent 23856 bytes 284 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

Not bad, about 28% non-conforming packets..

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:30:06 -04:00
Dave Johnson 9dd2ab609e netfilter: Wrong icmp6 checksum for ICMPV6_TIME_EXCEED in reverse SNATv6 path
When recalculating the outer ICMPv6 checksum for a reverse path NATv6
such as ICMPV6_TIME_EXCEED nf_nat_icmpv6_reply_translation() was
accessing data beyond the headlen of the skb for non-linear skb.  This
resulted in incorrect ICMPv6 checksum as garbage data was used.

Patch replaces csum_partial() with skb_checksum() which supports
non-linear skbs similar to nf_nat_icmp_reply_translation() from ipv4.

Signed-off-by: Dave Johnson <dave-kernel@centerclick.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-25 11:10:38 +02:00
Liping Zhang 277a292835 netfilter: nft_dynset: continue to next expr if _OP_ADD succeeded
Currently, after adding the following nft rules:
  # nft add set x target1 { type ipv4_addr \; flags timeout \;}
  # nft add rule x y set add ip daddr timeout 1d @target1 counter

the counters will always be zero despite of the elements are added
to the dynamic set "target1" or not, as we will break the nft expr
traversal unconditionally:
  # nft list ruleset
  ...
  set target1 {
      ...
      elements = { 8.8.8.8 expires 23h59m53s}
  }
  chain output {
      ...
      set add ip daddr timeout 1d @target1 counter packets 0 bytes 0
                                                           ^       ^
      ...
  }

Since we add the elements to the set successfully, we should continue
to the next expression.

Additionally, if elements are added to "flow table" successfully, we
will _always_ continue to the next expr, even if the operation is
_OP_ADD. So it's better to keep them to be consistent.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Reported-by: Robert White <rwhite@pobox.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-25 11:10:37 +02:00
Linus Lüssing cf3cb246e2 bridge: ebtables: fix reception of frames DNAT-ed to bridge device/port
When trying to redirect bridged frames to the bridge device itself or
a bridge port (brouting) via the dnat target then this currently fails:

The ethernet destination of the frame is dnat'ed to the MAC address of
the bridge device or port just fine. However, the IP code drops it in
the beginning of ip_input.c/ip_rcv() as the dnat target left
the skb->pkt_type as PACKET_OTHERHOST.

Fixing this by resetting skb->pkt_type to an appropriate type after
dnat'ing.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-25 11:08:31 +02:00
Oliver Hartkopp 1ef83310b8 can: network namespace support for CAN gateway
The CAN gateway was not implemented as per-net in the initial network
namespace support by Mario Kicherer (8e8cda6d73).
This patch enables the CAN gateway to be used in different namespaces.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25 09:04:30 +02:00
Oliver Hartkopp 384317ef41 can: network namespace support for CAN_BCM protocol
The CAN_BCM protocol and its procfs entries were not implemented as per-net
in the initial network namespace support by Mario Kicherer (8e8cda6d73).
This patch adds the missing per-net functionality for the CAN BCM.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25 09:04:29 +02:00
Oliver Hartkopp cb5635a367 can: complete initial namespace support
The statistics and its proc output was not implemented as per-net in the
initial network namespace support by Mario Kicherer (8e8cda6d73).
This patch adds the missing per-net statistics for the CAN subsystem.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25 09:04:29 +02:00
Oliver Hartkopp f2e72f43e7 can: remove obsolete definitions
can_rx_alldev_list is a per-net data structure now. Remove it's definition
here and can_rx_dev_list too.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25 09:04:28 +02:00
Oliver Hartkopp 48452c169d can: remove obsolete pernet_operations definitions
The namespace support for the CAN subsystem does not need any additional
memory. So when ".size = 0" there's no extra memory allocated by the system.
And therefore ".id" is obsolete too.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25 09:04:28 +02:00
Oliver Hartkopp a7bbd28f04 can: fix memory leak in initial namespace support
The can_rx_alldev_list is a per-net data structure now and allocated in
can_pernet_init(). Make sure the memory is free'd in can_pernet_exit() too.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25 09:04:27 +02:00
Salvatore Benedetto 58771c1cb0 Bluetooth: convert smp and selftest to crypto kpp API
* Convert both smp and selftest to crypto kpp API
* Remove module ecc as no more required
* Add ecdh_helper functions for wrapping kpp async calls

This patch has been tested *only* with selftest, which is called on
module loading.

Signed-off-by: Salvatore Benedetto <salvatore.benedetto@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-25 04:53:42 +02:00
Pan Bian 78302fd405 tipc: check return value of nlmsg_new
Function nlmsg_new() will return a NULL pointer if there is no enough
memory, and its return value should be checked before it is used.
However, in function tipc_nl_node_get_monitor(), the validation of the
return value of function nlmsg_new() is missed. This patch fixes the
bug.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 15:51:30 -04:00
Pan Bian a50fe0ffd7 lwtunnel: check return value of nla_nest_start
Function nla_nest_start() may return a NULL pointer on error. However,
in function lwtunnel_fill_encap(), the return value of nla_nest_start()
is not validated before it is used. This patch checks the return value
of nla_nest_start() against NULL.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 15:51:30 -04:00
Benjamin LaHaise a577d8f793 cls_flower: add support for matching MPLS fields (v2)
Add support to the tc flower classifier to match based on fields in MPLS
labels (TTL, Bottom of Stack, TC field, Label).

Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Gao Feng <fgao@ikuai8.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 14:30:46 -04:00
Benjamin LaHaise 029c1ecbb2 flow_dissector: add mpls support (v2)
Add support for parsing MPLS flows to the flow dissector in preparation for
adding MPLS match support to cls_flower.

Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Eric Dumazet <jhs@mojatatu.com>
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Gao Feng <fgao@ikuai8.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 14:30:46 -04:00
Wei Wang 59450f8d83 net/tcp_fastopen: Remove mss check in tcp_write_timeout()
Christoph Paasch from Apple found another firewall issue for TFO:
After successful 3WHS using TFO, server and client starts to exchange
data. Afterwards, a 10s idle time occurs on this connection. After that,
firewall starts to drop every packet on this connection.

The fix for this issue is to extend existing firewall blackhole detection
logic in tcp_write_timeout() by removing the mss check.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 14:27:17 -04:00
Wei Wang 46c2fa3987 net/tcp_fastopen: Add snmp counter for blackhole detection
This counter records the number of times the firewall blackhole issue is
detected and active TFO is disabled.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 14:27:17 -04:00
Wei Wang cf1ef3f071 net/tcp_fastopen: Disable active side TFO in certain scenarios
Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X     S
		[retry and timeout]

https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C?  X <- data ------ S
		[retry and timeout]

This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.

The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST

We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO sockets not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().

The rational behind it is that when such firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.

Also, add a debug sysctl:
  tcp_fastopen_blackhole_detect_timeout_sec:
    the initial timeout to use when firewall blackhole issue happens.
    This can be set and read.
    When setting it to 0, it means to disable the active disable logic.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 14:27:17 -04:00
David S. Miller bc95cd8e8b mlx5-updates-2017-04-22
Sparse and compiler warnings fixes from Stephen Hemminger.
 
 From Roi Dayan and Or Gerlitz, Add devlink and mlx5 support for controlling
 E-Switch encapsulation mode, this knob will enable HW support for applying
 encapsulation/decapsulation to VF traffic as part of SRIOV e-switch offloading.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJY+5cRAAoJEEg/ir3gV/o+5c8H/1/khPzy26B2lWyjPC8CRCQF
 eSd0tiHLgIqbZTbnIHTR+NbZ/SUFaukoJi8OKn1fGFHCCajWvPP4xkENVKrUdi3q
 kOgNZb/R1V0j6SdELyoMalFPjAscTgdmwYMnry+vcjOxJ+H2uUTnMKXwFf8IsBjz
 EINy8oZ5jZcejmft0c2O5HN4Bt/7U5ttM3CroAdcvPT9lq2DFJL2uCABhTO/1DdY
 b7uVa47FnkqxX19Ebn7fjp5r3diGYOmCPMjdC89C//rbkLB8FN61EkcSLpGY3YNm
 djmCPQ+xaa3ielmBpOk3AMayFEtYW0nDMj9eWECVByadRQZ2qz9wTVXBp5CX9zg=
 =E3Jt
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2017-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2017-04-22

Sparse and compiler warnings fixes from Stephen Hemminger.

From Roi Dayan and Or Gerlitz, Add devlink and mlx5 support for controlling
E-Switch encapsulation mode, this knob will enable HW support for applying
encapsulation/decapsulation to VF traffic as part of SRIOV e-switch offloading.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 14:11:10 -04:00
David Ahern 58c4c6a3f7 net: add rcu locking when changing early demux
systemd-sysctl is triggering a suspicious RCU usage message when
net.ipv4.tcp_early_demux or net.ipv4.udp_early_demux is changed via
a sysctl config file:

[   33.896184] ===============================
[   33.899558] [ ERR: suspicious RCU usage.  ]
[   33.900624] 4.11.0-rc7+ #104 Not tainted
[   33.901698] -------------------------------
[   33.903059] /home/dsa/kernel-2.git/net/ipv4/sysctl_net_ipv4.c:305 suspicious rcu_dereference_check() usage!
[   33.905724]
other info that might help us debug this:

[   33.907656]
rcu_scheduler_active = 2, debug_locks = 0
[   33.909288] 1 lock held by systemd-sysctl/143:
[   33.910373]  #0:  (sb_writers#5){.+.+.+}, at: [<ffffffff8123a370>] file_start_write+0x45/0x48
[   33.912407]
stack backtrace:
[   33.914018] CPU: 0 PID: 143 Comm: systemd-sysctl Not tainted 4.11.0-rc7+ #104
[   33.915631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[   33.917870] Call Trace:
[   33.918431]  dump_stack+0x81/0xb6
[   33.919241]  lockdep_rcu_suspicious+0x10f/0x118
[   33.920263]  proc_configure_early_demux+0x65/0x10a
[   33.921391]  proc_udp_early_demux+0x3a/0x41

add rcu locking to proc_configure_early_demux.

Fixes: dddb64bcb3 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 14:08:19 -04:00
David Ahern fc1f8f4f31 net: ipv6: send unsolicited NA if enabled for all interfaces
When arp_notify is set to 1 for either a specific interface or for 'all'
interfaces, gratuitous arp requests are sent. Since ndisc_notify is the
ipv6 equivalent to arp_notify, it should follow the same semantics.
Commit 4a6e3c5def ("net: ipv6: send unsolicited NA on admin up") sends
the NA on admin up. The final piece is checking devconf_all->ndisc_notify
in addition to the per device setting. Add it.

Fixes: 5cb04436ee ("ipv6: add knob to send unsolicited ND on link-layer address change")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 14:07:18 -04:00
Peter Tirsek 6bd3d19292 netfilter: xt_socket: Fix broken IPv6 handling
Commit 834184b1f3 ("netfilter: defrag: only register defrag
functionality if needed") used the outdated XT_SOCKET_HAVE_IPV6 macro
which was removed earlier in commit 8db4c5be88 ("netfilter: move
socket lookup infrastructure to nf_socket_ipv{4,6}.c"). With that macro
never being defined, the xt_socket match emits an "Unknown family 10"
warning when used with IPv6:

WARNING: CPU: 0 PID: 1377 at net/netfilter/xt_socket.c:160 socket_mt_enable_defrag+0x47/0x50 [xt_socket]
Unknown family 10
Modules linked in: xt_socket nf_socket_ipv4 nf_socket_ipv6 nf_defrag_ipv4 [...]
CPU: 0 PID: 1377 Comm: ip6tables-resto Not tainted 4.10.10 #1
Hardware name: [...]
Call Trace:
? __warn+0xe7/0x100
? socket_mt_enable_defrag+0x47/0x50 [xt_socket]
? socket_mt_enable_defrag+0x47/0x50 [xt_socket]
? warn_slowpath_fmt+0x39/0x40
? socket_mt_enable_defrag+0x47/0x50 [xt_socket]
? socket_mt_v2_check+0x12/0x40 [xt_socket]
? xt_check_match+0x6b/0x1a0 [x_tables]
? xt_find_match+0x93/0xd0 [x_tables]
? xt_request_find_match+0x20/0x80 [x_tables]
? translate_table+0x48e/0x870 [ip6_tables]
? translate_table+0x577/0x870 [ip6_tables]
? walk_component+0x3a/0x200
? kmalloc_order+0x1d/0x50
? do_ip6t_set_ctl+0x181/0x490 [ip6_tables]
? filename_lookup+0xa5/0x120
? nf_setsockopt+0x3a/0x60
? ipv6_setsockopt+0xb0/0xc0
? sock_common_setsockopt+0x23/0x30
? SyS_socketcall+0x41d/0x630
? vfs_read+0xfa/0x120
? do_fast_syscall_32+0x7a/0x110
? entry_SYSENTER_32+0x47/0x71

This patch brings the conditional back in line with how the rest of the
file handles IPv6.

Fixes: 834184b1f3 ("netfilter: defrag: only register defrag functionality if needed")
Signed-off-by: Peter Tirsek <peter@tirsek.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:06:29 +02:00
Liping Zhang 64f3967c7a netfilter: ctnetlink: acquire ct->lock before operating nf_ct_seqadj
We should acquire the ct->lock before accessing or modifying the
nf_ct_seqadj, as another CPU may modify the nf_ct_seqadj at the same
time during its packet proccessing.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:06:29 +02:00
Liping Zhang 53b56da83d netfilter: ctnetlink: make it safer when updating ct->status
After converting to use rcu for conntrack hash, one CPU may update
the ct->status via ctnetlink, while another CPU may process the
packets and update the ct->status.

So the non-atomic operation "ct->status |= status;" via ctnetlink
becomes unsafe, and this may clear the IPS_DYING_BIT bit set by
another CPU unexpectedly. For example:
         CPU0                            CPU1
  ctnetlink_change_status        __nf_conntrack_find_get
      old = ct->status              nf_ct_gc_expired
          -                         nf_ct_kill
          -                      test_and_set_bit(IPS_DYING_BIT
      new = old | status;                 -
  ct->status = new; <-- oops, _DYING_ is cleared!

Now using a series of atomic bit operation to solve the above issue.

Also note, user shouldn't set IPS_TEMPLATE, IPS_SEQ_ADJUST directly,
so make these two bits be unchangable too.

If we set the IPS_TEMPLATE_BIT, ct will be freed by nf_ct_tmpl_free,
but actually it is alloced by nf_conntrack_alloc.
If we set the IPS_SEQ_ADJUST_BIT, this may cause the NULL pointer
deference, as the nfct_seqadj(ct) maybe NULL.

Last, add some comments to describe the logic change due to the
commit a963d710f3 ("netfilter: ctnetlink: Fix regression in CTA_STATUS
processing"), which makes me feel a little confusing.

Fixes: 76507f69c4 ("[NETFILTER]: nf_conntrack: use RCU for conntrack hash")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:06:28 +02:00
Liping Zhang 88be4c09d9 netfilter: ctnetlink: fix deadlock due to acquire _expect_lock twice
Currently, ctnetlink_change_conntrack is always protected by _expect_lock,
but this will cause a deadlock when deleting the helper from a conntrack,
as the _expect_lock will be acquired again by nf_ct_remove_expectations:

         CPU0
        ----
  lock(nf_conntrack_expect_lock);
  lock(nf_conntrack_expect_lock);

  *** DEADLOCK ***
  May be due to missing lock nesting notation

  2 locks held by lt-conntrack_gr/12853:
  #0:  (&table[i].mutex){+.+.+.}, at: [<ffffffffa05e2009>]
       nfnetlink_rcv_msg+0x399/0x6a9 [nfnetlink]
  #1:  (nf_conntrack_expect_lock){+.....}, at: [<ffffffffa05f2c1f>]
       ctnetlink_new_conntrack+0x17f/0x408 [nf_conntrack_netlink]

  Call Trace:
   dump_stack+0x85/0xc2
   __lock_acquire+0x1608/0x1680
   ? ctnetlink_parse_tuple_proto+0x10f/0x1c0 [nf_conntrack_netlink]
   lock_acquire+0x100/0x1f0
   ? nf_ct_remove_expectations+0x32/0x90 [nf_conntrack]
   _raw_spin_lock_bh+0x3f/0x50
   ? nf_ct_remove_expectations+0x32/0x90 [nf_conntrack]
   nf_ct_remove_expectations+0x32/0x90 [nf_conntrack]
   ctnetlink_change_helper+0xc6/0x190 [nf_conntrack_netlink]
   ctnetlink_new_conntrack+0x1b2/0x408 [nf_conntrack_netlink]
   nfnetlink_rcv_msg+0x60a/0x6a9 [nfnetlink]
   ? nfnetlink_rcv_msg+0x1b9/0x6a9 [nfnetlink]
   ? nfnetlink_bind+0x1a0/0x1a0 [nfnetlink]
   netlink_rcv_skb+0xa4/0xc0
   nfnetlink_rcv+0x87/0x770 [nfnetlink]

Since the operations are unrelated to nf_ct_expect, so we can drop the
_expect_lock. Also note, after removing the _expect_lock protection,
another CPU may invoke nf_conntrack_helper_unregister, so we should
use rcu_read_lock to protect __nf_conntrack_helper_find invoked by
ctnetlink_change_helper.

Fixes: ca7433df3a ("netfilter: conntrack: seperate expect locking from nf_conntrack_lock")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:06:28 +02:00
Liping Zhang 14e5676156 netfilter: ctnetlink: drop the incorrect cthelper module request
First, when creating a new ct, we will invoke request_module to try to
load the related inkernel cthelper. So there's no need to call the
request_module again when updating the ct helpinfo.

Second, ctnetlink_change_helper may be called with rcu_read_lock held,
i.e. rcu_read_lock -> nfqnl_recv_verdict -> nfqnl_ct_parse ->
ctnetlink_glue_parse -> ctnetlink_glue_parse_ct ->
ctnetlink_change_helper. But the request_module invocation may sleep,
so we can't call it with the rcu_read_lock held.

Remove it now.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:06:28 +02:00
Liping Zhang 54a5f9d9ab netfilter: nft_set_bitmap: free dummy elements when destroy the set
We forget to free dummy elements when deleting the set. So when I was
running nft-test.py, I saw many kmemleak warnings:
  kmemleak: 1344 new suspected memory leaks ...
  # cat /sys/kernel/debug/kmemleak
  unreferenced object 0xffff8800631345c8 (size 32):
  comm "nft", pid 9075, jiffies 4295743309 (age 1354.815s)
  hex dump (first 32 bytes):
    f8 63 13 63 00 88 ff ff 88 79 13 63 00 88 ff ff  .c.c.....y.c....
    04 0c 00 00 00 00 00 00 00 00 00 00 08 03 00 00  ................
  backtrace:
    [<ffffffff819059da>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff81288174>] __kmalloc+0x164/0x310
    [<ffffffffa061269d>] nft_set_elem_init+0x3d/0x1b0 [nf_tables]
    [<ffffffffa06130da>] nft_add_set_elem+0x45a/0x8c0 [nf_tables]
    [<ffffffffa0613645>] nf_tables_newsetelem+0x105/0x1d0 [nf_tables]
    [<ffffffffa05fe6d4>] nfnetlink_rcv+0x414/0x770 [nfnetlink]
    [<ffffffff817f0ca6>] netlink_unicast+0x1f6/0x310
    [<ffffffff817f10c6>] netlink_sendmsg+0x306/0x3b0
  ...

Fixes: e920dde516 ("netfilter: nft_set_bitmap: keep a list of dummy elements")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:05:25 +02:00
Liping Zhang 66e5a6b18b netfilter: nf_ct_helper: permit cthelpers with different names via nfnetlink
cthelpers added via nfnetlink may have the same tuple, i.e. except for
the l3proto and l4proto, other fields are all zero. So even with the
different names, we will also fail to add them:
  # nfct helper add ssdp inet udp
  # nfct helper add tftp inet udp
  nfct v1.4.3: netlink error: File exists

So in order to avoid unpredictable behaviour, we should:
1. cthelpers can be selected by nft ct helper obj or xt_CT target, so
report error if duplicated { name, l3proto, l4proto } tuple exist.
2. cthelpers can be selected by nf_ct_tuple_src_mask_cmp when
nf_ct_auto_assign_helper is enabled, so also report error if duplicated
{ l3proto, l4proto, src-port } tuple exist.

Also note, if the cthelper is added from userspace, then the src-port will
always be zero, it's invalid for nf_ct_auto_assign_helper, so there's no
need to check the second point listed above.

Fixes: 893e093c78 ("netfilter: nf_ct_helper: bail out on duplicated helpers")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:05:05 +02:00
Jarno Rajahalme cf5d709188 openvswitch: Delete conntrack entry clashing with an expectation.
Conntrack helpers do not check for a potentially clashing conntrack
entry when creating a new expectation.  Also, nf_conntrack_in() will
check expectations (via init_conntrack()) only if a conntrack entry
can not be found.  The expectation for a packet which also matches an
existing conntrack entry will not be removed by conntrack, and is
currently handled inconsistently by OVS, as OVS expects the
expectation to be removed when the connection tracking entry matching
that expectation is confirmed.

It should be noted that normally an IP stack would not allow reuse of
a 5-tuple of an old (possibly lingering) connection for a new data
connection, so this is somewhat unlikely corner case.  However, it is
possible that a misbehaving source could cause conntrack entries be
created that could then interfere with new related connections.

Fix this in the OVS module by deleting the clashing conntrack entry
after an expectation has been matched.  This causes the following
nf_conntrack_in() call also find the expectation and remove it when
creating the new conntrack entry, as well as the forthcoming reply
direction packets to match the new related connection instead of the
old clashing conntrack entry.

Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
Reported-by: Yang Song <yangsong@vmware.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:04:41 +02:00
Gao Feng 470acf55a0 netfilter: xt_CT: fix refcnt leak on error path
There are two cases which causes refcnt leak.

1. When nf_ct_timeout_ext_add failed in xt_ct_set_timeout, it should
free the timeout refcnt.
Now goto the err_put_timeout error handler instead of going ahead.

2. When the time policy is not found, we should call module_put.
Otherwise, the related cthelper module cannot be removed anymore.
It is easy to reproduce by typing the following command:
  # iptables -t raw -A OUTPUT -p tcp -j CT --helper ftp --timeout xxx

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:03:01 +02:00
Jarno Rajahalme 120645513f openvswitch: Add eventmask support to CT action.
Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
which can be used in conjunction with the commit flag
(OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
conntrack events (IPCT_*) should be delivered via the Netfilter
netlink multicast groups.  Default behavior depends on the system
configuration, but typically a lot of events are delivered.  This can be
very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
types of events are of interest.

Netfilter core init_conntrack() adds the event cache extension, so we
only need to set the ctmask value.  However, if the system is
configured without support for events, the setting will be skipped due
to extension not being found.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 13:53:25 -04:00
Jarno Rajahalme abd0a4f2b4 openvswitch: Typo fix.
Fix typo in a comment.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 13:53:24 -04:00
Ansis Atteka b40c5f4fde udp: disable inner UDP checksum offloads in IPsec case
Otherwise, UDP checksum offloads could corrupt ESP packets by attempting
to calculate UDP checksum when this inner UDP packet is already protected
by IPsec.

One way to reproduce this bug is to have a VM with virtio_net driver (UFO
set to ON in the guest VM); and then encapsulate all guest's Ethernet
frames in Geneve; and then further encrypt Geneve with IPsec.  In this
case following symptoms are observed:
1. If using ixgbe NIC, then it will complain with following error message:
   ixgbe 0000:01:00.1: partial checksum but l4 proto=32!
2. Receiving IPsec stack will drop all the corrupted ESP packets and
   increase XfrmInStateProtoError counter in /proc/net/xfrm_stat.
3. iperf UDP test from the VM with packet sizes above MTU will not work at
   all.
4. iperf TCP test from the VM will get ridiculously low performance because.

Signed-off-by: Ansis Atteka <aatteka@ovn.org>
Co-authored-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 13:48:54 -04:00
Robert Shearman b7c8487cb3 ipv4: Avoid caching l3mdev dst on mismatched local route
David reported that doing the following:

    ip li add red type vrf table 10
    ip link set dev eth1 vrf red
    ip addr add 127.0.0.1/8 dev red
    ip link set dev eth1 up
    ip li set red up
    ping -c1 -w1 -I red 127.0.0.1
    ip li del red

when either policy routing IP rules are present or the local table
lookup ip rule is before the l3mdev lookup results in a hang with
these messages:

    unregister_netdevice: waiting for red to become free. Usage count = 1

The problem is caused by caching the dst used for sending the packet
out of the specified interface on a local route with a different
nexthop interface. Thus the dst could stay around until the route in
the table the lookup was done is deleted which may be never.

Address the problem by not forcing output device to be the l3mdev in
the flow's output interface if the lookup didn't use the l3mdev. This
then results in the dst using the right device according to the route.

Changes in v2:
 - make the dev_out passed in by __ip_route_output_key_hash correct
   instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as
   suggested by David.

Fixes: 5f02ce24c2 ("net: l3mdev: Allow the l3mdev to be a loopback")
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Suggested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 12:50:29 -04:00
Mike Maloney 4a69a86420 packet: add PACKET_FANOUT_FLAG_UNIQUEID to assign new fanout group id.
Fanout uses a per net global namespace. A process that intends to create
a new fanout group can accidentally join an existing group. It is not
possible to detect this.

Add socket option PACKET_FANOUT_FLAG_UNIQUEID.  When specified the
supplied fanout group id must be set to 0, and the kernel chooses an id
that is not already in use.  This is an ephemeral flag so that
other sockets can be added to this group using setsockopt, but NOT
specifying this flag.  The current getsockopt(..., PACKET_FANOUT, ...)
can be used to retrieve the new group id.

We assume that there are not a lot of fanout groups and that this is not
a high frequency call.

The method assigns ids starting at zero and increases until it finds an
unused id.  It keeps track of the last assigned id, and uses it as a
starting point to find new ids.

Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 12:46:00 -04:00
Gerard Garcia 82dfb540ae VSOCK: Add virtio vsock vsockmon hooks
The virtio drivers deal with struct virtio_vsock_pkt.  Add
virtio_transport_deliver_tap_pkt(pkt) for handing packets to the
vsockmon device.

We call virtio_transport_deliver_tap_pkt(pkt) from
net/vmw_vsock/virtio_transport.c and drivers/vhost/vsock.c instead of
common code.  This is because the drivers may drop packets before
handing them to common code - we still want to capture them.

Signed-off-by: Gerard Garcia <ggarcia@deic.uab.cat>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 12:35:56 -04:00
Gerard Garcia 531b374834 VSOCK: Add vsockmon tap functions
Add tap functions that can be used by the vsock transports to
deliver packets to vsockmon virtual network devices.

Signed-off-by: Gerard Garcia <ggarcia@deic.uab.cat>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 12:35:56 -04:00
Steffen Klassert e892d2d404 esp: Fix misplaced spin_unlock_bh.
A recent commit moved esp_alloc_tmp() out of a lock
protected region, but forgot to remove the unlock from
the error path. This patch removes the forgotten unlock.
While at it, remove some unneeded error assignments too.

Fixes: fca11ebde3 ("esp4: Reorganize esp_output")
Fixes: 383d0350f2 ("esp6: Reorganize esp_output")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-24 07:56:31 +02:00
Ingo Molnar 58d30c36d4 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Documentation updates.

 - Miscellaneous fixes.

 - Parallelize SRCU callback handling (plus overlapping patches).

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23 11:12:44 +02:00
Roi Dayan f43e9b069a net/devlink: Add E-Switch encapsulation control
This is an e-switch global knob to enable HW support for applying
encapsulation/decapsulation to VF traffic as part of SRIOV e-switch offloading.

The actual encap/decap is carried out (along with the matching and other actions)
per offloaded e-switch rules, e.g as done when offloading the TC tunnel key action.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-22 20:26:37 +03:00
David S. Miller fb796707d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Both conflict were simple overlapping changes.

In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 20:23:53 -07:00
Tushar Dave c70b17b775 netpoll: Check for skb->queue_mapping
Reducing real_num_tx_queues needs to be in sync with skb queue_mapping
otherwise skbs with queue_mapping greater than real_num_tx_queues
can be sent to the underlying driver and can result in kernel panic.

One such event is running netconsole and enabling VF on the same
device. Or running netconsole and changing number of tx queues via
ethtool on same device.

e.g.
Unable to handle kernel NULL pointer dereference
tsk->{mm,active_mm}->context = 0000000000001525
tsk->{mm,active_mm}->pgd = fff800130ff9a000
              \|/ ____ \|/
              "@'/ .. \`@"
              /_| \__/ |_\
                 \__U_/
kworker/48:1(475): Oops [#1]
CPU: 48 PID: 475 Comm: kworker/48:1 Tainted: G           OE
4.11.0-rc3-davem-net+ #7
Workqueue: events queue_process
task: fff80013113299c0 task.stack: fff800131132c000
TSTATE: 0000004480e01600 TPC: 00000000103f9e3c TNPC: 00000000103f9e40 Y:
00000000    Tainted: G           OE
TPC: <ixgbe_xmit_frame_ring+0x7c/0x6c0 [ixgbe]>
g0: 0000000000000000 g1: 0000000000003fff g2: 0000000000000000 g3:
0000000000000001
g4: fff80013113299c0 g5: fff8001fa6808000 g6: fff800131132c000 g7:
00000000000000c0
o0: fff8001fa760c460 o1: fff8001311329a50 o2: fff8001fa7607504 o3:
0000000000000003
o4: fff8001f96e63a40 o5: fff8001311d77ec0 sp: fff800131132f0e1 ret_pc:
000000000049ed94
RPC: <set_next_entity+0x34/0xb80>
l0: 0000000000000000 l1: 0000000000000800 l2: 0000000000000000 l3:
0000000000000000
l4: 000b2aa30e34b10d l5: 0000000000000000 l6: 0000000000000000 l7:
fff8001fa7605028
i0: fff80013111a8a00 i1: fff80013155a0780 i2: 0000000000000000 i3:
0000000000000000
i4: 0000000000000000 i5: 0000000000100000 i6: fff800131132f1a1 i7:
00000000103fa4b0
I7: <ixgbe_xmit_frame+0x30/0xa0 [ixgbe]>
Call Trace:
 [00000000103fa4b0] ixgbe_xmit_frame+0x30/0xa0 [ixgbe]
 [0000000000998c74] netpoll_start_xmit+0xf4/0x200
 [0000000000998e10] queue_process+0x90/0x160
 [0000000000485fa8] process_one_work+0x188/0x480
 [0000000000486410] worker_thread+0x170/0x4c0
 [000000000048c6b8] kthread+0xd8/0x120
 [0000000000406064] ret_from_fork+0x1c/0x2c
 [0000000000000000]           (null)
Disabling lock debugging due to kernel taint
Caller[00000000103fa4b0]: ixgbe_xmit_frame+0x30/0xa0 [ixgbe]
Caller[0000000000998c74]: netpoll_start_xmit+0xf4/0x200
Caller[0000000000998e10]: queue_process+0x90/0x160
Caller[0000000000485fa8]: process_one_work+0x188/0x480
Caller[0000000000486410]: worker_thread+0x170/0x4c0
Caller[000000000048c6b8]: kthread+0xd8/0x120
Caller[0000000000406064]: ret_from_fork+0x1c/0x2c
Caller[0000000000000000]:           (null)

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 15:45:19 -04:00
Nikolay Aleksandrov 723b929ca0 ip6mr: fix notification device destruction
Andrey Konovalov reported a BUG caused by the ip6mr code which is caused
because we call unregister_netdevice_many for a device that is already
being destroyed. In IPv4's ipmr that has been resolved by two commits
long time ago by introducing the "notify" parameter to the delete
function and avoiding the unregister when called from a notifier, so
let's do the same for ip6mr.

The trace from Andrey:
------------[ cut here ]------------
kernel BUG at net/core/dev.c:6813!
invalid opcode: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 1165 Comm: kworker/u4:3 Not tainted 4.11.0-rc7+ #251
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Workqueue: netns cleanup_net
task: ffff880069208000 task.stack: ffff8800692d8000
RIP: 0010:rollback_registered_many+0x348/0xeb0 net/core/dev.c:6813
RSP: 0018:ffff8800692de7f0 EFLAGS: 00010297
RAX: ffff880069208000 RBX: 0000000000000002 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88006af90569
RBP: ffff8800692de9f0 R08: ffff8800692dec60 R09: 0000000000000000
R10: 0000000000000006 R11: 0000000000000000 R12: ffff88006af90070
R13: ffff8800692debf0 R14: dffffc0000000000 R15: ffff88006af90000
FS:  0000000000000000(0000) GS:ffff88006cb00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe7e897d870 CR3: 00000000657e7000 CR4: 00000000000006e0
Call Trace:
 unregister_netdevice_many.part.105+0x87/0x440 net/core/dev.c:7881
 unregister_netdevice_many+0xc8/0x120 net/core/dev.c:7880
 ip6mr_device_event+0x362/0x3f0 net/ipv6/ip6mr.c:1346
 notifier_call_chain+0x145/0x2f0 kernel/notifier.c:93
 __raw_notifier_call_chain kernel/notifier.c:394
 raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
 call_netdevice_notifiers_info+0x51/0x90 net/core/dev.c:1647
 call_netdevice_notifiers net/core/dev.c:1663
 rollback_registered_many+0x919/0xeb0 net/core/dev.c:6841
 unregister_netdevice_many.part.105+0x87/0x440 net/core/dev.c:7881
 unregister_netdevice_many net/core/dev.c:7880
 default_device_exit_batch+0x4fa/0x640 net/core/dev.c:8333
 ops_exit_list.isra.4+0x100/0x150 net/core/net_namespace.c:144
 cleanup_net+0x5a8/0xb40 net/core/net_namespace.c:463
 process_one_work+0xc04/0x1c10 kernel/workqueue.c:2097
 worker_thread+0x223/0x19c0 kernel/workqueue.c:2231
 kthread+0x35e/0x430 kernel/kthread.c:231
 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
Code: 3c 32 00 0f 85 70 0b 00 00 48 b8 00 02 00 00 00 00 ad de 49 89
47 78 e9 93 fe ff ff 49 8d 57 70 49 8d 5f 78 eb 9e e8 88 7a 14 fe <0f>
0b 48 8b 9d 28 fe ff ff e8 7a 7a 14 fe 48 b8 00 00 00 00 00
RIP: rollback_registered_many+0x348/0xeb0 RSP: ffff8800692de7f0
---[ end trace e0b29c57e9b3292c ]---

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 15:35:47 -04:00
David S. Miller 69e3948aaa NFC 4.12 pull request
This is the NFC pull request for 4.12. We have:
 
 - Improvements for the pn533 command queue handling and device
   registration order.
 - Removal of platform data for the pn544 and st21nfca drivers.
 - Additional device tree options to support more trf7970a hardware options.
 - Support for Sony's RC-S380P through the port100 driver.
 - Removal of the obsolte nfcwilink driver.
 - Headers inclusion cleanups (miscdevice.h, unaligned.h) for many drivers.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJY8//bAAoJEIqAPN1PVmxKYVsP/0d9V98WuvBiyNffRwNLbol1
 w37Er17cIma4Tzrm9jWzwGFCAd4k5Bn3K6rEXejsnSCkvSPZaRvlsd9itpmxmYhs
 SkWPl9IoPi9wWrHkr20p34n1OdZdqx+R6CtKNB4B7t7EASWlZ6BMl4RgeO03QckA
 FHZSGszOWMr9OF/+ZLBJm66JlNTkNiaumjFXeayXEzkv2JhnZqxdLqR8117Ycwa1
 MvSYzvcOAV1OWlaiyc3VzyF49D3DcxweC4lgx3JkQ1CPzcIIgPYaws1QGLraSwUT
 JSVWn3P0WFM8sPJEGDa7XKjVPfy7mW2wgQ2oJVZJR5TOygyonkNuTK2ohEXp0SUI
 xzH/qbQmvKb/VbwdXWj4N7rnfpdry/C52S5+nn/pLV6Y2S7LF4FGvUMWUQmh2uu3
 kw2SQqEHLcbHnDz3G50UfTJ9mH1CVP8a4HsM39Wtm79H3IVmnS2+owm/wdSrqq6h
 5i/nL7L/6XDj+yg+2th1BdHxhA6F7aTDxxFpgF25K+y79tm2Fvnic6pQBfwRTpvv
 FfvTMpJAdC9OkLppNb3PLUT+YnSN1YgH7Hgv6rFc/KiVJ4rMFMXV1EaWdzWWuRd5
 U8Obl1Nag2SmSSVrRAr56yfltkJlhqcoLk01Go3d/qYF4GO7LFrmSoODH0L0JDaE
 mH/vYF47mkFvWicF950v
 =Zy1M
 -----END PGP SIGNATURE-----

Merge tag 'nfc-next-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next

Samuel Ortiz says:

====================
NFC 4.12 pull request

This is the NFC pull request for 4.12. We have:

- Improvements for the pn533 command queue handling and device
  registration order.
- Removal of platform data for the pn544 and st21nfca drivers.
- Additional device tree options to support more trf7970a hardware options.
- Support for Sony's RC-S380P through the port100 driver.
- Removal of the obsolte nfcwilink driver.
- Headers inclusion cleanups (miscdevice.h, unaligned.h) for many drivers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 15:29:40 -04:00
Dan Carpenter 6f60f43810 net: qrtr: potential use after free in qrtr_sendmsg()
If skb_pad() fails then it frees the skb so we should check for errors.

Fixes: bdabad3e36 ("net: Add Qualcomm IPC router")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 15:19:27 -04:00
David S. Miller 6b633e82b0 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2017-04-20

This adds the basic infrastructure for IPsec hardware
offloading, it creates a configuration API and adjusts
the packet path.

1) Add the needed netdev features to configure IPsec offloads.

2) Add the IPsec hardware offloading API.

3) Prepare the ESP packet path for hardware offloading.

4) Add gso handlers for esp4 and esp6, this implements
   the software fallback for GSO packets.

5) Add xfrm replay handler functions for offloading.

6) Change ESP to use a synchronous crypto algorithm on
   offloading, we don't have the option for asynchronous
   returns when we handle IPsec at layer2.

7) Add a xfrm validate function to validate_xmit_skb. This
   implements the software fallback for non GSO packets.

8) Set the inner_network and inner_transport members of
   the SKB, as well as encapsulation, to reflect the actual
   positions of these headers, and removes them only once
   encryption is done on the payload.
   From Ilan Tayari.

9) Prepare the ESP GRO codepath for hardware offloading.

10) Fix incorrect null pointer check in esp6.
    From Colin Ian King.

11) Fix for the GSO software fallback path to detect the
    fallback correctly.
    From Ilan Tayari.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 15:11:28 -04:00
WANG Cong 4392053879 net_sched: remove useless NULL to tp->root
There is no need to NULL tp->root in ->destroy(), since tp is
going to be freed very soon, and existing readers are still
safe to read them.

For cls_route, we always init its tp->root, so it can't be NULL,
we can drop more useless code.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:58:15 -04:00
WANG Cong 763dbf6328 net_sched: move the empty tp check from ->destroy() to ->delete()
We could have a race condition where in ->classify() path we
dereference tp->root and meanwhile a parallel ->destroy() makes it
a NULL. Daniel cured this bug in commit d936377414
("net, sched: respect rcu grace period on cls destruction").

This happens when ->destroy() is called for deleting a filter to
check if we are the last one in tp, this tp is still linked and
visible at that time. The root cause of this problem is the semantic
of ->destroy(), it does two things (for non-force case):

1) check if tp is empty
2) if tp is empty we could really destroy it

and its caller, if cares, needs to check its return value to see if it
is really destroyed. Therefore we can't unlink tp unless we know it is
empty.

As suggested by Daniel, we could actually move the test logic to ->delete()
so that we can safely unlink tp after ->delete() tells us the last one is
just deleted and before ->destroy().

Fixes: 1e052be69d ("net_sched: destroy proto tp when all filters are gone")
Cc: Roi Dayan <roid@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:58:15 -04:00
Al Viro 3b6d4dbf09 sctp: switch to copy_from_iter_full()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 13:57:27 -04:00
Al Viro 1c512a7ca9 net/9p: switch to copy_from_iter_full()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 13:57:22 -04:00
Al Viro dc88e3b4c8 rds: make use of iov_iter_revert() 2017-04-21 13:57:11 -04:00
David Ahern 557c44be91 net: ipv6: RTF_PCPU should not be settable from userspace
Andrey reported a fault in the IPv6 route code:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 4035 Comm: a.out Not tainted 4.11.0-rc7+ #250
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880069809600 task.stack: ffff880062dc8000
RIP: 0010:ip6_rt_cache_alloc+0xa6/0x560 net/ipv6/route.c:975
RSP: 0018:ffff880062dced30 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: ffff8800670561c0 RCX: 0000000000000006
RDX: 0000000000000003 RSI: ffff880062dcfb28 RDI: 0000000000000018
RBP: ffff880062dced68 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff880062dcfb28 R14: dffffc0000000000 R15: 0000000000000000
FS:  00007feebe37e7c0(0000) GS:ffff88006cb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000205a0fe4 CR3: 000000006b5c9000 CR4: 00000000000006e0
Call Trace:
 ip6_pol_route+0x1512/0x1f20 net/ipv6/route.c:1128
 ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
...

Andrey's syzkaller program passes rtmsg.rtmsg_flags with the RTF_PCPU bit
set. Flags passed to the kernel are blindly copied to the allocated
rt6_info by ip6_route_info_create making a newly inserted route appear
as though it is a per-cpu route. ip6_rt_cache_alloc sees the flag set
and expects rt->dst.from to be set - which it is not since it is not
really a per-cpu copy. The subsequent call to __ip6_dst_alloc then
generates the fault.

Fix by checking for the flag and failing with EINVAL.

Fixes: d52d3997f8 ("ipv6: Create percpu rt6_info")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:55:33 -04:00
Daniel Borkmann b1d9fc41aa bpf: add napi_id read access to __sk_buff
Add napi_id access to __sk_buff for socket filter program types, tc
program types and other bpf_convert_ctx_access() users. Having access
to skb->napi_id is useful for per RX queue listener siloing, f.e.
in combination with SO_ATTACH_REUSEPORT_EBPF and when busy polling is
used, meaning SO_REUSEPORT enabled listeners can then select the
corresponding socket at SYN time already [1]. The skb is marked via
skb_mark_napi_id() early in the receive path (e.g., napi_gro_receive()).

Currently, sockets can only use SO_INCOMING_NAPI_ID from 6d4339028b
("net: Introduce SO_INCOMING_NAPI_ID") as a socket option to look up
the NAPI ID associated with the queue for steering, which requires a
prior sk_mark_napi_id() after the socket was looked up.

Semantics for the __sk_buff napi_id access are similar, meaning if
skb->napi_id is < MIN_NAPI_ID (e.g. outgoing packets using sender_cpu),
then an invalid napi_id of 0 is returned to the program, otherwise a
valid non-zero napi_id.

  [1] http://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:53:27 -04:00
Ilan Tayari 43170c4e0b gso: Validate assumption of frag_list segementation
Commit 07b26c9454 ("gso: Support partial splitting at the frag_list
pointer") assumes that all SKBs in a frag_list (except maybe the last
one) contain the same amount of GSO payload.

This assumption is not always correct, resulting in the following
warning message in the log:
    skb_segment: too many frags

For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
one frag, and some with 2 frags.
After GRO, the frag_list SKBs end up having different amounts of payload.
If this frag_list SKB is then forwarded, the aforementioned assumption
is violated.

Validate the assumption, and fall back to software GSO if it not true.

Change-Id: Ia03983f4a47b6534dd987d7a2aad96d54d46d212
Fixes: 07b26c9454 ("gso: Support partial splitting at the frag_list pointer")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:30:29 -04:00
Matthew Whitehead 7acf8a1e8a Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning
Constants used for tuning are generally a bad idea, especially as hardware
changes over time. Replace the constant 2 jiffies with sysctl variable
netdev_budget_usecs to enable sysadmins to tune the softirq processing.
Also document the variable.

For example, a very fast machine might tune this to 1000 microseconds,
while my regression testing 486DX-25 needs it to be 4000 microseconds on
a nearly idle network to prevent time_squeeze from being incremented.

Version 2: changed jiffies to microseconds for predictable units.

Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:22:34 -04:00
Craig Gallek 9830ad4c6a ip_tunnel: Allow policy-based routing through tunnels
This feature allows the administrator to set an fwmark for
packets traversing a tunnel.  This allows the use of independent
routing tables for tunneled packets without the use of iptables.

There is no concept of per-packet routing decisions through IPv4
tunnels, so this implementation does not need to work with
per-packet route lookups as the v6 implementation may
(with IP6_TNL_F_USE_ORIG_FWMARK).

Further, since the v4 tunnel ioctls share datastructures
(which can not be trivially modified) with the kernel's internal
tunnel configuration structures, the mark attribute must be stored
in the tunnel structure itself and passed as a parameter when
creating or changing tunnel attributes.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:21:31 -04:00
Craig Gallek 0a473b82cb ip6_tunnel: Allow policy-based routing through tunnels
This feature allows the administrator to set an fwmark for
packets traversing a tunnel.  This allows the use of independent
routing tables for tunneled packets without the use of iptables.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:21:30 -04:00
David Lebrun 95b9b88d2d ipv6: sr: fix double free of skb after handling invalid SRH
The icmpv6_param_prob() function already does a kfree_skb(),
this patch removes the duplicate one.

Fixes: 1ababeba4a ("ipv6: implement dataplane support for rthdr type 4 (Segment Routing Header)")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:16:01 -04:00
Florian Fainelli 8e6c1812e6 net: dsa: Remove redundant NULL dst check
tag_lan9303.c does check for a NULL dst but that's already checked by
dsa_switch_rcv() one layer above.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Juergen Borleis <jbe@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 10:41:24 -04:00
Wolfgang Bumiller e0535ce58b net sched actions: allocate act cookie early
Policing filters do not use the TCA_ACT_* enum and the tb[]
nlattr array in tcf_action_init_1() doesn't get filled for
them so we should not try to look for a TCA_ACT_COOKIE
attribute in the then uninitialized array.
The error handling in cookie allocation then calls
tcf_hash_release() leading to invalid memory access later
on.
Additionally, if cookie allocation fails after an already
existing non-policing filter has successfully been changed,
tcf_action_release() should not be called, also we would
have to roll back the changes in the error handling, so
instead we now allocate the cookie early and assign it on
success at the end.

CVE-2017-7979
Fixes: 1045ba77a5 ("net sched actions: Add support for user cookies")
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 16:32:07 -04:00
David S. Miller 12a0a6c054 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2017-04-19

Two fixes for af_key:

1) Add a lock to key dump to prevent a NULL pointer dereference.
   From Yuejie Shi.

2) Fix slab-out-of-bounds in parse_ipsecrequests.
   From Herbert Xu.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 16:19:46 -04:00
Chema Gonzalez d6ecf32805 tcp_cubic: fix typo in module param description
Signed-off-by: Chema Gonzalez <chemag@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 16:16:44 -04:00
subashab@codeaurora.org 0bd84065b1 net: ipv6: Fix UDP early demux lookup with udp_l3mdev_accept=0
David Ahern reported that 5425077d73 ("net: ipv6: Add early demux
handler for UDP unicast") breaks udp_l3mdev_accept=0 since early
demux for IPv6 UDP was doing a generic socket lookup which does not
require an exact match. Fix this by making UDPv6 early demux match
connected sockets only.

v1->v2: Take reference to socket after match as suggested by Eric
v2->v3: Add comment before break

Fixes: 5425077d73 ("net: ipv6: Add early demux handler for UDP unicast")
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 15:50:27 -04:00
Eric Dumazet 0f9fa831ae tcp: remove poll() flakes with FastOpen
When using TCP FastOpen for an active session, we send one wakeup event
from tcp_finish_connect(), right before the data eventually contained in
the received SYNACK is queued to sk->sk_receive_queue.

This means that depending on machine load or luck, poll() users
might receive POLLOUT events instead of POLLIN|POLLOUT

To fix this, we need to move the call to sk->sk_state_change()
after the (optional) call to tcp_rcv_fastopen_synack()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 15:42:11 -04:00
Eric Dumazet 3d4762639d tcp: remove poll() flakes when receiving RST
When a RST packet is processed, we send two wakeup events to interested
polling users.

First one by a sk->sk_error_report(sk) from tcp_reset(),
followed by a sk->sk_state_change(sk) from tcp_done().

Depending on machine load and luck, poll() can either return POLLERR,
or POLLIN|POLLOUT|POLLERR|POLLHUP (this happens on 99 % of the cases)

This is probably fine, but we can avoid the confusion by reordering
things so that we have more TCP fields updated before the first wakeup.

This might even allow us to remove some barriers we added in the past.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 15:42:10 -04:00
David Lebrun 2f3bb64247 ipv6: sr: fix out-of-bounds access in SRH validation
This patch fixes an out-of-bounds access in seg6_validate_srh() when the
trailing data is less than sizeof(struct sr6_tlv).

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 15:40:33 -04:00
Johannes Berg 3018e947d7 mac80211: reject ToDS broadcast data frames
AP/AP_VLAN modes don't accept any real 802.11 multicast data
frames, but since they do need to accept broadcast management
frames the same is currently permitted for data frames. This
opens a security problem because such frames would be decrypted
with the GTK, and could even contain unicast L3 frames.

Since the spec says that ToDS frames must always have the BSSID
as the RA (addr1), reject any other data frames.

The problem was originally reported in "Predicting, Decrypting,
and Abusing WPA2/802.11 Group Keys" at usenix
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/vanhoef
and brought to my attention by Jouni.

Cc: stable@vger.kernel.org
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
--
Dave, I didn't want to send you a new pull request for a single
commit yet again - can you apply this one patch as is?
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 15:37:46 -04:00
Tobias Klauser 1f504ec989 bpf: remove reference to sock_filter_ext from kerneldoc comment
struct sock_filter_ext didn't make it into the tree and is now called
struct bpf_insn. Reword the kerneldoc comment for bpf_convert_filter()
accordingly.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 15:26:49 -04:00
David S. Miller 028f43bc64 My last pull request has been a while, we now have:
* connection quality monitoring with multiple thresholds
  * support for FILS shared key authentication offload
  * pre-CAC regulatory compliance - only ETSI allows this
  * sanity check for some rate confusion that hit ChromeOS
    (but nobody else uses it, evidently)
  * some documentation updates
  * lots of cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlj12HMACgkQa3t4Rpy0
 AB0ztBAAi0tH9xR/7iYgChyZV4S8PpYKo2QoQZofG8vzAztboqI4clAxbWEOsJHh
 qddjm+foiHVJtZj2LqxjDcaxk69VIh/ERSlR7ve7GCzz9WAAWBMHZop2eArHvgI1
 pqP4mQEZ7QISVo88H3LeRdj8NmTwfZYH8u8e2CN3yEpSh1PPrU+slaXRLrjB4uql
 XWwwJYQatgDw6Dj4vTIk++DqGo7OhK6CrC1gZLnyOtitTiPzRtfj8rdRHeRKdlj4
 wOkUaenjs5r9KsofNYZpzckHp2NEpgIruqCsNdRGHf14EWBC5Q1N35OUOecyQ67T
 3VeSnHxU4qjomkXgwqmDKFFOdqtqIruor3YDdO1iwO2TNF+JlNfq5AqUNec/XjUv
 VDmj1NRZE0ftJtCkDFm1Q/ABfVDH9i2O6ZBs6a3zb65lA83q1y4xlF48LqDzG3qi
 fNnfRO2rOOiyosF3HEkF5u1mfD6MRUtZAc2ZiHckGUpAngs5QOWKqtVgcgWjmbFW
 qDTKsFYi2YpGXZAnUjqS4ZtmcgRGEXqg1STJBt4cA8cnmI9Ka5GplACVhqzGeneH
 EYMESEct9BOpR6BjABmbZL09NtCkiTPYjiL4V//USr4f6NFhOeHHMYuxYFYIEgC6
 ldRjf4EUzZw0QJ8X6L+zxYI5m40fEJ7bGhlIdMo7fWXpRpCaF1Y=
 =f4VT
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2017-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
My last pull request has been a while, we now have:
 * connection quality monitoring with multiple thresholds
 * support for FILS shared key authentication offload
 * pre-CAC regulatory compliance - only ETSI allows this
 * sanity check for some rate confusion that hit ChromeOS
   (but nobody else uses it, evidently)
 * some documentation updates
 * lots of cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 13:54:40 -04:00
Juergen Beisert e8fe177a62 net: dsa: add support for the SMSC-LAN9303 tagging format
To define the outgoing port and to discover the incoming port a regular
VLAN tag is used by the LAN9303. But its VID meaning is 'special'.

This tag handler/filter depends on some hardware features which must be
enabled in the device to provide and make use of this special VLAN tag
to control the destination and the source of an ethernet packet.

Signed-off-by: Juergen Borleis <jbe@pengutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 13:48:54 -04:00
NeilBrown 62b2417e84 sunrpc: don't check for failure from mempool_alloc()
When mempool_alloc() is allowed to sleep (GFP_NOIO allows
sleeping) it cannot fail.
So rpc_alloc_task() cannot fail, so rpc_new_task doesn't need
to test for failure.
Consequently rpc_new_task() cannot fail, so the callers
don't need to test.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:44:57 -04:00
David S. Miller 6324805979 A single fix, for the MU-MIMO monitor mode, that fixes
bad SKB accesses if the SKB was paged, which is the case
 for the only driver supporting this - iwlwifi.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlj1zD0ACgkQa3t4Rpy0
 AB0zsRAAgQ+rAwSjMfd4HtqiM80JLOhDzjkpw3wXvJvA3eG47DIEdPp3i2rV3FjB
 pfFmcn/iAVoNgGwxPoFXqsSevxZ+Kv5LPhvvYeqCyOB1H8xIRQw1pHN93jHc19JT
 BXa/UAuZLcLNvChTWBNF2jbWeiLiErkRTFqqdveGp6FJxmfFH0GhOi0eZPCWgXOH
 AK35A/59rngYRXc7lZTX47i2FE49IcxUSbp/0ZLtn74VgHyHElltiEhaUva9lUki
 YaOOLter25HzZgsB+LpxQYOeOS88oF+8vvYbT07Vn5UpjVu6Y7WpKfCEx4BqliP5
 H2VQflSXTSGH+wX5sQLmtXA5oidlNfeA3r9THsaPBXCfnqNUFTaszfs7zzAO+JNC
 BCtmGDiz1c6EYOjKdIpqglfsTehDHkCDnVtOdiUN4oPDqYACgXH/XdfzVwoZCQSg
 be5idiS6DPVFeQQ19DO+mK+Iwcn9PZ2H2NunLuAeaHd6mR1hdKN8wcSF/ipgtjpR
 yNyQbSFKIgtvTHiBq42IZZzLsASAI/qAgVXQFnrbR8uzYCJwVscnt7Cg0nSNIVV4
 c5McdytcJ8b1NY37hNrwS9e9XJ8oJmOjOOewxciRNLKJq2NjVkB7wh4oYiCL899z
 fQrCvXEcIqv4m0YtnT4On1ouEn1B3A0XFSUVuIBNdec1rJK7ovg=
 =r4GR
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2017-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A single fix, for the MU-MIMO monitor mode, that fixes
bad SKB accesses if the SKB was paged, which is the case
for the only driver supporting this - iwlwifi.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 13:36:42 -04:00
David S. Miller 7b9f6da175 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A function in kernel/bpf/syscall.c which got a bug fix in 'net'
was moved to kernel/bpf/verifier.c in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 10:35:33 -04:00
Gao Feng 122868b378 netfilter: tcp: Use TCP_MAX_WSCALE instead of literal 14
The window scale may be enlarged from 14 to 15 according to the itef
draft https://tools.ietf.org/html/draft-nishida-tcpm-maxwin-03.

Use the macro TCP_MAX_WSCALE to support it easily with TCP stack in
the future.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal be7be6e161 netfilter: ipvs: fix incorrect conflict resolution
The commit ab8bc7ed86
("netfilter: remove nf_ct_is_untracked")
changed the line
   if (ct && !nf_ct_is_untracked(ct) && nfct_nat(ct)) {
	   to
   if (ct && nfct_nat(ct)) {

meanwhile, the commit 41390895e5
("netfilter: ipvs: don't check for presence of nat extension")
from ipvs-next had changed the same line to

  if (ct && !nf_ct_is_untracked(ct) && (ct->status & IPS_NAT_MASK)) {

When ipvs-next got merged into nf-next, the merge resolution took
the first version, dropping the conversion of nfct_nat().

While this doesn't cause a problem at the moment, it will once we stop
adding the nat extension by default.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal 01026edef9 nefilter: eache: reduce struct size from 32 to 24 byte
Only "cache" needs to use ulong (its used with set_bit()), missed can use
u16.  Also add build-time assertion to ensure event bits fit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal c6dd940b1f netfilter: allow early drop of assured conntracks
If insertion of a new conntrack fails because the table is full, the kernel
searches the next buckets of the hash slot where the new connection
was supposed to be inserted at for an entry that hasn't seen traffic
in reply direction (non-assured), if it finds one, that entry is
is dropped and the new connection entry is allocated.

Allow the conntrack gc worker to also remove *assured* conntracks if
resources are low.

Do this by querying the l4 tracker, e.g. tcp connections are now dropped
if they are no longer established (e.g. in finwait).

This could be refined further, e.g. by adding 'soft' established timeout
(i.e., a timeout that is only used once we get close to resource
exhaustion).

Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal b3a5db109e netfilter: conntrack: use u8 for extension sizes again
commit 223b02d923
("netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len")
had to increase size of the extension offsets because total size of the
extensions had increased to a point where u8 did overflow.

3 years later we've managed to diet extensions a bit and we no longer
need u16.  Furthermore we can now add a compile-time assertion for this
problem.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal faec865db9 netfilter: remove last traces of variable-sized extensions
get rid of the (now unused) nf_ct_ext_add_length define and also
rename the function to plain nf_ct_ext_add().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal 9f0f3ebeda netfilter: helpers: remove data_len usage for inkernel helpers
No need to track this for inkernel helpers anymore as
NF_CT_HELPER_BUILD_BUG_ON checks do this now.

All inkernel helpers know what kind of structure they
stored in helper->data.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal 157ffffeb5 netfilter: nfnetlink_cthelper: reject too large userspace allocation requests
Userspace should not abuse the kernel to store large amounts of data,
reject requests larger than the private area can accommodate.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal dcf67740f2 netfilter: helper: add build-time asserts for helper data size
add a 32 byte scratch area in the helper struct instead of relying
on variable sized helpers plus compile-time asserts to let us know
if 32 bytes aren't enough anymore.

Not having variable sized helpers will later allow to add BUILD_BUG_ON
for the total size of conntrack extensions -- the helper extension is
the only one that doesn't have a fixed size.

The (useless!) NF_CT_HELPER_BUILD_BUG_ON(0); are added so that in case
someone adds a new helper and copy-pastes from one that doesn't store
private data at least some indication that this macro should be used
somehow is there...

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:16 +02:00
Florian Westphal 694a0055f0 netfilter: nft_ct: allow to set ctnetlink event types of a connection
By default the kernel emits all ctnetlink events for a connection.
This allows to select the types of events to generate.

This can be used to e.g. only send DESTROY events but no NEW/UPDATE ones
and will work even if sysctl net.netfilter.nf_conntrack_events is set to 0.

This was already possible via iptables' CT target, but the nft version has
the advantage that it can also be used with already-established conntracks.

The added nf_ct_is_template() check isn't a bug fix as we only support
mark and labels (and unlike ecache the conntrack core doesn't copy those).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:16 +02:00
Ilan Tayari 8f92e03ecc esp4/6: Fix GSO path for non-GSO SW-crypto packets
If esp*_offload module is loaded, outbound packets take the
GSO code path, being encapsulated at layer 3, but encrypted
in layer 2. validate_xmit_xfrm calls esp*_xmit for that.

esp*_xmit was wrongfully detecting these packets as going
through hardware crypto offload, while in fact they should
be encrypted in software, causing plaintext leakage to
the network, and also dropping at the receiver side.

Perform the encryption in esp*_xmit, if the SA doesn't have
a hardware offload_handle.

Also, align esp6 code to esp4 logic.

Fixes: fca11ebde3 ("esp4: Reorganize esp_output")
Fixes: 383d0350f2 ("esp6: Reorganize esp_output")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-19 07:48:57 +02:00
Colin Ian King ffa6f571e4 esp6: fix incorrect null pointer check on xo
The check for xo being null is incorrect, currently it is checking
for non-null, it should be checking for null.

Detected with CoverityScan, CID#1429349 ("Dereference after null check")

Fixes: 7862b4058b ("esp: Add gso handlers for esp4 and esp6")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-19 06:49:00 +02:00
Paul E. McKenney 5f0d5a3ae7 mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU
A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section.  Of course, that is not the
case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety".  This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
[ paulmck: Add comments mentioning the old name, as requested by Eric
  Dumazet, in order to help people familiar with the old name find
  the new one. ]
Acked-by: David Rientjes <rientjes@google.com>
2017-04-18 11:42:36 -07:00
Xin Long 6c80138773 sctp: process duplicated strreset asoc request correctly
This patch is to fix the replay attack issue for strreset asoc requests.

When a duplicated strreset asoc request is received, reply it with bad
seqno if it's seqno < asoc->strreset_inseq - 2, and reply it with the
result saved in asoc if it's seqno >= asoc->strreset_inseq - 2.

But note that if the result saved in asoc is performed, the sender's next
tsn and receiver's next tsn for the response chunk should be set. It's
safe to get them from asoc. Because if it's changed, which means the peer
has received the response already, the new response with wrong tsn won't
be accepted by peer.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-18 13:39:50 -04:00
Xin Long d0f025e611 sctp: process duplicated strreset in and addstrm in requests correctly
This patch is to fix the replay attack issue for strreset and addstrm in
requests.

When a duplicated strreset in or addstrm in request is received, reply it
with bad seqno if it's seqno < asoc->strreset_inseq - 2, and reply it with
the result saved in asoc if it's seqno >= asoc->strreset_inseq - 2.

For strreset in or addstrm in request, if the receiver side processes it
successfully, a strreset out or addstrm out request(as a response for that
request) will be sent back to peer. reconf_time will retransmit the out
request even if it's lost.

So when receiving a duplicated strreset in or addstrm in request and it's
result was performed, it shouldn't reply this request, but drop it instead.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-18 13:39:50 -04:00
Xin Long e4dc99c7c2 sctp: process duplicated strreset out and addstrm out requests correctly
Now sctp stream reconf will process a request again even if it's seqno is
less than asoc->strreset_inseq.

If one request has been done successfully and some data chunks have been
accepted and then a duplicated strreset out request comes, the streamin's
ssn will be cleared. It will cause that stream will never receive chunks
any more because of unsynchronized ssn. It allows a replay attack.

A similar issue also exists when processing addstrm out requests. It will
cause more extra streams being added.

This patch is to fix it by saving the last 2 results into asoc. When a
duplicated strreset out or addstrm out request is received, reply it with
bad seqno if it's seqno < asoc->strreset_inseq - 2, and reply it with the
result saved in asoc if it's seqno >= asoc->strreset_inseq - 2.

Note that it saves last 2 results instead of only last 1 result, because
two requests can be sent together in one chunk.

And note that when receiving a duplicated request, the receiver side will
still reply it even if the peer has received the response. It's safe, As
the response will be dropped by the peer.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-18 13:39:50 -04:00
Matthias Kaehlcke bbf67e450a nl80211: Fix enum type of variable in nl80211_put_sta_rate()
rate_flg is of type 'enum nl80211_attrs', however it is assigned with
'enum nl80211_rate_info' values. Change the type of rate_flg accordingly.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-18 11:03:03 +02:00
Matthias Kaehlcke a4ac6f2e53 mac80211: ibss: Fix channel type enum in ieee80211_sta_join_ibss()
cfg80211_chandef_create() expects an 'enum nl80211_channel_type' as
channel type however in ieee80211_sta_join_ibss()
NL80211_CHAN_WIDTH_20_NOHT is passed in two occasions, which is of
the enum type 'nl80211_chan_width'. Change the value to NL80211_CHAN_NO_HT
(20 MHz, non-HT channel) of the channel type enum.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-18 11:02:35 +02:00
Matthias Kaehlcke aa1702dd16 cfg80211: Fix array-bounds warning in fragment copy
__ieee80211_amsdu_copy_frag intentionally initializes a pointer to
array[-1] to increment it later to valid values. clang rightfully
generates an array-bounds warning on the initialization statement.

Initialize the pointer to array[0] and change the algorithm from
increment before to increment after consume.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-18 11:02:18 +02:00
Johannes Berg f64331d580 mac80211: keep a separate list of monitor interfaces that are up
In addition to keeping monitor interfaces on the regular list of
interfaces, keep those that are up and not in cooked mode on a
separate list. This saves having to iterate all interfaces when
delivering to monitor interfaces.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-18 11:00:13 +02:00
Arend Van Spriel 96b08fd608 nl80211: add request id in scheduled scan event messages
For multi-scheduled scan support in subsequent patch a request id
will be added. This patch add this request id to the scheduled
scan event messages. For now the request id will always be zero.
With multi-scheduled scan its value will inform user-space to which
scan the event relates.

Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Franky Lin <franky.lin@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-18 10:23:50 +02:00
Herbert Xu 096f41d3a8 af_key: Fix sadb_x_ipsecrequest parsing
The parsing of sadb_x_ipsecrequest is broken in a number of ways.
First of all we're not verifying sadb_x_ipsecrequest_len.  This
is needed when the structure carries addresses at the end.  Worse
we don't even look at the length when we parse those optional
addresses.

The migration code had similar parsing code that's better but
it also has some deficiencies.  The length is overcounted first
of all as it includes the header itself.  It also fails to check
the length before dereferencing the sa_family field.

This patch fixes those problems in parse_sockaddr_pair and then
uses it in parse_ipsecrequest.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-18 08:26:03 +02:00
David Ahern c21ef3e343 net: rtnetlink: plumb extended ack to doit function
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.

This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:35:38 -04:00
David Lebrun af3b5158b8 ipv6: sr: fix BUG due to headroom too small after SRH push
When a locally generated packet receives an SRH with two or more segments,
the remaining headroom is too small to push an ethernet header. This patch
ensures that the headroom is large enough after SRH push.

The BUG generated the following trace.

[  192.950285] skbuff: skb_under_panic: text:ffffffff81809675 len:198 put:14 head:ffff88006f306400 data:ffff88006f3063fa tail:0xc0 end:0x2c0 dev:A-1
[  192.952456] ------------[ cut here ]------------
[  192.953218] kernel BUG at net/core/skbuff.c:105!
[  192.953411] invalid opcode: 0000 [#1] PREEMPT SMP
[  192.953411] Modules linked in:
[  192.953411] CPU: 5 PID: 3433 Comm: ping6 Not tainted 4.11.0-rc3+ #237
[  192.953411] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
[  192.953411] task: ffff88007c2d42c0 task.stack: ffffc90000ef4000
[  192.953411] RIP: 0010:skb_panic+0x61/0x70
[  192.953411] RSP: 0018:ffffc90000ef7900 EFLAGS: 00010286
[  192.953411] RAX: 0000000000000085 RBX: 00000000000086dd RCX: 0000000000000201
[  192.953411] RDX: 0000000080000201 RSI: ffffffff81d104c5 RDI: 00000000ffffffff
[  192.953411] RBP: ffffc90000ef7920 R08: 0000000000000001 R09: 0000000000000000
[  192.953411] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  192.953411] R13: ffff88007c5a4000 R14: ffff88007b363d80 R15: 00000000000000b8
[  192.953411] FS:  00007f94b558b700(0000) GS:ffff88007fd40000(0000) knlGS:0000000000000000
[  192.953411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  192.953411] CR2: 00007fff5ecd5080 CR3: 0000000074141000 CR4: 00000000001406e0
[  192.953411] Call Trace:
[  192.953411]  skb_push+0x3b/0x40
[  192.953411]  eth_header+0x25/0xc0
[  192.953411]  neigh_resolve_output+0x168/0x230
[  192.953411]  ? ip6_finish_output2+0x242/0x8f0
[  192.953411]  ip6_finish_output2+0x242/0x8f0
[  192.953411]  ? ip6_finish_output2+0x76/0x8f0
[  192.953411]  ip6_finish_output+0xa8/0x1d0
[  192.953411]  ip6_output+0x64/0x2d0
[  192.953411]  ? ip6_output+0x73/0x2d0
[  192.953411]  ? ip6_dst_check+0xb5/0xc0
[  192.953411]  ? dst_cache_per_cpu_get.isra.2+0x40/0x80
[  192.953411]  seg6_output+0xb0/0x220
[  192.953411]  lwtunnel_output+0xcf/0x210
[  192.953411]  ? lwtunnel_output+0x59/0x210
[  192.953411]  ip6_local_out+0x38/0x70
[  192.953411]  ip6_send_skb+0x2a/0xb0
[  192.953411]  ip6_push_pending_frames+0x48/0x50
[  192.953411]  rawv6_sendmsg+0xa39/0xf10
[  192.953411]  ? __lock_acquire+0x489/0x890
[  192.953411]  ? __mutex_lock+0x1fc/0x970
[  192.953411]  ? __lock_acquire+0x489/0x890
[  192.953411]  ? __mutex_lock+0x1fc/0x970
[  192.953411]  ? tty_ioctl+0x283/0xec0
[  192.953411]  inet_sendmsg+0x45/0x1d0
[  192.953411]  ? _copy_from_user+0x54/0x80
[  192.953411]  sock_sendmsg+0x33/0x40
[  192.953411]  SYSC_sendto+0xef/0x170
[  192.953411]  ? entry_SYSCALL_64_fastpath+0x5/0xc2
[  192.953411]  ? trace_hardirqs_on_caller+0x12b/0x1b0
[  192.953411]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  192.953411]  SyS_sendto+0x9/0x10
[  192.953411]  entry_SYSCALL_64_fastpath+0x1f/0xc2
[  192.953411] RIP: 0033:0x7f94b453db33
[  192.953411] RSP: 002b:00007fff5ecd0578 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  192.953411] RAX: ffffffffffffffda RBX: 00007fff5ecd16e0 RCX: 00007f94b453db33
[  192.953411] RDX: 0000000000000040 RSI: 000055a78352e9c0 RDI: 0000000000000003
[  192.953411] RBP: 00007fff5ecd1690 R08: 000055a78352c940 R09: 000000000000001c
[  192.953411] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a783321e10
[  192.953411] R13: 000055a7839890c0 R14: 0000000000000004 R15: 0000000000000000
[  192.953411] Code: 00 00 48 89 44 24 10 8b 87 c4 00 00 00 48 89 44 24 08 48 8b 87 d8 00 00 00 48 c7 c7 90 58 d2 81 48 89 04 24 31 c0 e8 4f 70 9a ff <0f> 0b 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 8b 97 d8 00 00
[  192.953411] RIP: skb_panic+0x61/0x70 RSP: ffffc90000ef7900
[  193.000186] ---[ end trace bd0b89fabdf2f92c ]---
[  193.000951] Kernel panic - not syncing: Fatal exception in interrupt
[  193.001137] Kernel Offset: disabled
[  193.001169] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Fixes: 19d5a26f5e ("ipv6: sr: expand skb head only if necessary")
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:33:53 -04:00
Ilan Tayari 7a7a9bd7ac gso: Validate assumption of frag_list segementation
Commit 07b26c9454 ("gso: Support partial splitting at the frag_list
pointer") assumes that all SKBs in a frag_list (except maybe the last
one) contain the same amount of GSO payload.

This assumption is not always correct, resulting in the following
warning message in the log:
    skb_segment: too many frags

For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
one frag, and some with 2 frags.
After GRO, the frag_list SKBs end up having different amounts of payload.
If this frag_list SKB is then forwarded, the aforementioned assumption
is violated.

Validate the assumption, and fall back to software GSO if it not true.

Fixes: 07b26c9454 ("gso: Support partial splitting at the frag_list pointer")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:31:29 -04:00
Xin Long edb12f2d72 sctp: get list_of_streams of strreset outreq earlier
Now when processing strreset out responses, it gets outreq->list_of_streams
only when result is performed. But if result is not performed, str_p will
be NULL. It will cause panic in sctp_ulpevent_make_stream_reset_event if
nums is not 0.

This patch is to fix it by getting outreq->list_of_streams earlier, and
also to improve some codes for the strreset inreq process.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:25:35 -04:00
Chenbo Feng 9fd0f31563 Add uid and cookie bpf helper to cg_skb_func_proto
BPF helper functions get_socket_cookie and get_socket_uid can be
used for network traffic classifications, among others. Expose
them also to programs of type BPF_PROG_TYPE_CGROUP_SKB. As of
commit 8f917bba00 ("bpf: pass sk to helper functions") the
required skb->sk function is available at both cgroup bpf ingress
and egress hooks. With these two new helper, cg_skb_func_proto is
effectively the same as sk_filter_func_proto.

Change since V1:
Instead of add the helper to cg_skb_func_proto, redirect the
cg_skb_func_proto to sk_filter_func_proto since all helper function
in sk_filter_func_proto are applicable to cg_skb_func_proto now.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:22:00 -04:00
Florian Westphal 0aa8c13eb5 ipv6: drop non loopback packets claiming to originate from ::1
We lack a saddr check for ::1. This causes security issues e.g. with acls
permitting connections from ::1 because of assumption that these originate
from local machine.

Assuming a source address of ::1 is local seems reasonable.
RFC4291 doesn't allow such a source address either, so drop such packets.

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:09:23 -04:00
David S. Miller 450cc8cce2 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2017-04-14

Here's the main batch of Bluetooth & 802.15.4 patches for the 4.12
kernel.

 - Many fixes to 6LoWPAN, in particular for BLE
 - New CA8210 IEEE 802.15.4 device driver (accounting for most of the
   lines of code added in this pull request)
 - Added Nokia Bluetooth (UART) HCI driver
 - Some serdev & TTY changes that are dependencies for the Nokia
   driver (with acks from relevant maintainers and an agreement that
   these come through the bluetooth tree)
 - Support for new Intel Bluetooth device
 - Various other minor cleanups/fixes here and there

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:00:57 -04:00
Al Viro 71d6ad0837 p9_client_readdir() fix
Don't assume that server is sane and won't return more data than
asked for.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 14:23:20 -04:00
Nikolay Aleksandrov cab93af0ed net: bridge: notify on hw fdb takeover
Recently we added support for SW fdbs to take over HW ones, but that
results in changing a user-visible fdb flag thus we need to send a
notification, also it's consistent with how HW takes over SW entries.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:45:34 -04:00
WANG Cong f5001ceab8 kcm: remove a useless copy_from_user()
struct kcm_clone only contains fd, and kcm_clone() only
writes this struct, so there is no need to copy it from user.

Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:28:48 -04:00
stephen hemminger 8ea3e43911 Subject: net: allow configuring default qdisc
Since 3.12 it has been possible to configure the default queuing
discipline via sysctl. This patch adds ability to configure the
default queue discipline in kernel configuration. This is useful for
environments where configuring the value from userspace is difficult
to manage.

The default is still the same as before (pfifo_fast) and it is
possible to change after kernel init with sysctl. This is similar
to how TCP congestion control works.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:23:06 -04:00
R. Parameswaran 57240d0078 l2tp: device MTU setup, tunnel socket needs a lock
The MTU overhead calculation in L2TP device set-up
merged via commit b784e7ebfc
needs to be adjusted to lock the tunnel socket while
referencing the sub-data structures to derive the
socket's IP overhead.

Reported-by: Guillaume Nault <g.nault@alphalink.fr>
Tested-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: R. Parameswaran <rparames@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:01:48 -04:00
Willem de Bruijn 1862d6208d net-timestamp: avoid use-after-free in ip_recv_error
Syzkaller reported a use-after-free in ip_recv_error at line

    info->ipi_ifindex = skb->dev->ifindex;

This function is called on dequeue from the error queue, at which
point the device pointer may no longer be valid.

Save ifindex on enqueue in __skb_complete_tx_timestamp, when the
pointer is valid or NULL. Store it in temporary storage skb->cb.

It is safe to reference skb->dev here, as called from device drivers
or dev_queue_xmit. The exception is when called from tcp_ack_tstamp;
in that case it is NULL and ifindex is set to 0 (invalid).

Do not return a pktinfo cmsg if ifindex is 0. This maintains the
current behavior of not returning a cmsg if skb->dev was NULL.

On dequeue, the ipv4 path will cast from sock_exterr_skb to
in_pktinfo. Both have ifindex as their first element, so no explicit
conversion is needed. This is by design, introduced in commit
0b922b7a82 ("net: original ingress device index in PKTINFO"). For
ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo.

Fixes: 829ae9d611 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 12:59:22 -04:00
WANG Cong 1215e51eda ipv4: fix a deadlock in ip_ra_control
Similar to commit 87e9f03159
("ipv4: fix a potential deadlock in mcast getsockopt() path"),
there is a deadlock scenario for IP_ROUTER_ALERT too:

       CPU0                    CPU1
       ----                    ----
  lock(rtnl_mutex);
                               lock(sk_lock-AF_INET);
                               lock(rtnl_mutex);
  lock(sk_lock-AF_INET);

Fix this by always locking RTNL first on all setsockopt() paths.

Note, after this patch ip_ra_lock is no longer needed either.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 12:46:50 -04:00
David Ahern 4a6e3c5def net: ipv6: send unsolicited NA on admin up
ndisc_notify is the ipv6 equivalent to arp_notify. When arp_notify is
set to 1, gratuitous arp requests are sent when the device is brought up.
The same is expected when ndisc_notify is set to 1 (per ndisc_notify in
Documentation/networking/ip-sysctl.txt). The NA is not sent on NETDEV_UP
event; add it.

Fixes: 5cb04436ee ("ipv6: add knob to send unsolicited ND on link-layer address change")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 12:44:55 -04:00
Vivien Didelot a6a71f19fe net: dsa: isolate legacy code
This patch moves as is the legacy DSA code from dsa.c to legacy.c,
except the few shared symbols which remain in dsa.c.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 11:03:17 -04:00
Eric W. Biederman b54807fa52 sysctl: Remove dead register_sysctl_root
The function no longer does anything.  The is only a single caller of
register_sysctl_root when semantically there should be two.  Remove
this function so that if someone decides this functionality is needed
again it will be obvious all of the callers of setup_sysctl_set need
to be audited and modified appropriately.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-04-16 23:42:49 -05:00
David S. Miller 6b6cbc1471 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simply overlapping changes.  In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.

In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-15 21:16:30 -04:00
Florian Westphal ab8bc7ed86 netfilter: remove nf_ct_is_untracked
This function is now obsolete and always returns false.
This change has no effect on generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15 11:51:33 +02:00
Florian Westphal cc41c84b7e netfilter: kill the fake untracked conntrack objects
resurrect an old patch from Pablo Neira to remove the untracked objects.

Currently, there are four possible states of an skb wrt. conntrack.

1. No conntrack attached, ct is NULL.
2. Normal (kmem cache allocated) ct attached.
3. a template (kmalloc'd), not in any hash tables at any point in time
4. the 'untracked' conntrack, a percpu nf_conn object, tagged via
   IPS_UNTRACKED_BIT in ct->status.

Untracked is supposed to be identical to case 1.  It exists only
so users can check

-m conntrack --ctstate UNTRACKED vs.
-m conntrack --ctstate INVALID

e.g. attempts to set connmark on INVALID or UNTRACKED conntracks is
supposed to be a no-op.

Thus currently we need to check
 ct == NULL || nf_ct_is_untracked(ct)

in a lot of places in order to avoid altering untracked objects.

The other consequence of the percpu untracked object is that all
-j NOTRACK (and, later, kfree_skb of such skbs) result in an atomic op
(inc/dec the untracked conntracks refcount).

This adds a new kernel-private ctinfo state, IP_CT_UNTRACKED, to
make the distinction instead.

The (few) places that care about packet invalid (ct is NULL) vs.
packet untracked now need to test ct == NULL vs. ctinfo == IP_CT_UNTRACKED,
but all other places can omit the nf_ct_is_untracked() check.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15 11:47:57 +02:00
Gao Feng 6e354a5e56 netfilter: ecache: Refine the nf_ct_deliver_cached_events
1. Remove single !events condition check to deliver the missed event
even though there is no new event happened.

Consider this case:
1) nf_ct_deliver_cached_events is invoked at the first time, the
event is failed to deliver, then the missed is set.
2) nf_ct_deliver_cached_events is invoked again, but there is no
any new event happened.
The missed event is lost really.

It would try to send the missed event again after remove this check.
And it is ok if there is no missed event because the latter check
!((events | missed) & e->ctmask) could avoid it.

2. Correct the return value check of notify->fcn.
When send the event successfully, it returns 0, not postive value.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15 11:43:49 +02:00
Gao Feng 7025bac47f netfilter: nf_nat: Fix return NF_DROP in nfnetlink_parse_nat_setup
The __nf_nat_alloc_null_binding invokes nf_nat_setup_info which may
return NF_DROP when memory is exhausted, so convert NF_DROP to -ENOMEM
to make ctnetlink happy. Or ctnetlink_setup_nat treats it as a success
when one error NF_DROP happens actully.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15 11:04:14 +02:00
Pablo Neira Ayuso a702ece3b1 Merge tag 'ipvs2-for-v4.12' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Second Round of IPVS Updates for v4.12

please consider these clean-ups and enhancements to IPVS for v4.12.

* Removal unused variable
* Use kzalloc where appropriate
* More efficient detection of presence of NAT extension
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Conflicts:
	net/netfilter/ipvs/ip_vs_ftp.c
2017-04-15 10:54:40 +02:00
Aaron Conole db268d4dfd ipset: remove unused function __ip_set_get_netlink
There are no in-tree callers.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15 10:24:41 +02:00
Dan Carpenter a88086e098 net: off by one in inet6_pton()
If "scope_len" is sizeof(scope_id) then we would put the NUL terminator
one space beyond the end of the buffer.

Fixes: b1a951fe46 ("net/utils: generic inet_pton_with_scope helper")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14 14:08:54 -06:00
David S. Miller f4c13c8ec5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Missing TCP header sanity check in TCPMSS target, from Eric Dumazet.

2) Incorrect event message type for related conntracks created via
   ctnetlink, from Liping Zhang.

3) Fix incorrect rcu locking when handling helpers from ctnetlink,
   from Gao feng.

4) Fix missing rcu locking when updating helper, from Liping Zhang.

5) Fix missing read_lock_bh when iterating over list of device addresses
   from TPROXY and redirect, also from Liping.

6) Fix crash when trying to dump expectations from conntrack with no
   helper via ctnetlink, from Liping.

7) Missing RCU protection to expecation list update given ctnetlink
   iterates over the list under rcu read lock side, from Liping too.

8) Don't dump autogenerated seed in nft_hash to userspace, this is
   very confusing to the user, again from Liping.

9) Fix wrong conntrack netns module refcount in ipt_CLUSTERIP,
   from Gao feng.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-14 10:47:13 -04:00
Steffen Klassert bcd1f8a45e xfrm: Prepare the GRO codepath for hardware offloading.
On IPsec hardware offloading, we already get a secpath with
valid state attached when the packet enters the GRO handlers.
So check for hardware offload and skip the state lookup in this
case.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:07:49 +02:00
Ilan Tayari f1bd7d659e xfrm: Add encapsulation header offsets while SKB is not encrypted
Both esp4 and esp6 used to assume that the SKB payload is encrypted
and therefore the inner_network and inner_transport offsets are
not relevant.
When doing crypto offload in the NIC, this is no longer the case
and the NIC driver needs these offsets so it can do TX TCP checksum
offloading.
This patch sets the inner_network and inner_transport members of
the SKB, as well as encapsulation, to reflect the actual positions
of these headers, and removes them only once encryption is done
on the payload.

Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:07:39 +02:00
Steffen Klassert f6e27114a6 net: Add a xfrm validate function to validate_xmit_skb
When we do IPsec offloading, we need a fallback for
packets that were targeted to be IPsec offloaded but
rerouted to a device that does not support IPsec offload.
For that we add a function that checks the offloading
features of the sending device and and flags the
requirement of a fallback before it calls the IPsec
output function. The IPsec output function adds the IPsec
trailer and does encryption if needed.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:07:28 +02:00
Steffen Klassert b3859c8ebf esp: Use a synchronous crypto algorithm on offloading.
We need a fallback algorithm for crypto offloading to a NIC.
This is because packets can be rerouted to other NICs that
don't support crypto offloading. The fallback is going to be
implemented at layer2 where we know the final output device
but can't handle asynchronous returns fron the crypto layer.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:07:19 +02:00
Steffen Klassert d7dbefc45c xfrm: Add xfrm_replay_overflow functions for offloading
This patch adds functions that handles IPsec sequence
numbers for GSO segments and TSO offloading. We need
to calculate and update the sequence numbers based
on the segments that GSO/TSO will generate. We need
this to keep software and hardware sequence number
counter in sync.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:07:01 +02:00
Steffen Klassert 7862b4058b esp: Add gso handlers for esp4 and esp6
This patch extends the xfrm_type by an encap function pointer
and implements esp4_gso_encap and esp6_gso_encap. These functions
doing the basic esp encapsulation for a GSO packet. In case the
GSO packet needs to be segmented in software, we add gso_segment
functions. This codepath is going to be used on esp hardware
offloads.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:06:50 +02:00
Steffen Klassert 383d0350f2 esp6: Reorganize esp_output
We need a fallback for ESP at layer 2, so split esp6_output
into generic functions that can be used at layer 3 and layer 2
and use them in esp_output. We also add esp6_xmit which is
used for the layer 2 fallback.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:06:42 +02:00
Steffen Klassert fca11ebde3 esp4: Reorganize esp_output
We need a fallback for ESP at layer 2, so split esp_output
into generic functions that can be used at layer 3 and layer 2
and use them in esp_output. We also add esp_xmit which is
used for the layer 2 fallback.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:06:33 +02:00
Steffen Klassert f1fbed0e89 esp6: Remame esp_input_done2
We are going to export the ipv4 and the ipv6
version of esp_input_done2. They are not static
anymore and can't have the same name. So rename
the ipv6 version to esp6_input_done2.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:06:21 +02:00
Steffen Klassert d77e38e612 xfrm: Add an IPsec hardware offloading API
This patch adds all the bits that are needed to do
IPsec hardware offload for IPsec states and ESP packets.
We add xfrmdev_ops to the net_device. xfrmdev_ops has
function pointers that are needed to manage the xfrm
states in the hardware and to do a per packet
offloading decision.

Joint work with:
Ilan Tayari <ilant@mellanox.com>
Guy Shapiro <guysh@mellanox.com>
Yossi Kuperman <yossiku@mellanox.com>

Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:06:10 +02:00
Steffen Klassert c35fe4106b xfrm: Add mode handlers for IPsec on layer 2
This patch adds a gso_segment and xmit callback for the
xfrm_mode and implement these functions for tunnel and
transport mode.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:06:01 +02:00
Steffen Klassert 21f42cc95f xfrm: Move device notifications to a sepatate file
This is needed for the upcomming IPsec device offloading.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:05:53 +02:00
Steffen Klassert 9d389d7f84 xfrm: Add a xfrm type offload.
We add a struct  xfrm_type_offload so that we have the offloaded
codepath separated to the non offloaded codepath. With this the
non offloade and the offloaded codepath can coexist.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:05:44 +02:00
Steffen Klassert c7ef8f0c02 net: Add ESP offload features
This patch adds netdev features to configure IPsec offloads.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-14 10:05:36 +02:00
Aaron Conole 809c2d9a3b netfilter: nf_conntrack: remove double assignment
The protonet pointer will unconditionally be rewritten, so just do the
needed assignment first.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-14 01:54:23 +02:00
Aaron Conole 7925056827 netfilter: nf_tables: remove double return statement
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-14 01:54:19 +02:00
Gao Feng fe50543c19 netfilter: ipt_CLUSTERIP: Fix wrong conntrack netns refcnt usage
Current codes invoke wrongly nf_ct_netns_get in the destroy routine,
it should use nf_ct_netns_put, not nf_ct_netns_get.
It could cause some modules could not be unloaded.

Fixes: ecb2421b5d ("netfilter: add and use nf_ct_netns_get/put")
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-13 23:21:40 +02:00
Liping Zhang 79e09ef96b netfilter: nft_hash: do not dump the auto generated seed
This can prevent the nft utility from printing out the auto generated
seed to the user, which is unnecessary and confusing.

Fixes: cb1b69b0b1 ("netfilter: nf_tables: add hash expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-13 23:20:13 +02:00
Taehee Yoo 5389023421 netfilter: nat: remove rcu_read_lock in __nf_nat_decode_session.
__nf_nat_decode_session is called from nf_nat_decode_session as decodefn.
before calling decodefn, it already set rcu_read_lock. so rcu_read_lock in
__nf_nat_decode_session can be removed.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-13 22:48:03 +02:00
Johannes Berg fe52145f91 netlink: pass extended ACK struct where available
This is an add-on to the previous patch that passes the extended ACK
structure where it's already available by existing genl_info or extack
function arguments.

This was done with this spatch (with some manual adjustment of
indentation):

@@
expression A, B, C, D, E;
identifier fn, info;
@@
fn(..., struct genl_info *info, ...) {
...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, info->extack)
...
}

@@
expression A, B, C, D, E;
identifier fn, info;
@@
fn(..., struct genl_info *info, ...) {
<...
-nla_parse_nested(A, B, C, D, NULL)
+nla_parse_nested(A, B, C, D, info->extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_parse(A, B, C, D, E, NULL)
+nla_parse(A, B, C, D, E, extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, extack)
...
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_parse_nested(A, B, C, D, NULL)
+nla_parse_nested(A, B, C, D, extack)
...>
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nlmsg_validate(A, B, C, D, NULL)
+nlmsg_validate(A, B, C, D, extack)
...>
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_validate(A, B, C, D, NULL)
+nla_validate(A, B, C, D, extack)
...>
}

@@
expression A, B, C;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_validate_nested(A, B, C, NULL)
+nla_validate_nested(A, B, C, extack)
...>
}

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
Johannes Berg fceb6435e8 netlink: pass extended ACK struct to parsing functions
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
Johannes Berg ba0dc5f6e0 netlink: allow sending extended ACK with cookie on success
Now that we have extended error reporting and a new message format for
netlink ACK messages, also extend this to be able to return arbitrary
cookie data on success.

This will allow, for example, nl80211 to not send an extra message for
cookies identifying newly created objects, but return those directly
in the ACK message.

The cookie data size is currently limited to 20 bytes (since Jamal
talked about using SHA1 for identifiers.)

Thanks to Jamal Hadi Salim for bringing up this idea during the
discussions.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:21 -04:00
Johannes Berg 7ab606d160 genetlink: pass extended ACK report down
Pass the extended ACK reporting struct down from generic netlink to
the families, using the existing struct genl_info for simplicity.

Also add support to set the extended ACK information from generic
netlink users.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:21 -04:00
Johannes Berg 2d4bc93368 netlink: extended ACK reporting
Add the base infrastructure and UAPI for netlink extended ACK
reporting. All "manual" calls to netlink_ack() pass NULL for now and
thus don't get extended ACK reporting.

Big thanks goes to Pablo Neira Ayuso for not only bringing up the
whole topic at netconf (again) but also coming up with the nlattr
passing trick and various other ideas.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:20 -04:00
Gao Feng 7ed14d973f net: ipv4: Refine the ipv4_default_advmss
1. Don't get the metric RTAX_ADVMSS of dst.
There are two reasons.
1) Its caller dst_metric_advmss has already invoke dst_metric_advmss
before invoke default_advmss.
2) The ipv4_default_advmss is used to get the default mss, it should
not try to get the metric like ip6_default_advmss.

2. Use sizeof(tcphdr)+sizeof(iphdr) instead of literal 40.

3. Define one new macro IPV4_MAX_PMTU instead of 65535 according to
RFC 2675, section 5.1.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:19:48 -04:00
David Ahern 27b3b551d8 rtnetlink: Do not generate notifications for NETDEV_CHANGE_TX_QUEUE_LEN event
Changing tx queue length generates identical messages:

[LINK]22: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 02:04:f4:b7:5c:d2 brd ff:ff:ff:ff:ff:ff promiscuity 0
    dummy numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
[LINK]22: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 02:04:f4:b7:5c:d2 brd ff:ff:ff:ff:ff:ff promiscuity 0
    dummy numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

Remove NETDEV_CHANGE_TX_QUEUE_LEN from the list of notifiers that generate
notifications.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:16:34 -04:00
David Ahern b6b36eb23a rtnetlink: Do not generate notifications for NETDEV_CHANGEUPPER event
NETDEV_CHANGEUPPER is an internal event; do not generate userspace
notifications.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:16:34 -04:00
David Ahern aed0735909 rtnetlink: Do not generate notifications for CHANGELOWERSTATE event
CHANGELOWERSTATE is an internal event; do not generate userspace
notifications.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:16:33 -04:00
David Ahern bf2c2984d3 rtnetlink: Do not generate notifications for PRECHANGEUPPER event
PRECHANGEUPPER is an internal event; do not generate userspace
notifications.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:16:33 -04:00
David Ahern aef091ae58 rtnetlink: Do not generate notifications for POST_TYPE_CHANGE event
Changing the master device for a link generates many messages; the one
generated for POST_TYPE_CHANGE is redundant:

[LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master br1 state UNKNOWN group default
    link/ether 02:02:02:02:02:03 brd ff:ff:ff:ff:ff:ff
[LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master br1 state UNKNOWN group default
    link/ether 02:02:02:02:02:03 brd ff:ff:ff:ff:ff:ff

Remove POST_TYPE_CHANGE from the list of notifiers that generate
notifications.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:16:33 -04:00
David Ahern cd8966e75e rtnetlink: Do not generate notifications for CHANGEADDR event
Changing hardware address generates redundant messages:

[LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 02:02:02:02:02:02 brd ff:ff:ff:ff:ff:ff

[LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 02:02:02:02:02:02 brd ff:ff:ff:ff:ff:ff

Do not send a notification for the CHANGEADDR notifier.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:16:33 -04:00
David Ahern 46ede612c7 rtnetlink: Do not generate notification for UDP_TUNNEL_PUSH_INFO
NETDEV_UDP_TUNNEL_PUSH_INFO is an internal notifier; nothing userspace
can do so don't generate a netlink notification.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:16:33 -04:00
David Ahern 085e1a65f0 rtnetlink: Do not generate notifications for MTU events
Changing MTU on a link currently causes 3 messages to be sent to userspace:

[LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1490 qdisc noqueue state UNKNOWN group default
    link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff

[LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff

[LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff

Remove the messages sent for PRE_CHANGE_MTU and CHANGE_MTU netdev events.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:16:32 -04:00
Johannes Berg 9e478066ea mac80211: fix MU-MIMO follow-MAC mode
There are two bugs in the follow-MAC code:
 * it treats the radiotap header as the 802.11 header
   (therefore it can't possibly work)
 * it doesn't verify that the skb data it accesses is actually
   present in the header, which is mitigated by the first point

Fix this by moving all of this out into a separate function.
This function copies the data it needs using skb_copy_bits()
to make sure it can be accessed if it's paged, and offsets
that by the possibly present vendor radiotap header.

This also makes all those conditions more readable.

Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-13 14:35:53 +02:00
Johannes Berg 65f1d6007e mac80211: use common code for monitor options in add/change
Refactor the code to have common code for changing monitor
options when adding and changing virtual interfaces. This
will make it easier to add BPF filters to both paths. Note
that this code carefully checks the error conditions first
and only then applies the changes, to guarantee atomicity.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-13 13:41:39 +02:00
Johannes Berg 1db77596e4 cfg80211: refactor nl80211 monitor option parsing
Refactor the parsing of monitor flags and the MU-MIMO options.
This will allow adding more things cleanly in the future and
also allows setting the latter already when creating a monitor
interface.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-13 13:41:39 +02:00
Johannes Berg 818a986e4e cfg80211: move add/change interface monitor flags into params
Instead passing both flags, which can be NULL, and vif_params,
which are never NULL, move the flags into the vif_params and
use BIT(0), which is invalid from userspace, to indicate that
the flags were changed.

While updating all drivers, fix a small bug in wil6210 where
it was setting the flags to 0 instead of leaving them unchanged.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-13 13:41:38 +02:00
Johannes Berg 8c5e688944 mac80211: correct MU-MIMO monitor follow functionality
The MU-MIMO monitor follow functionality is broken because it
doesn't clear the MU-MIMO owner even if both follow features
are disabled. Fix that, and while at it move the code into a
new helper function. Call this also when creating a new monitor
interface to prepare for an upcoming cfg80211 change allowing
that.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-13 13:41:38 +02:00
Johannes Berg b0265024b8 cfg80211: allow leaving MU-MIMO monitor configuration unchanged
When changing monitor parameters, not setting the MU-MIMO attributes
should mean that they're not changed - it's documented that to turn
the feature off it's necessary to set all-zero group membership and
an invalid follow-address. This isn't implemented.

Fix this by making the parameters pointers, stop reusing the macaddr
struct member, and documenting that NULL pointers mean unchanged.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-13 13:41:37 +02:00
Johannes Berg 30841f5cde mac80211: drop frames too short for FCS earlier
Instead of dropping such frames only when removing the
monitor info, drop them earlier (keeping the warning)
and simplify removing monitor info. While at it, make
that function return void.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-13 13:41:36 +02:00
Matthias Kaehlcke 93f56de259 mac80211: Fix clang warning about constant operand in logical operation
When clang detects a non-boolean constant in a logical operation it
generates a 'constant-logical-operand' warning. In
ieee80211_try_rate_control_ops_get() the result of strlen(<const str>)
is used in a logical operation, clang resolves the expression to an
(integer) constant at compile time when clang's builtin strlen function
is used.

Change the condition to check for strlen() > 0 to make the constant
operand boolean and thus avoid the warning.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-04-13 08:27:02 +02:00
Patrik Flykt 258695222b bluetooth: Do not set IFF_POINTOPOINT
The IPv6 stack needs to send and receive Neighbor Discovery
messages. Remove the IFF_POINTOPOINT flag.

Signed-off-by: Patrik Flykt <patrik.flykt@linux.intel.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Reviewed-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:41 +02:00
Luiz Augusto von Dentz 814f1b243d Bluetooth: 6lowpan: Set tx_queue_len to DEFAULT_TX_QUEUE_LEN
Make netdev queue packets if we run out of credits.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:41 +02:00
Luiz Augusto von Dentz 8a505b7f39 Bluetooth: L2CAP: Add l2cap_le_flowctl_send
Consolidate code sending data to LE CoC channels and adds proper
accounting of packets sent, the remaining credits and how many packets
are queued.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:41 +02:00
Luiz Augusto von Dentz f183e52b8e Bluetooth: 6lowpan: Use netif APIs to flow control
Rely on netif_wake_queue and netif_stop_queue to flow control when
transmit resources are unavailable.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:40 +02:00
Luiz Augusto von Dentz e1008f95e1 Bluetooth: 6lowpan: Don't drop packets when run out of credits
Since l2cap_chan_send will now queue the packets there is no point in
checking the credits anymore.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:40 +02:00
Luiz Augusto von Dentz 24dcbf6622 6lowpan: Don't set IFF_NO_QUEUE
There is no point in setting IFF_NO_QUEUE should already have taken
care of setting it if tx_queue_len is not set, in fact this may
actually disable queue for interfaces that require it and do set
tx_queue_len.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:40 +02:00
Luiz Augusto von Dentz 03732141bf Bluetooth: L2CAP: Don't return -EAGAIN if out of credits
Just keep queueing them into TX queue since the caller might just have
to do the same and there is no impact in adding another packet to the
TX queue even if there aren't any credits to transmit them.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:40 +02:00
Luiz Augusto von Dentz da75fdc6bd Bluetooth: 6lowpan: Print errors during recv_pkt
This makes should make it more clear why a packet is being dropped.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:40 +02:00
Luiz Augusto von Dentz 27ce68a37b Bluetooth: 6lowpan: Remove unnecessary peer lookup
During chan_recv_cb there is already a peer lookup which can be passed
to recv_pkt directly instead of the channel.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:40 +02:00
Michael Scott 6dea44f5ac Bluetooth: 6lowpan: fix use after free in chan_suspend/resume
A status field in the skb_cb struct was storing a channel status
based on channel suspend/resume events.  This stored status was
then used to return EAGAIN if there were packet sending issues
in snd_pkt().

The issue is that the skb has been freed by the time the callback
to 6lowpan's suspend/resume was called.  So, this generates a
"use after free" issue that was noticed while running kernel tests
with KASAN debug enabled.

Let's eliminate the status field entirely as we can use the channel
tx_credits to indicate whether we should return EAGAIN when handling
packets.

Signed-off-by: Michael Scott <michael.scott@linaro.org>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:39 +02:00
Michael Scott d2891c4d07 Bluetooth: 6lowpan: fix delay work init in add_peer_chan()
When adding 6lowpan devices very rapidly we sometimes see a crash:
[23122.306615] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.9.0-43-arm64 #1 Debian 4.9.9.linaro.43-1
[23122.315400] Hardware name: HiKey Development Board (DT)
[23122.320623] task: ffff800075443080 task.stack: ffff800075484000
[23122.326551] PC is at expire_timers+0x70/0x150
[23122.330907] LR is at run_timer_softirq+0xa0/0x1a0
[23122.335616] pc : [<ffff000008142dd8>] lr : [<ffff000008142f58>] pstate: 600001c5

This was due to add_peer_chan() unconditionally initializing the
lowpan_btle_dev->notify_peers delayed work structure, even if the
lowpan_btle_dev passed into add_peer_chan() had previously been
initialized.

Normally, this would go unnoticed as the delayed work timer is set for
100 msec, however when calling add_peer_chan() faster than 100 msec it
clears out a previously queued delay work causing the crash above.

To fix this, let add_peer_chan() know when a new lowpan_btle_dev is passed
in so that it only performs the delay work initialization when needed.

Signed-off-by: Michael Scott <michael.scott@linaro.org>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:39 +02:00
Colin Ian King fa0eaf840a 6lowpan: fix assignment of peer_addr
The data from peer->chan->dst is not being copied to peer_addr, the
current code just updates the pointer and not the contents of what
it points to.  Fix this with the intended assignment.

Detected by CoverityScan, CID#1422111 ("Parse warning
(PW.PARAM_SET_BUT_NOT_USED)")

Fixes: fb6f2f606ce8 ("6lowpan: Fix IID format for Bluetooth")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:39 +02:00
Jonas Holmberg b48c3b59a3 Bluetooth: Change initial min and max interval
Use the initial connection interval recommended in Bluetooth
Specification v4.2 (30ms - 50ms).

Signed-off-by: Jonas Holmberg <jonashg@axis.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:38 +02:00
Colin Ian King cd50361c21 Bluetooth: fix assignments on error variable err
Variable err is being initialized to zero and then later being
set to the error return from the call to hci_req_run_skb; hence
we can remove the redundant initialization to zero.

Also on two occassions err is not being set from the error return
from the call to hci_req_run_skb, so add these missing assignments.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:38 +02:00
Dean Jenkins 27bfbc21a0 Bluetooth: Avoid bt_accept_unlink() double unlinking
There is a race condition between a thread calling bt_accept_dequeue()
and a different thread calling bt_accept_unlink(). Protection against
concurrency is implemented using sk locking. However, sk locking causes
serialisation of the bt_accept_dequeue() and bt_accept_unlink() threads.
This serialisation can cause bt_accept_dequeue() to obtain the sk from the
parent list but becomes blocked waiting for the sk lock held by the
bt_accept_unlink() thread. bt_accept_unlink() unlinks sk and this thread
releases the sk lock unblocking bt_accept_dequeue() which potentially runs
bt_accept_unlink() again on the same sk causing a crash. The attempt to
double unlink the same sk from the parent list can cause a NULL pointer
dereference crash due to bt_sk(sk)->parent becoming NULL on the first
unlink, followed by the second unlink trying to execute
bt_sk(sk)->parent->sk_ack_backlog-- in bt_accept_unlink() which crashes.

When sk is in the parent list, bt_sk(sk)->parent will be not be NULL.
When sk is removed from the parent list, bt_sk(sk)->parent is set to
NULL. Therefore, add a defensive check for bt_sk(sk)->parent not being
NULL to ensure that sk is still in the parent list after the sk lock has
been taken in bt_accept_dequeue(). If bt_sk(sk)->parent is detected as
being NULL then restart the loop so that the loop variables are refreshed
to use the latest values. This is necessary as list_for_each_entry_safe()
is not thread safe so causing a risk of an infinite loop occurring as sk
could point to itself.

In addition, in bt_accept_dequeue() increase the sk reference count to
protect against early freeing of sk. Early freeing can be possible if the
bt_accept_unlink() thread calls l2cap_sock_kill() or rfcomm_sock_kill()
functions before bt_accept_dequeue() gets the sk lock.

For test purposes, the probability of failure can be increased by putting
a msleep of 1 second in bt_accept_dequeue() between getting the sk and
waiting for the sk lock. This exposes the fact that the loop
list_for_each_entry_safe(p, n, &bt_sk(parent)->accept_q) is not safe from
threads that unlink sk from the list in parallel with the loop which can
cause sk to become stale within the loop.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:37 +02:00
Dean Jenkins e163376220 Bluetooth: Handle bt_accept_enqueue() socket atomically
There is a small risk that bt_accept_unlink() runs concurrently with
bt_accept_enqueue() on the same socket. This scenario could potentially
lead to a NULL pointer dereference of the socket's parent member because
the socket can be on the list but the socket's parent member is not yet
updated by bt_accept_enqueue().

Therefore, add socket locking inside bt_accept_enqueue() so that the
socket is added to the list AND the parent's socket address is set in the
socket's parent member. The socket locking ensures that the socket is on
the list with a valid non-NULL parent member.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:37 +02:00
Luiz Augusto von Dentz 9dae2e0303 6lowpan: Fix IID format for Bluetooth
According to RFC 7668 U/L bit shall not be used:

https://wiki.tools.ietf.org/html/rfc7668#section-3.2.2 [Page 10]:

   In the figure, letter 'b' represents a bit from the
   Bluetooth device address, copied as is without any changes on any
   bit.  This means that no bit in the IID indicates whether the
   underlying Bluetooth device address is public or random.

   |0              1|1              3|3              4|4              6|
   |0              5|6              1|2              7|8              3|
   +----------------+----------------+----------------+----------------+
   |bbbbbbbbbbbbbbbb|bbbbbbbb11111111|11111110bbbbbbbb|bbbbbbbbbbbbbbbb|
   +----------------+----------------+----------------+----------------+

Because of this the code cannot figure out the address type from the IP
address anymore thus it makes no sense to use peer_lookup_ba as it needs
the peer address type.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:36 +02:00
Luiz Augusto von Dentz fa09ae661f 6lowpan: Use netdev addr_len to determine lladdr len
This allow technologies such as Bluetooth to use its native lladdr which
is eui48 instead of eui64 which was expected by functions like
lowpan_header_decompress and lowpan_header_compress.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:36 +02:00
Alexander Aring 8a7a4b4767 ipv6: addrconf: fix 48 bit 6lowpan autoconfiguration
This patch adds support for 48 bit 6LoWPAN address length
autoconfiguration which is the case for BTLE 6LoWPAN.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:36 +02:00
Alexander Aring 94e4a68039 6lowpan: iphc: override l2 packet information
The skb->pkt_type need to be set by L2, but on 6LoWPAN there exists L2
e.g. BTLE which doesn't has multicast addressing. If it's a multicast or
not is detected by IPHC headers multicast bit. The IPv6 layer will
evaluate this pkt_type, so we force set this type while uncompressing.
Should be okay for 802.15.4 as well.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:36 +02:00
Patrik Flykt be054fc830 6lowpan: Set MAC address length according to LOWPAN_LLTYPE
Set MAC address length according to the 6LoWPAN link layer in use.
Bluetooth Low Energy uses 48 bit addressing while IEEE802.15.4 uses
64 bits.

Signed-off-by: Patrik Flykt <patrik.flykt@linux.intel.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:36 +02:00
Patrik Flykt c259d1413b bluetooth: Set 6 byte device addresses
Set BTLE MAC addresses that are 6 bytes long and not 8 bytes
that are used in other places with 6lowpan.

Signed-off-by: Patrik Flykt <patrik.flykt@linux.intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:36 +02:00
Elena Reshetova dab6b5daee Bluetooth: convert rfcomm_dlc.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-04-12 22:02:36 +02:00
Ilan Tayari eaffadbbb3 gso: Support frag_list splitting with head_frag
A driver may use build_skb() for received packets.
These SKBs then have a head_frag.

Since commit d7e8883cfc ("net: make GRO aware of
skb->head_frag"), GRO may build frag_list SKBs out of
head_frag received SKBs.
In such a case, the chained SKBs end up with a head_frag.

Commit 07b26c9454 ("gso: Support partial splitting at
the frag_list pointer") adds partial segmentation of frag_list
SKB chains into individual SKBs.
However, this is not done if the chained SKBs have any
linear part, because the device may not be able to DMA
the private linear buffer.

A chained frag_list SKB with head_frag is wrongfully
detected in this case as having a private linear part
and thus falls back to software GSO, while in fact the
linear part is backed by a DMA page just like any other frag.

This causes low performance when forwarding those packets
that were built with build_skb()

Allow partial segmentation at the frag_list pointer for
chained SKBs with head_frag.

Note that such SKBs can only be created by GRO, when applied
to received packets with head_frag.
Also note that this change only affects the data path that
performs the partial segmentation at frag_list pointer, and
not any of the other more common data paths.

Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-12 13:53:35 -04:00
Rabin Vincent a2d6cbb067 ipv6: Fix idev->addr_list corruption
addrconf_ifdown() removes elements from the idev->addr_list without
holding the idev->lock.

If this happens while the loop in __ipv6_dev_get_saddr() is handling the
same element, that function ends up in an infinite loop:

  NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [test:1719]
  Call Trace:
   ipv6_get_saddr_eval+0x13c/0x3a0
   __ipv6_dev_get_saddr+0xe4/0x1f0
   ipv6_dev_get_saddr+0x1b4/0x204
   ip6_dst_lookup_tail+0xcc/0x27c
   ip6_dst_lookup_flow+0x38/0x80
   udpv6_sendmsg+0x708/0xba8
   sock_sendmsg+0x18/0x30
   SyS_sendto+0xb8/0xf8
   syscall_common+0x34/0x58

Fixes: 6a923934c3 (Revert "ipv6: Revert optional address flusing on ifdown.")
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-12 13:23:39 -04:00
Guillaume Nault 2f858b928b l2tp: define parameters of l2tp_tunnel_find*() as "const"
l2tp_tunnel_find() and l2tp_tunnel_find_nth() don't modify "net".

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-12 10:44:02 -04:00
Guillaume Nault 9aaef50c44 l2tp: define parameters of l2tp_session_get*() as "const"
Make l2tp_pernet()'s parameter constant, so that l2tp_session_get*() can
declare their "net" variable as "const".
Also constify "ifname" in l2tp_session_get_by_ifname().

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-12 10:44:02 -04:00
Johannes Berg df7dd8fc96 net: xdp: don't export dev_change_xdp_fd()
Since dev_change_xdp_fd() is only used in rtnetlink, which must
be built-in, there's no reason to export dev_change_xdp_fd().

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-12 10:29:40 -04:00
Ursula Braun 2c9c16825e net/smc: do not use IB_SEND_INLINE together with mapped data
smc specifies IB_SEND_INLINE for IB_WR_SEND ib_post_send calls, but
provides a mapped buffer to be sent. This is inconsistent, since
IB_SEND_INLINE works without mapped buffer. Problem has not been
detected in the past, because tests had been limited to Connect X3 cards
from Mellanox, whose mlx4 driver just ignored the IB_SEND_INLINE flag.
For now, the IB_SEND_INLINE flag is removed.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 23:01:14 -04:00
Ursula Braun 288c83902a net/smc: destruct non-accepted sockets
Make sure sockets never accepted are removed cleanly.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 23:01:14 -04:00
Ursula Braun f5227cd9f1 net/smc: remove duplicate unhash
unhash is already called in sock_put_work. Remove the second call.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 23:01:14 -04:00
Ursula Braun a98bf8c0bc net/smc: guarantee ConnClosed send after shutdown SHUT_WR
State SMC_CLOSED should be reached only, if ConnClosed has been sent to
the peer. If ConnClosed is received from the peer, a socket with
shutdown SHUT_WR done, switches errorneously to state SMC_CLOSED, which
means the peer socket is dangling. The local SMC socket is supposed to
switch to state APPFINCLOSEWAIT to make sure smc_close_final() is called
during socket close.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 23:01:14 -04:00
Ursula Braun 46c28dbd4c net/smc: no socket state changes in tasklet context
Several state changes occur during SMC socket closing. Currently
state changes triggered locally occur in process context with
lock_sock() taken while state changes triggered by peer occur in
tasklet context with bh_lock_sock() taken. bh_lock_sock() does not
wait till a lock_sock(() task in process context is finished. This
may lead to races in socket state transitions resulting in dangling
SMC-sockets, or it may lead to duplicate SMC socket freeing.
This patch introduces a closing worker to run all state changes under
lock_sock().

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 23:01:14 -04:00
Ursula Braun 90e9517ed9 net/smc: always call the POLL_IN part of sk_wake_async
Wake up reading file descriptors for a closing socket as well, otherwise
some socket applications may stall.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 23:01:14 -04:00
Ursula Braun 90cacb2ea6 net/smc: guarantee reset of write_blocked for heavy workload
If peer indicates write_blocked, the cursor state of the received data
should be send to the peer immediately (in smc_tx_consumer_update()).
Afterwards the write_blocked indicator is cleared.

If there is no free slot for another write request, sending is postponed
to worker smc_tx_work, and the write_blocked indicator is not cleared.
Therefore another clearing check is needed in smc_tx_work().

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 23:01:14 -04:00
Ursula Braun 5da7e4d355 net/smc: return active RoCE port only
SMC requires an active ib port on the RoCE device.
smc_pnet_find_roce_resource() determines the matching RoCE device port
according to the configured PNET table. Do not return the found
RoCE device port, if it is not flagged active.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 23:01:14 -04:00
Ursula Braun 249633a443 net/smc: remove useless smc_ib_devices_list check
The global event handler is created only, if the ib_device has already
been used by at least one link group. It is guaranteed that there exists
the corresponding entry in the smc_ib_devices list. Get rid of this
superfluous check.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 23:01:14 -04:00
Ursula Braun 3c22e8f320 net/smc: get rid of old comment
This patch removes an outdated comment.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 23:01:14 -04:00
Ido Schimmel 5b8d5429da bridge: netlink: register netdevice before executing changelink
Peter reported a kernel oops when executing the following command:

$ ip link add name test type bridge vlan_default_pvid 1

[13634.939408] BUG: unable to handle kernel NULL pointer dereference at
0000000000000190
[13634.939436] IP: __vlan_add+0x73/0x5f0
[...]
[13634.939783] Call Trace:
[13634.939791]  ? pcpu_next_unpop+0x3b/0x50
[13634.939801]  ? pcpu_alloc+0x3d2/0x680
[13634.939810]  ? br_vlan_add+0x135/0x1b0
[13634.939820]  ? __br_vlan_set_default_pvid.part.28+0x204/0x2b0
[13634.939834]  ? br_changelink+0x120/0x4e0
[13634.939844]  ? br_dev_newlink+0x50/0x70
[13634.939854]  ? rtnl_newlink+0x5f5/0x8a0
[13634.939864]  ? rtnl_newlink+0x176/0x8a0
[13634.939874]  ? mem_cgroup_commit_charge+0x7c/0x4e0
[13634.939886]  ? rtnetlink_rcv_msg+0xe1/0x220
[13634.939896]  ? lookup_fast+0x52/0x370
[13634.939905]  ? rtnl_newlink+0x8a0/0x8a0
[13634.939915]  ? netlink_rcv_skb+0xa1/0xc0
[13634.939925]  ? rtnetlink_rcv+0x24/0x30
[13634.939934]  ? netlink_unicast+0x177/0x220
[13634.939944]  ? netlink_sendmsg+0x2fe/0x3b0
[13634.939954]  ? _copy_from_user+0x39/0x40
[13634.939964]  ? sock_sendmsg+0x30/0x40
[13634.940159]  ? ___sys_sendmsg+0x29d/0x2b0
[13634.940326]  ? __alloc_pages_nodemask+0xdf/0x230
[13634.940478]  ? mem_cgroup_commit_charge+0x7c/0x4e0
[13634.940592]  ? mem_cgroup_try_charge+0x76/0x1a0
[13634.940701]  ? __handle_mm_fault+0xdb9/0x10b0
[13634.940809]  ? __sys_sendmsg+0x51/0x90
[13634.940917]  ? entry_SYSCALL_64_fastpath+0x1e/0xad

The problem is that the bridge's VLAN group is created after setting the
default PVID, when registering the netdevice and executing its
ndo_init().

Fix this by changing the order of both operations, so that
br_changelink() is only processed after the netdevice is registered,
when the VLAN group is already initialized.

Fixes: b6677449df ("bridge: netlink: call br_changelink() during br_dev_newlink()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Peter V. Saveliev <peter@svinota.eu>
Tested-by: Peter V. Saveliev <peter@svinota.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 22:22:44 -04:00
Ido Schimmel b6fe0440c6 bridge: implement missing ndo_uninit()
While the bridge driver implements an ndo_init(), it was missing a
symmetric ndo_uninit(), causing the different de-initialization
operations to be scattered around its dellink() and destructor().

Implement a symmetric ndo_uninit() and remove the overlapping operations
from its dellink() and destructor().

This is a prerequisite for the next patch, as it allows us to have a
proper cleanup upon changelink() failure during the bridge's newlink().

Fixes: b6677449df ("bridge: netlink: call br_changelink() during br_dev_newlink()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 22:22:44 -04:00
Willem de Bruijn 8f917bba00 bpf: pass sk to helper functions
BPF helper functions access socket fields through skb->sk. This is not
set in ingress cgroup and socket filters. The association is only made
in skb_set_owner_r once the filter has accepted the packet. Sk is
available as socket lookup has taken place.

Temporarily set skb->sk to sk in these cases.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 14:54:19 -04:00
Wei Yongjun cb6bf9cfdb devlink: fix return value check in devlink_dpipe_header_put()
Fix the return value check which testing the wrong variable
in devlink_dpipe_header_put().

Fixes: 1555d204e7 ("devlink: Support for pipeline debug (dpipe)")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 14:53:32 -04:00
Johannes Berg be9370a7d8 bpf: remove struct bpf_prog_type_list
There's no need to have struct bpf_prog_type_list since
it just contains a list_head, the type, and the ops
pointer. Since the types are densely packed and not
actually dynamically registered, it's much easier and
smaller to have an array of type->ops pointer. Also
initialize this array statically to remove code needed
to initialize it.

In order to save duplicating the list, move it to a new
header file and include it in the places needing it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 14:38:43 -04:00
Guillaume Nault 55a3ce3b9d l2tp: remove l2tp_session_find()
This function isn't used anymore.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 13:48:09 -04:00
Guillaume Nault af87ae465a l2tp: remove useless duplicate session detection in l2tp_netlink
There's no point in checking for duplicate sessions at the beginning of
l2tp_nl_cmd_session_create(); the ->session_create() callbacks already
return -EEXIST when the session already exists.

Furthermore, even if l2tp_session_find() returns NULL, a new session
might be created right after the test. So relying on ->session_create()
to avoid duplicate session is the only sane behaviour.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 13:48:09 -04:00
David S. Miller c6606a87db Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2017-04-11

1) Remove unused field from struct xfrm_mgr.

2) Code size optimizations for the xfrm prefix hash and
   address match.

3) Branch optimization for addr4_match.

All patches from Alexey Dobriyan.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 10:10:30 -04:00
NeilBrown 717a94b5fc sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags()
It is not safe for one thread to modify the ->flags
of another thread as there is no locking that can protect
the update.

So tsk_restore_flags(), which takes a task pointer and modifies
the flags, is an invitation to do the wrong thing.

All current users pass "current" as the task, so no developers have
accepted that invitation.  It would be best to ensure it remains
that way.

So rename tsk_restore_flags() to current_restore_flags() and don't
pass in a task_struct pointer.  Always operate on current->flags.

Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-11 09:06:32 +02:00
Herbert Xu 633439f5b7 xfrm: Prepare for CRYPTO_MAX_ALG_NAME expansion
This patch fixes the xfrm_user code to use the actual array size
rather than the hard-coded CRYPTO_MAX_ALG_NAME length.  This is
because the array size is fixed at 64 bytes while we want to increase
the in-kernel CRYPTO_MAX_ALG_NAME value.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Tested-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-10 19:17:26 +08:00
Eric Dumazet 17c3060b17 tcp: clear saved_syn in tcp_disconnect()
In the (very unlikely) case a passive socket becomes a listener,
we do not want to duplicate its saved SYN headers.

This would lead to double frees, use after free, and please hackers and
various fuzzers

Tested:
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, IPPROTO_TCP, TCP_SAVE_SYN, [1], 4) = 0
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0

   +0 bind(3, ..., ...) = 0
   +0 listen(3, 5) = 0

   +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <...>
  +.1 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

   +0 connect(4, AF_UNSPEC, ...) = 0
   +0 close(3) = 0
   +0 bind(4, ..., ...) = 0
   +0 listen(4, 5) = 0

   +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <...>
  +.1 < . 1:1(0) ack 1 win 257

Fixes: cd8ae85299 ("tcp: provide SYN headers for passive connections")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-09 18:27:28 -07:00
David S. Miller bf74b20d00 Revert "rtnl: Add support for netdev event to link messages"
This reverts commit def12888c1.

As per discussion between Roopa Prabhu and David Ahern, it is
advisable that we instead have the code collect the setlink triggered
events into a bitmask emitted in the IFLA_EVENT netlink attribute.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-09 14:45:21 -07:00
Liping Zhang 7cddd967bf netfilter: nf_ct_expect: use proper RCU list traversal/update APIs
We should use proper RCU list APIs to manipulate help->expectations,
as we can dump the conntrack's expectations via nfnetlink, i.e. in
ctnetlink_exp_ct_dump_table(), where only rcu_read_lock is acquired.

So for list traversal, use hlist_for_each_entry_rcu; for list add/del,
use hlist_add_head_rcu and hlist_del_rcu.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-08 23:52:17 +02:00
Liping Zhang 207df81501 netfilter: ctnetlink: skip dumping expect when nfct_help(ct) is NULL
For IPCTNL_MSG_EXP_GET, if the CTA_EXPECT_MASTER attr is specified, then
the NLM_F_DUMP request will dump the expectations related to this
connection tracking.

But we forget to check whether the conntrack has nf_conn_help or not,
so if nfct_help(ct) is NULL, oops will happen:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
 IP: ctnetlink_exp_ct_dump_table+0xf9/0x1e0 [nf_conntrack_netlink]
 Call Trace:
  ? ctnetlink_exp_ct_dump_table+0x75/0x1e0 [nf_conntrack_netlink]
  netlink_dump+0x124/0x2a0
  __netlink_dump_start+0x161/0x190
  ctnetlink_dump_exp_ct+0x16c/0x1bc [nf_conntrack_netlink]
  ? ctnetlink_exp_fill_info.constprop.33+0xf0/0xf0 [nf_conntrack_netlink]
  ? ctnetlink_glue_seqadj+0x20/0x20 [nf_conntrack_netlink]
  ctnetlink_get_expect+0x32e/0x370 [nf_conntrack_netlink]
  ? debug_lockdep_rcu_enabled+0x1d/0x20
  nfnetlink_rcv_msg+0x60a/0x6a9 [nfnetlink]
  ? nfnetlink_rcv_msg+0x1b9/0x6a9 [nfnetlink]
  [...]

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-08 23:52:17 +02:00
Liping Zhang 0c7930e576 netfilter: make it safer during the inet6_dev->addr_list traversal
inet6_dev->addr_list is protected by inet6_dev->lock, so only using
rcu_read_lock is not enough, we should acquire read_lock_bh(&idev->lock)
before the inet6_dev->addr_list traversal.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-08 23:52:16 +02:00
Liping Zhang 3173d5b8c8 netfilter: ctnetlink: make it safer when checking the ct helper name
One CPU is doing ctnetlink_change_helper(), while another CPU is doing
unhelp() at the same time. So even if help->helper is not NULL at first,
the later statement strcmp(help->helper->name, ...) may still access
the NULL pointer.

So we must use rcu_read_lock and rcu_dereference to avoid such _bad_
thing happen.

Fixes: f95d7a46bc ("netfilter: ctnetlink: Fix regression in CTA_HELP processing")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-08 23:52:16 +02:00
Gao Feng 8b5995d063 netfilter: helper: Add the rcu lock when call __nf_conntrack_helper_find
When invoke __nf_conntrack_helper_find, it needs the rcu lock to
protect the helper module which would not be unloaded.

Now there are two caller nf_conntrack_helper_try_module_get and
ctnetlink_create_expect which don't hold rcu lock. And the other
callers left like ctnetlink_change_helper, ctnetlink_create_conntrack,
and ctnetlink_glue_attach_expect, they already hold the rcu lock
or spin_lock_bh.

Remove the rcu lock in functions nf_ct_helper_expectfn_find_by_name
and nf_ct_helper_expectfn_find_by_symbol. Because they return one pointer
which needs rcu lock, so their caller should hold the rcu lock, not in
these two functions.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-08 23:52:15 +02:00
Liping Zhang 97aae0df1d netfilter: ctnetlink: using bit to represent the ct event
Otherwise, creating a new conntrack via nfnetlink:
  # conntrack -I -p udp -s 1.1.1.1 -d 2.2.2.2 -t 10 --sport 10 --dport 20

will emit the wrong ct events(where UPDATE should be NEW):
  # conntrack -E
  [UPDATE] udp      17 10 src=1.1.1.1 dst=2.2.2.2 sport=10 dport=20
  [UNREPLIED] src=2.2.2.2 dst=1.1.1.1 sport=20 dport=10 mark=0

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-08 23:52:15 +02:00
Florian Fainelli a86d8becc3 net: dsa: Factor bottom tag receive functions
All DSA tag receive functions do strictly the same thing after they have located
the originating source port from their tag specific protocol:

- push ETH_HLEN bytes
- set pkt_type to PACKET_HOST
- call eth_type_trans()
- bump up counters
- call netif_receive_skb()

Factor all of that into dsa_switch_rcv(). This also makes us return a pointer to
a sk_buff, which makes us symetric with the xmit function.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-08 13:49:36 -07:00
Florian Fainelli 16c5dcb13a net: dsa: Move skb_unshare() to dsa_switch_rcv()
All DSA tag receive functions need to unshare the skb before mangling it, move
this to the generic dsa_switch_rcv() function which will allow us to make the
tag receive function return their mangled skb without caring about freeing a
NULL skb.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-08 13:49:36 -07:00
Florian Fainelli 9d7f9c4f78 net: dsa: Do not check for NULL dst in tag parsers
dsa_switch_rcv() already tests for dst == NULL, so there is no need to duplicate
the same check within the tag receive functions.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-08 13:49:36 -07:00
Eric Dumazet 2638fd0f92 netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
Denys provided an awesome KASAN report pointing to an use
after free in xt_TCPMSS

I have provided three patches to fix this issue, either in xt_TCPMSS or
in xt_tcpudp.c. It seems xt_TCPMSS patch has the smallest possible
impact.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-08 22:24:19 +02:00
Arushi Singhal 1e038e3eef netfilter: ip6_tables: Remove unneccessary comments
This comments are obsolete and should go, as there are no set of rules
per CPU anymore.

Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com>
2017-04-08 22:11:35 +02:00
Gao Feng 7cc2b043bc net: tcp: Increase TCP_MIB_OUTRSTS even though fail to alloc skb
Because TCP_MIB_OUTRSTS is an important count, so always increase it
whatever send it successfully or not.

Now move the increment of TCP_MIB_OUTRSTS to the top of
tcp_send_active_reset to make sure it is increased always even though
fail to alloc skb.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-08 08:30:09 -07:00
Guillaume Nault 321a52a391 l2tp: don't mask errors in pppol2tp_getsockopt()
pppol2tp_getsockopt() doesn't take into account the error code returned
by pppol2tp_tunnel_getsockopt() or pppol2tp_session_getsockopt(). If
error occurs there, pppol2tp_getsockopt() continues unconditionally and
reports erroneous values.

Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-08 08:29:04 -07:00
Guillaume Nault 364700cf8f l2tp: don't mask errors in pppol2tp_setsockopt()
pppol2tp_setsockopt() unconditionally overwrites the error value
returned by pppol2tp_tunnel_setsockopt() or
pppol2tp_session_setsockopt(), thus hiding errors from userspace.

Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-08 08:29:04 -07:00
Chenbo Feng 5daab9db7b New getsockopt option to get socket cookie
Introduce a new getsockopt operation to retrieve the socket cookie
for a specific socket based on the socket fd.  It returns a unique
non-decreasing cookie for each socket.
Tested: https://android-review.googlesource.com/#/c/358163/

Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-08 08:07:01 -07:00
Sean Wang 5cd8985a19 net-next: dsa: add Mediatek tag RX/TX handler
Add the support for the 4-bytes tag for DSA port distinguishing inserted
allowing receiving and transmitting the packet via the particular port.
The tag is being added after the source MAC address in the ethernet
header.

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: Landen Chao <Landen.Chao@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-07 13:50:55 -07:00
Jens Axboe 65f619d253 Merge branch 'for-linus' into for-4.12/block
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:45:20 -06:00
Yuchung Cheng cc663f4d4c tcp: restrict F-RTO to work-around broken middle-boxes
The recent extension of F-RTO 89fe18e44 ("tcp: extend F-RTO
to catch more spurious timeouts") interacts badly with certain
broken middle-boxes.  These broken boxes modify and falsely raise
the receive window on the ACKs. During a timeout induced recovery,
F-RTO would send new data packets to probe if the timeout is false
or not. Since the receive window is falsely raised, the receiver
would silently drop these F-RTO packets. The recovery would take N
(exponentially backoff) timeouts to repair N packet losses.  A TCP
performance killer.

Due to this unfortunate situation, this patch removes this extension
to revert F-RTO back to the RFC specification.

Fixes: 89fe18e44f ("tcp: extend F-RTO to catch more spurious timeouts")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-07 11:44:00 -07:00
Arushi Singhal d4ef383541 netfilter: Remove exceptional & on function name
Remove & from function pointers to conform to the style found elsewhere
in the file. Done using the following semantic patch

// <smpl>
@r@
identifier f;
@@

f(...) { ... }
@@
identifier r.f;
@@

- &f
+ f
// </smpl>

Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-07 18:24:47 +02:00
simran singhal cbbb40e2ec net: netfilter: Use list_{next/prev}_entry instead of list_entry
This patch replace list_entry with list_prev_entry as it makes the
code more clear to read.

Signed-off-by: simran singhal <singhalsimran0@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-07 17:33:02 +02:00
simran singhal cdec26858e netfilter: Use seq_puts()/seq_putc() where possible
For string without format specifiers, use seq_puts(). For
seq_printf("\n"), use seq_putc('\n').

Signed-off-by: simran singhal <singhalsimran0@gmail.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-07 17:29:21 +02:00
simran singhal 68ad546aef netfilter: Remove unnecessary cast on void pointer
The following Coccinelle script was used to detect this:
@r@
expression x;
void* e;
type T;
identifier f;
@@
(
  *((T *)e)
|
  ((T *)x)[...]
|
  ((T*)x)->f
|

- (T*)
  e
)

Unnecessary parantheses are also remove.

Signed-off-by: simran singhal <singhalsimran0@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-07 17:29:17 +02:00
Florian Larysch bbadb9a222 net: ipv4: fix multipath RTM_GETROUTE behavior when iif is given
inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to
ip_route_input when iif is given. If a multipath route is present for
the designated destination, fib_multipath_hash ends up being called with
that skb. However, as that skb contains no information beyond the
protocol type, the calculated hash does not match the one we would see
for a real packet.

There is currently no way to fix this for layer 4 hashing, as
RTM_GETROUTE doesn't have the necessary information to create layer 4
headers. To fix this for layer 3 hashing, set appropriate saddr/daddrs
in the skb and also change the protocol to UDP to avoid special
treatment for ICMP.

Signed-off-by: Florian Larysch <fl@n621.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-07 07:56:14 -07:00
Pablo Neira Ayuso dedb67c4b4 netfilter: Add nfnl_msg_type() helper function
Add and use nfnl_msg_type() function to replace opencoded nfnetlink
message type. I suggested this change, Arushi Singhal made an initial
patch to address this but was missing several spots.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-07 16:31:36 +02:00
David S. Miller bd41486044 This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - Code and Style cleanups, by Sven Eckelmann (5 patches)
 
  - Remove an unneccessary memset, by Tobias Klauser
 
  - DAT and BLA optimizations for various corner cases, by Andreas Pape
    (5 patches)
 
  - forward/rebroadcast packet restructuring, by Linus Luessing
    (2 patches)
 
  - ethtool cleanup and remove unncessary code, by Sven Eckelmann
    (4 patches)
 
  - use net_device_stats from net_device instead of private copy,
    by Tobias Klauser
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAljmSyUWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeodYrD/4xzy0IGyF7vb/q3Nj0EeioldL8
 /8j9G14RfnlQaTIGxHEYz6oo1A6+cuH+t/4N+ilPFv6F0bBXG3GsWYrBCYLhVKwu
 2BQcoXC4ysdVqdrc7UDG5JY/t+n7JwhIcBw6kta9mt5fFVmPI/KHpefeaiUkdJb4
 23YkYsOSPVcD/Yv6+uiyfOG8FFxDrqFyn1uHpE1K5/05jLhJJ8S7nODNsitMOv2m
 EMLNJwTstSK/boPoxIpSCraxZGgpb0rt0K6HiBiNq0GyGRCtz0pxBpDOikVwTB62
 hy2aW3R59baGHPxA47oOAF60cJUlTj4pYxmsGwd9hzeJkw4eMvcYJzW6/2PXB3bj
 e4xBT5T+Y3bZDFdK+jW4hOCHLGC6gpwWSs5P6YDhCalhJ/meV75ef03ruIvGhVh9
 qOhuSMAnqCsoJ8FMpLT3GMkSGSkOXEhL8iF/r3/PR/q0cPCquQnMjQg+Kqp7+eb6
 iNHwSA0dqUiBB7KplS8KTcAxre8KlmNmGr544KY9eK0BZZ9BOleT2a2W7+QSTCKx
 o8fl44HmwhgxQ+mYJwG1kEfgmTRhvrOmr6KP9qTkn5lAoIGXspwBo/bA5+cyLJOU
 ewsx8JG5gfT2yUFB1HqdZon+rkW6KKopdHL+BL3BvQRYTUojxmr0lJBrMWg7GNYv
 2lwrasmHxhWbF2L4xg==
 =r99K
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20170406' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - Code and Style cleanups, by Sven Eckelmann (5 patches)

 - Remove an unneccessary memset, by Tobias Klauser

 - DAT and BLA optimizations for various corner cases, by Andreas Pape
   (5 patches)

 - forward/rebroadcast packet restructuring, by Linus Luessing
   (2 patches)

 - ethtool cleanup and remove unncessary code, by Sven Eckelmann
   (4 patches)

 - use net_device_stats from net_device instead of private copy,
   by Tobias Klauser
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 14:37:50 -07:00
David S. Miller ec1af27ea8 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAWOYUWfSw1s6N8H32AQIqdg/9Hi+47eues/TBbogP8eRrqVEoNHFy75e/
 MMTFe0/Qio7ps78VuOSThbqh96dzIX5K5/7JdiHZyQk2QCTaJ2BvheCUISQovhFl
 yuAJcBhkO5iiQkR0agYdHVjIQGRth3usNIEyD1rm1DS/lr8ec9/iyjoipKpsZmxt
 WlRF3eGgqA+cLpH4K+k4x/LJwIl8868MBz58p6XXW2yZFRygQzYHmMobhDwgLoC2
 C2lHPEyllK7qcIaZD7SI/a2/bMwh7QTx1tJuQK3DgtJrAHigx96uxH3jqECk7fLg
 EhjLqIFmWVCUcrBbUqjlNtcuevzxCZTCCB0LAZgmOTyEyCFJzgoQmQo97VhMPbG5
 JF9bKg+JE6P5iwqtTBEW9p+LoyM7VAt6SzeuKH/vNAVGHc0ULDMB8XPYF3Nvqa3L
 RGIwcxWCAItZFdDCUvWgTyEuZtVXu6LbuvnU1HkaXlXsLLi4041MQgcekY58k6kv
 z4YnXojy0+mciJ4WV/7CLfNMyP36G0gwLugjLAsJwigxJoOTtsfphkGcpWlP9hBm
 IyTFJw0qbbuQD7fprfw//e+IgfjDQbxYMQKxQaJZflXzYDCab8PQkFTl1kWPrJSR
 yR0rk8wKb7Z1fyl/zpUNbw7KdFganqhkZ6jbOrr9G8Hp8jyubnwMCI5B2rSZfe1V
 CSOtCbEt8Is=
 =JnXV
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20170406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Miscellany

Here's a set of patches that make some minor changes to AF_RXRPC:

 (1) Store error codes in struct rxrpc_call::error as negative codes and
     only convert to positive in recvmsg() to avoid confusion inside the
     kernel.

 (2) Note the result of trying to abort a call (this fails if the call is
     already 'completed').

 (3) Don't abort on temporary errors whilst processing challenge and
     response packets, but rather drop the packet and wait for
     retransmission.

And also adds some more tracing:

 (4) Protocol errors.

 (5) Received abort packets.

 (6) Changes in the Rx window size due to ACK packet information.

 (7) Client call initiation (to allow the rxrpc_call struct pointer, the
     wire call ID and the user ID/afs_call pointer to be cross-referenced).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 14:22:46 -07:00
Xin Long 34b2789f1d sctp: listen on the sock only when it's state is listening or closed
Now sctp doesn't check sock's state before listening on it. It could
even cause changing a sock with any state to become a listening sock
when doing sctp_listen.

This patch is to fix it by checking sock's state in sctp_listen, so
that it will listen on the sock with right state.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 13:55:51 -07:00
R. Parameswaran b784e7ebfc L2TP:Adjust intf MTU, add underlay L3, L2 hdrs.
Existing L2TP kernel code does not derive the optimal MTU for Ethernet
pseudowires and instead leaves this to a userspace L2TP daemon or
operator. If an MTU is not specified, the existing kernel code chooses
an MTU that does not take account of all tunnel header overheads, which
can lead to unwanted IP fragmentation. When L2TP is used without a
control plane (userspace daemon), we would prefer that the kernel does a
better job of choosing a default pseudowire MTU, taking account of all
tunnel header overheads, including IP header options, if any. This patch
addresses this.

Change-set here uses the new kernel function, kernel_sock_ip_overhead(),
to factor the outer IP overhead on the L2TP tunnel socket (including
IP Options, if any) when calculating the default MTU for an Ethernet
pseudowire, along with consideration of the inner Ethernet header.

Signed-off-by: R. Parameswaran <rparames@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 13:43:31 -07:00
R. Parameswaran 113c307593 New kernel function to get IP overhead on a socket.
A new function, kernel_sock_ip_overhead(), is provided
to calculate the cumulative overhead imposed by the IP
Header and IP options, if any, on a socket's payload.
The new function returns an overhead of zero for sockets
that do not belong to the IPv4 or IPv6 address families.
This is used in the L2TP code path to compute the
total outer IP overhead on the L2TP tunnel socket when
calculating the default MTU for Ethernet pseudowires.

Signed-off-by: R. Parameswaran <rparames@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 13:43:31 -07:00
Gao Feng 2c62e0bc68 netfilter: ctnetlink: Expectations must have a conntrack helper area
The expect check function __nf_ct_expect_check() asks the master_help is
necessary. So it is unnecessary to go ahead in ctnetlink_alloc_expect
when there is no help.

Actually the commit bc01befdcf ("netfilter: ctnetlink: add support for
user-space expectation helpers") permits ctnetlink create one expect
even though there is no master help. But the latter commit 3d058d7bc2
("netfilter: rework user-space expectation helper support") disables it
again.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-06 22:01:42 +02:00
Florian Westphal 6e699867f8 netfilter: nat: avoid use of nf_conn_nat extension
successful insert into the bysource hash sets IPS_SRC_NAT_DONE status bit
so we can check that instead of presence of nat extension which requires
extra deref.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-06 22:01:42 +02:00
Gao Feng cba81cc4c9 netfilter: nat: nf_nat_mangle_{udp,tcp}_packet returns boolean
nf_nat_mangle_{udp,tcp}_packet() returns int. However, it is used as
bool type in many spots. Fix this by consistently handle this return
value as a boolean.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-06 22:01:38 +02:00
Kees Cook 82fe0d2b44 af_unix: Use designated initializers
Prepare to mark sensitive kernel structures for randomization by making
sure they're using designated initializers. These were identified during
allyesconfig builds of x86, arm, and arm64, and the initializer fixes
were extracted from grsecurity. In this case, NULL initialize with { }
instead of undesignated NULLs.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 12:43:04 -07:00
WANG Cong 92f9170621 net_sched: check noop_qdisc before qdisc_hash_add()
Dmitry reported a crash when injecting faults in
attach_one_default_qdisc() and dev->qdisc is still
a noop_disc, the check before qdisc_hash_add() fails
to catch it because it tests NULL. We should test
against noop_qdisc since it is the default qdisc
at this point.

Fixes: 59cc1f61f0 ("net: sched: convert qdisc linked list to hashtable")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 12:28:39 -07:00
Florian Larysch a8801799c6 net: ipv4: fix multipath RTM_GETROUTE behavior when iif is given
inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to
ip_route_input when iif is given. If a multipath route is present for
the designated destination, ip_multipath_icmp_hash ends up being called,
which uses the source/destination addresses within the skb to calculate
a hash. However, those are not set in the synthetic skb, causing it to
return an arbitrary and incorrect result.

Instead, use UDP, which gets no such special treatment.

Signed-off-by: Florian Larysch <fl@n621.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 12:18:56 -07:00
David S. Miller 0e4c0ee580 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-04-06 11:57:04 -07:00
Gao Feng ec0e3f0111 netfilter: nf_ct_expect: Add nf_ct_remove_expect()
When remove one expect, it needs three statements. And there are
multiple duplicated codes in current code. So add one common function
nf_ct_remove_expect to consolidate this.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-06 18:39:40 +02:00
Gao Feng 92f73221f9 netfilter: expect: Make sure the max_expected limit is effective
Because the type of expecting, the member of nf_conn_help, is u8, it
would overflow after reach U8_MAX(255). So it doesn't work when we
configure the max_expected exceeds 255 with expect policy.

Now add the check for max_expected. Return the -EINVAL when it exceeds
the limit.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-06 18:32:16 +02:00
Pablo Neira Ayuso f323d95469 netfilter: nf_tables: add nft_is_base_chain() helper
This new helper function allows us to check if this is a basechain.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-06 18:32:04 +02:00
David S. Miller 6f14f443d3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple cases of overlapping changes (adding code nearby,
a function whose name changes, for example).

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 08:24:51 -07:00
David Howells 89ca694806 rxrpc: Trace client call connection
Add a tracepoint (rxrpc_connect_call) to log the combination of rxrpc_call
pointer, afs_call pointer/user data and wire call parameters to make it
easier to match the tracebuffer contents to captured network packets.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-04-06 11:10:41 +01:00
David Howells 740586d290 rxrpc: Trace changes in a call's receive window size
Add a tracepoint (rxrpc_rx_rwind_change) to log changes in a call's receive
window size as imposed by the peer through an ACK packet.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-04-06 11:10:41 +01:00
David Howells 005ede286f rxrpc: Trace received aborts
Add a tracepoint (rxrpc_rx_abort) to record received aborts.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-04-06 11:10:41 +01:00
David Howells fb46f6ee10 rxrpc: Trace protocol errors in received packets
Add a tracepoint (rxrpc_rx_proto) to record protocol errors in received
packets.  The following changes are made:

 (1) Add a function, __rxrpc_abort_eproto(), to note a protocol error on a
     call and mark the call aborted.  This is wrapped by
     rxrpc_abort_eproto() that makes the why string usable in trace.

 (2) Add trace_rxrpc_rx_proto() or rxrpc_abort_eproto() to protocol error
     generation points, replacing rxrpc_abort_call() with the latter.

 (3) Only send an abort packet in rxkad_verify_packet*() if we actually
     managed to abort the call.

Note that a trace event is also emitted if a kernel user (e.g. afs) tries
to send data through a call when it's not in the transmission phase, though
it's not technically a receive event.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-04-06 11:09:39 +01:00
David Howells ef68622da9 rxrpc: Handle temporary errors better in rxkad security
In the rxkad security module, when we encounter a temporary error (such as
ENOMEM) from which we could conceivably recover, don't abort the
connection, but rather permit retransmission of the relevant packets to
induce a retry.

Note that I'm leaving some places that could be merged together to insert
tracing in the next patch.

Signed-off-by; David Howells <dhowells@redhat.com>
2017-04-06 10:11:59 +01:00
David Howells 84a4c09c38 rxrpc: Note a successfully aborted kernel operation
Make rxrpc_kernel_abort_call() return an indication as to whether it
actually aborted the operation or not so that kafs can trace the failure of
the operation.  Note that 'success' in this context means changing the
state of the call, not necessarily successfully transmitting an ABORT
packet.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-04-06 10:11:59 +01:00
David Howells 3a92789af0 rxrpc: Use negative error codes in rxrpc_call struct
Use negative error codes in struct rxrpc_call::error because that's what
the kernel normally deals with and to make the code consistent.  We only
turn them positive when transcribing into a cmsg for userspace recvmsg.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-04-06 10:11:56 +01:00
Al Viro e73a67f7cd don't open-code kernel_setsockopt()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-06 02:09:23 -04:00
Linus Torvalds ea6b1720ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Reject invalid updates to netfilter expectation policies, from Pablo
    Neira Ayuso.

 2) Fix memory leak in nfnl_cthelper, from Jeffy Chen.

 3) Don't do stupid things if we get a neigh_probe() on a neigh entry
    whose ops lack a solicit method. From Eric Dumazet.

 4) Don't transmit packets in r8152 driver when the carrier is off, from
    Hayes Wang.

 5) Fix ipv6 packet type detection in aquantia driver, from Pavel
    Belous.

 6) Don't write uninitialized data into hw registers in bna driver, from
    Arnd Bergmann.

 7) Fix locking in ping_unhash(), from Eric Dumazet.

 8) Make BPF verifier range checks able to understand certain sequences
    emitted by LLVM, from Alexei Starovoitov.

 9) Fix use after free in ipconfig, from Mark Rutland.

10) Fix refcount leak on force commit in openvswitch, from Jarno
    Rajahalme.

11) Fix various overflow checks in AF_PACKET, from Andrey Konovalov.

12) Fix endianness bug in be2net driver, from Suresh Reddy.

13) Don't forget to wake TX queues when processing a timeout, from
    Grygorii Strashko.

14) ARP header on-stack storage is wrong in flow dissector, from Simon
    Horman.

15) Lost retransmit and reordering SNMP stats in TCP can be
    underreported. From Yuchung Cheng.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (82 commits)
  nfp: fix potential use after free on xdp prog
  tcp: fix reordering SNMP under-counting
  tcp: fix lost retransmit SNMP under-counting
  sctp: get sock from transport in sctp_transport_update_pmtu
  net: ethernet: ti: cpsw: fix race condition during open()
  l2tp: fix PPP pseudo-wire auto-loading
  bnx2x: fix spelling mistake in macros HW_INTERRUT_ASSERT_SET_*
  l2tp: take reference on sessions being dumped
  tcp: minimize false-positives on TCP/GRO check
  sctp: check for dst and pathmtu update in sctp_packet_config
  flow dissector: correct size of storage for ARP
  net: ethernet: ti: cpsw: wake tx queues on ndo_tx_timeout
  l2tp: take a reference on sessions used in genetlink handlers
  l2tp: hold session while sending creation notifications
  l2tp: fix duplicate session creation
  l2tp: ensure session can't get removed during pppol2tp_session_ioctl()
  l2tp: fix race in l2tp_recv_common()
  sctp: use right in and out stream cnt
  bpf: add various verifier test cases for self-tests
  bpf, verifier: fix rejection of unaligned access checks for map_value_adj
  ...
2017-04-05 20:17:38 -07:00
Yuchung Cheng 2d2517ee31 tcp: fix reordering SNMP under-counting
Currently the reordering SNMP counters only increase if a connection
sees a higher degree then it has previously seen. It ignores if the
reordering degree is not greater than the default system threshold.
This significantly under-counts the number of reordering events
and falsely convey that reordering is rare on the network.

This patch properly and faithfully records the number of reordering
events detected by the TCP stack, just like the comment says "this
exciting event is worth to be remembered". Note that even so TCP
still under-estimate the actual reordering events because TCP
requires TS options or certain packet sequences to detect reordering
(i.e. ACKing never-retransmitted sequence in recovery or disordered
 state).

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 18:41:27 -07:00
Yuchung Cheng ecde8f36f8 tcp: fix lost retransmit SNMP under-counting
The lost retransmit SNMP stat is under-counting retransmission
that uses segment offloading. This patch fixes that so all
retransmission related SNMP counters are consistent.

Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 18:41:27 -07:00
David S. Miller 18148f09c0 linux-can-next-for-4.13-20170404
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEE4bay/IylYqM/npjQHv7KIOw4HPYFAljjveoTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRAe/sog7Dgc9tSZB/9DPsUOEbLzcJZ6HVRPJ3mJkCYf9jdD
 VE7vW3w79EcmcbkE7ULkyrEr/+GJs7GvA1vS+j8jphraVGOjP4DjeOm1OLJXylqa
 m2vaBpOqTOx3MdMdd/FVLlap7MKTX3f9J1WAOsGi5kJK3RR9EHRj5tKhoeG40OUk
 rMDg9juskT+XVqgawrcHyM/eVHZ4ny+BlN2LN0zajHT6l3xKiqQ+0iQltXnstymW
 djTfqlrbL+Ix6mlYKsK+joVWwbNUxuguto2m2CKVQukWFu2Q5wxe5YASjtMThQcv
 vmq2LD45cFIfNhzgfTu2msoz2Cv613LFiMtHtrfjlbXtSZmFc2FqIbKc
 =NjgL
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-4.13-20170404' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2017-03-03

this is a pull request of 5 patches for net-next/master.

There are two patches by Yegor Yefremov which convert the ti_hecc
driver into a DT only driver, as there is no in-tree user of the old
platform driver interface anymore. The next patch by Mario Kicherer
adds network namespace support to the can subsystem. The last two
patches by Akshay Bhat add support for the holt_hi311x SPI CAN driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 09:56:22 -07:00
Vlad Yasevich def12888c1 rtnl: Add support for netdev event to link messages
When netdev events happen, a rtnetlink_event() handler will send
messages for every event in it's white list.  These messages contain
current information about a particular device, but they do not include
the iformation about which event just happened.  The consumer of
the message has to try to infer this information.  In some cases
(ex: NETDEV_NOTIFY_PEERS), that is not possible.

This patch adds a new extension to RTM_NEWLINK message called IFLA_EVENT
that would have an encoding of the which event triggered this
message.  This would allow the the message consumer to easily determine
if it is interested in a particular event or not.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 08:14:14 -07:00
Vlad Yasevich 5138e86f17 rtnetlink: Convert rtnetlink_event to white list
The rtnetlink_event currently functions as a blacklist where
we block cerntain netdev events from being sent to user space.
As a result, events have been added to the system that userspace
probably doesn't care about.

This patch converts the implementation to the white list so that
newly events would have to be specifically added to the list to
be sent to userspace.  This would force new event implementers to
consider whether a given event is usefull to user space or if it's
just a kernel event.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 08:14:14 -07:00
Gao Feng 589c49cbf9 net: tcp: Define the TCP_MAX_WSCALE instead of literal number 14
Define one new macro TCP_MAX_WSCALE instead of literal number '14',
and use U16_MAX instead of 65535 as the max value of TCP window.
There is another minor change, use rounddown(space, mss) instead of
(space / mss) * mss;

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 07:50:32 -07:00
Xin Long 3ebfdf0821 sctp: get sock from transport in sctp_transport_update_pmtu
This patch is almost to revert commit 02f3d4ce9e ("sctp: Adjust PMTU
updates to accomodate route invalidation."). As t->asoc can't be NULL
in sctp_transport_update_pmtu, it could get sk from asoc, and no need
to pass sk into that function.

It is also to remove some duplicated codes from that function.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 07:20:06 -07:00
Andrey Vagin 457c79e544 netlink/diag: report flags for netlink sockets
cb_running is reported in /proc/self/net/netlink and it is reported by
the ss tool, when it gets information from the proc files.

sock_diag is a new interface which is used instead of proc files, so it
looks reasonable that this interface has to report no less information
about sockets than proc files.

We use these flags to dump and restore netlink sockets.

Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 07:13:56 -07:00
Dan Carpenter 2c099ccbd0 net: sched: choke: remove some dead code
We accidentally left this dead code behind after commit 5952fde10c
("net: sched: choke: remove dead filter classify code").

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 06:53:42 -07:00
Tobias Klauser ab044f8e3e batman-adv: Use net_device_stats from struct net_device
Instead of using a private copy of struct net_device_stats in struct
batadv_priv, use stats from struct net_device.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-04-05 15:41:24 +02:00
Andy Shevchenko 3267183c3d NFC: netlink: Use error code from nfc_activate_target()
It looks like a typo to assign a return code to a variable which is not
used. Found due to a compiler warning:

net/nfc/netlink.c: In function ‘nfc_genl_activate_target’:
net/nfc/netlink.c:903:6: warning: variable ‘rc’ set but not used [-Wunused-but-set-variable]
  int rc;
        ^~

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-04-05 10:15:17 +02:00
Guillaume Nault 249ee819e2 l2tp: fix PPP pseudo-wire auto-loading
PPP pseudo-wire type is 7 (11 is L2TP_PWTYPE_IP).

Fixes: f1f39f9110 ("l2tp: auto load type modules")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-04 10:08:42 -07:00
Guillaume Nault e08293a4cc l2tp: take reference on sessions being dumped
Take a reference on the sessions returned by l2tp_session_find_nth()
(and rename it l2tp_session_get_nth() to reflect this change), so that
caller is assured that the session isn't going to disappear while
processing it.

For procfs and debugfs handlers, the session is held in the .start()
callback and dropped in .show(). Given that pppol2tp_seq_session_show()
dereferences the associated PPPoL2TP socket and that
l2tp_dfs_seq_session_show() might call pppol2tp_show(), we also need to
call the session's .ref() callback to prevent the socket from going
away from under us.

Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Fixes: 0ad6614048 ("l2tp: Add debugfs files for dumping l2tp debug info")
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-04 10:00:56 -07:00
Sagi Grimberg b1a951fe46 net/utils: generic inet_pton_with_scope helper
Several locations in the stack need to handle ipv4/ipv6
(with scope) and port strings conversion to sockaddr.
Add a helper that takes either AF_INET, AF_INET6 or
AF_UNSPEC (for wildcard) to centralize this handling.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Mario Kicherer 8e8cda6d73 can: initial support for network namespaces
This patch adds initial support for network namespaces. The changes only
enable support in the CAN raw, proc and af_can code. GW and BCM still
have their checks that ensure that they are used only from the main
namespace.

The patch boils down to moving the global structures, i.e. the global
filter list and their /proc stats, into a per-namespace structure and passing
around the corresponding "struct net" in a lot of different places.

Changes since v1:
 - rebased on current HEAD (2bfe01e)
 - fixed overlong line

Signed-off-by: Mario Kicherer <dev@kicherer.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-04 17:35:58 +02:00
Alexey Dobriyan 822f9bb104 soreuseport: use "unsigned int" in __reuseport_alloc()
Number of sockets is limited by 16-bit, so 64-bit allocation will never
happen.

16-bit ops are the worst code density-wise on x86_64 because of
additional prefix (66).

Space savings:

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-3 (-3)
	function                                     old     new   delta
	reuseport_add_sock                           539     536      -3

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03 19:06:38 -07:00
Alexey Dobriyan ec2e45a978 flowcache: more "unsigned int"
Make ->hash_count, ->low_watermark and ->high_watermark unsigned int
and propagate unsignedness to other variables.

This change doesn't change code generation because these fields aren't
used in 64-bit contexts but make it anyway: these fields can't be
negative numbers.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03 19:04:48 -07:00
Alexey Dobriyan f31cc7e815 flowcache: make flow_cache_hash_size() return "unsigned int"
Hash size can't negative so "unsigned int" is logically correct.

Propagate "unsigned int" to loop counters.

Space savings:

	add/remove: 0/0 grow/shrink: 2/2 up/down: 6/-18 (-12)
	function                                     old     new   delta
	flow_cache_flush_tasklet                     362     365      +3
	__flow_cache_shrink                          333     336      +3
	flow_cache_cpu_up_prep                       178     171      -7
	flow_cache_lookup                           1159    1148     -11

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03 19:04:48 -07:00
Alexey Dobriyan 5a17d9ed9a flowcache: make flow_key_size() return "unsigned int"
Flow keys aren't 4GB+ numbers so 64-bit arithmetic is excessive.

Space savings (I'm not sure what CSWTCH is):

	add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-48 (-48)
	function                                     old     new   delta
	flow_cache_lookup                           1163    1159      -4
	CSWTCH                                     75997   75953     -44

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03 19:04:48 -07:00
Marcelo Ricardo Leitner 0b9aefea86 tcp: minimize false-positives on TCP/GRO check
Markus Trippelsdorf reported that after commit dcb17d22e1 ("tcp: warn
on bogus MSS and try to amend it") the kernel started logging the
warning for a NIC driver that doesn't even support GRO.

It was diagnosed that it was possibly caused on connections that were
using TCP Timestamps but some packets lacked the Timestamps option. As
we reduce rcv_mss when timestamps are used, the lack of them would cause
the packets to be bigger than expected, although this is a valid case.

As this warning is more as a hint, getting a clean-cut on the
threshold is probably not worth the execution time spent on it. This
patch thus alleviates the false-positives with 2 quick checks: by
accounting for the entire TCP option space and also checking against the
interface MTU if it's available.

These changes, specially the MTU one, might mask some real positives,
though if they are really happening, it's possible that sooner or later
it will be triggered anyway.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03 18:43:41 -07:00
Xin Long df2729c323 sctp: check for dst and pathmtu update in sctp_packet_config
This patch is to move sctp_transport_dst_check into sctp_packet_config
from sctp_packet_transmit and add pathmtu check in sctp_packet_config.

With this fix, sctp can update dst or pathmtu before appending chunks,
which can void dropping packets in sctp_packet_transmit when dst is
obsolete or dst's mtu is changed.

This patch is also to improve some other codes in sctp_packet_config.
It updates packet max_size with gso_max_size, checks for dst and
pathmtu, and appends ecne chunk only when packet is empty and asoc
is not NULL.

It makes sctp flush work better, as we only need to set up them once
for one flush schedule. It's also safe, since asoc is NULL only when
the packet is created by sctp_ootb_pkt_new in which it just gets the
new dst, no need to do more things for it other than set packet with
transport's pathmtu.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03 14:54:33 -07:00
Xin Long d229d48d18 sctp: add SCTP_PR_STREAM_STATUS sockopt for prsctp
Before when implementing sctp prsctp, SCTP_PR_STREAM_STATUS wasn't
added, as it needs to save abandoned_(un)sent for every stream.

After sctp stream reconf is added in sctp, assoc has structure
sctp_stream_out to save per stream info.

This patch is to add SCTP_PR_STREAM_STATUS by putting the prsctp
per stream statistics into sctp_stream_out.

v1->v2:
  fix an indent issue.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03 14:52:35 -07:00
Simon Horman ac6a3722fe flow dissector: correct size of storage for ARP
The last argument to __skb_header_pointer() should be a buffer large
enough to store struct arphdr. This can be a pointer to a struct arphdr
structure. The code was previously using a pointer to a pointer to
struct arphdr.

By my counting the storage available both before and after is 8 bytes on
x86_64.

Fixes: 55733350e5 ("flow disector: ARP support")
Reported-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03 14:46:45 -07:00
Sven Eckelmann 5405e19e70 batman-adv: Group ethtool related code together
The ethtool code was spread in soft-interface.c. This makes reading the
code and working on it unnecessary complicated. Having everything in a
common place next to the other code which references it, makes it slightly
easier.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-04-03 12:58:29 +02:00
Sven Eckelmann e2790a4b27 batman-adv: Remove ethtool .get_settings stub
The .get_settings function pointer and the related API was deprecated.
Fortunately, batman-adv is a virtual interface and never provided any
useful information via .get_settings. The stub can therefore be
removed.

This also avoids that incorrect information is shown in ethtool about the
batadv interface.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-04-03 12:58:23 +02:00
Sven Eckelmann 40ad9842cb batman-adv: Remove ethtool msglevel functions
batadv devices don't support msglevel. The ethtool stubs therefore returned
that it isn't supported. But instead, the complete function can be dropped
to avoid that bogus values are shown in ethtool.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-04-03 12:58:17 +02:00
Sven Eckelmann 2f249e99c7 batman-adv: Use ethtool helper to get link status
The ethtool_ops of batman-adv never contained more than a stub for the
get_link function pointer. It was always returning that a link exists even
when the devices was not yet up and therefore nothing resampling a link
could have been available.

Instead use the ethtool helper which returns the current carrier state.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-04-03 12:57:06 +02:00
Yuejie Shi 89e357d83c af_key: Add lock to key dump
A dump may come in the middle of another dump, modifying its dump
structure members. This race condition will result in NULL pointer
dereference in kernel. So add a lock to prevent that race.

Fixes: 83321d6b98 ("[AF_KEY]: Dump SA/SP entries non-atomically")
Signed-off-by: Yuejie Shi <syjcnss@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-03 08:05:17 +02:00
Sowmini Varadhan 087d975353 rds: tcp: canonical connection order for all paths with index > 0
The rds_connect_worker() has a bug in the check that enforces the
canonical connection order described in the comments of
rds_tcp_state_change(). The intention is to make sure that all
the multipath connections are always initiated by the smaller IP
address via rds_start_mprds. To achieve this, rds_connection_worker
should check that cp_index > 0.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-02 19:41:00 -07:00
Sowmini Varadhan e97656d03c rds: tcp: allow progress of rds_conn_shutdown if the rds_connection is marked ERROR by an intervening FIN
rds_conn_shutdown() runs in workq context, and marks the rds_connection
as DISCONNECTING before quiescing Tx/Rx paths. However, after all I/O
has quiesced, we may still find the rds_connection state to be
RDS_CONN_ERROR if an intervening FIN was processed in softirq context.

This is not a fatal error: rds_conn_shutdown() should continue the
shutdown, and there is no need to log noisy messages about this event.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-02 19:41:00 -07:00
Al Viro 3278682123 make skb_copy_datagram_msg() et.al. preserve ->msg_iter on error
Fixes the mess observed in e.g. rsync over a noisy link we'd been
seeing since last Summer.  What happens is that we copy part of
a datagram before noticing a checksum mismatch.  Datagram will be
resent, all right, but we want the next try go into the same place,
not after it...

All this family of primitives (copy/checksum and copy a datagram
into destination) is "all or nothing" sort of interface - either
we get 0 (meaning that copy had been successful) or we get an
error (and no way to tell how much had been copied before we ran
into whatever error it had been).  Make all of them leave iterator
unadvanced in case of errors - all callers must be able to cope
with that (an error might've been caught before the iterator had
been advanced), it costs very little to arrange, it's safer for
callers and actually fixes at least one bug in said callers.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-02 12:10:57 -04:00
David Ahern 1511009cd6 net: mpls: Increase max number of labels for lwt encap
Alow users to push down more labels per MPLS encap. Similar to LSR case,
move label array to the end of mpls_iptunnel_encap and allocate based on
the number of labels for the route.

For consistency with the LSR case, re-use the same maximum number of
labels.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:21:44 -07:00
David Ahern a4ac8c986d net: mpls: bump maximum number of labels
Allow users to push down more labels per MPLS route. With the previous
patches, no memory allocations are based on MAX_NEW_LABELS; the limit
is only used to keep userspace in check.

At this point MAX_NEW_LABELS is only used for mpls_route_config (copying
route data from userspace) and processing nexthops looking for the max
number of labels across the route spec.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:21:44 -07:00
David Ahern df1c631648 net: mpls: Limit memory allocation for mpls_route
Limit memory allocation size for mpls_route to 4096.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:21:44 -07:00
David Ahern 59b209667a net: mpls: change mpls_route layout
Move labels to the end of mpls_nh as a 0-sized array and within mpls_route
move the via for a nexthop after the mpls_nh. The new layout becomes:

   +----------------------+
   | mpls_route           |
   +----------------------+
   | mpls_nh 0            |
   +----------------------+
   | alignment padding    |   4 bytes for odd number of labels; 0 for even
   +----------------------+
   | via[rt_max_alen] 0   |
   +----------------------+
   | alignment padding    |   via's aligned on sizeof(unsigned long)
   +----------------------+
   | ...                  |
   +----------------------+
   | mpls_nh n-1          |
   +----------------------+
   | via[rt_max_alen] n-1 |
   +----------------------+

Memory allocated for nexthop + via is constant across all nexthops and
their via. It is based on the maximum number of labels across all nexthops
and the maximum via length. The size is saved in the mpls_route as
rt_nh_size. Accessing a nexthop becomes rt->rt_nh + index * rt->rt_nh_size.

The offset of the via address from a nexthop is saved as rt_via_offset
so that given an mpls_nh pointer the via for that hop is simply
nh + rt->rt_via_offset.

With prior code, memory allocated per mpls_route with 1 nexthop:
     via is an ethernet address - 64 bytes
     via is an ipv4 address     - 64
     via is an ipv6 address     - 72

With this patch set, memory allocated per mpls_route with 1 nexthop and
1 or 2 labels:
     via is an ethernet address - 56 bytes
     via is an ipv4 address     - 56
     via is an ipv6 address     - 64

The 8-byte reduction is due to the previous patch; the change introduced
by this patch has no impact on the size of allocations for 1 or 2 labels.

Performance impact of this change was examined using network namespaces
with veth pairs connecting namespaces. ns0 inserts the packet to the
label-switched path using an lwt route with encap mpls. ns1 adds 1 or 2
labels depending on test, ns2 (and ns3 for 2-label test) pops the label
and forwards. ns3 (or ns4) for a 2-label is the destination. Similar
series of namespaces used for 2-nexthop test.

Intent is to measure changes to latency (overhead in manipulating the
packet) in the forwarding path. Tests used netperf with UDP_RR.

IPv4:                     current   patches
   1 label, 1 nexthop      29908     30115
   2 label, 1 nexthop      29071     29612
   1 label, 2 nexthop      29582     29776
   2 label, 2 nexthop      29086     29149

IPv6:                     current   patches
   1 label, 1 nexthop      24502     24960
   2 label, 1 nexthop      24041     24407
   1 label, 2 nexthop      23795     23899
   2 label, 2 nexthop      23074     22959

In short, the change has no effect to a modest increase in performance.
This is expected since this patch does not really have an impact on routes
with 1 or 2 labels (the current limit) and 1 or 2 nexthops.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:21:44 -07:00
David Ahern 77ef013aad net: mpls: Convert number of nexthops to u8
Number of nexthops and number of alive nexthops are tracked using an
unsigned int. A route should never have more than 255 nexthops so
convert both to u8. Update all references and intermediate variables
to consistently use u8 as well.

Shrinks the size of mpls_route from 32 bytes to 24 bytes with a 2-byte
hole before the nexthops.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:21:44 -07:00
David Ahern 39eb8cd175 net: mpls: rt_nhn_alive and nh_flags should be accessed using READ_ONCE
The number of alive nexthops for a route (rt->rt_nhn_alive) and the
flags for a next hop (nh->nh_flags) are modified by netdev event
handlers. The event handlers run with rtnl_lock held so updates are
always done with the lock held. The packet path accesses the fields
under the rcu lock. Since those fields can change at any moment in
the packet path, both fields should be accessed using READ_ONCE. Updates
to both fields should use WRITE_ONCE.

Update mpls_select_multipath (packet path) and mpls_ifdown and mpls_ifup
(event handlers) accordingly.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:21:44 -07:00
Guillaume Nault 2777e2ab5a l2tp: take a reference on sessions used in genetlink handlers
Callers of l2tp_nl_session_find() need to hold a reference on the
returned session since there's no guarantee that it isn't going to
disappear from under them.

Relying on the fact that no l2tp netlink message may be processed
concurrently isn't enough: sessions can be deleted by other means
(e.g. by closing the PPPOL2TP socket of a ppp pseudowire).

l2tp_nl_cmd_session_delete() is a bit special: it runs a callback
function that may require a previous call to session->ref(). In
particular, for ppp pseudowires, the callback is l2tp_session_delete(),
which then calls pppol2tp_session_close() and dereferences the PPPOL2TP
socket. The socket might already be gone at the moment
l2tp_session_delete() calls session->ref(), so we need to take a
reference during the session lookup. So we need to pass the do_ref
variable down to l2tp_session_get() and l2tp_session_get_by_ifname().

Since all callers have to be updated, l2tp_session_find_by_ifname() and
l2tp_nl_session_find() are renamed to reflect their new behaviour.

Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:16:41 -07:00
Guillaume Nault 5e6a9e5a35 l2tp: hold session while sending creation notifications
l2tp_session_find() doesn't take any reference on the returned session.
Therefore, the session may disappear while sending the notification.

Use l2tp_session_get() instead and decrement session's refcount once
the notification is sent.

Fixes: 33f72e6f0c ("l2tp : multicast notification to the registered listeners")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:16:41 -07:00
Guillaume Nault dbdbc73b44 l2tp: fix duplicate session creation
l2tp_session_create() relies on its caller for checking for duplicate
sessions. This is racy since a session can be concurrently inserted
after the caller's verification.

Fix this by letting l2tp_session_create() verify sessions uniqueness
upon insertion. Callers need to be adapted to check for
l2tp_session_create()'s return code instead of calling
l2tp_session_find().

pppol2tp_connect() is a bit special because it has to work on existing
sessions (if they're not connected) or to create a new session if none
is found. When acting on a preexisting session, a reference must be
held or it could go away on us. So we have to use l2tp_session_get()
instead of l2tp_session_find() and drop the reference before exiting.

Fixes: d9e31d17ce ("l2tp: Add L2TP ethernet pseudowire support")
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:16:41 -07:00
Guillaume Nault 57377d6354 l2tp: ensure session can't get removed during pppol2tp_session_ioctl()
Holding a reference on session is required before calling
pppol2tp_session_ioctl(). The session could get freed while processing the
ioctl otherwise. Since pppol2tp_session_ioctl() uses the session's socket,
we also need to take a reference on it in l2tp_session_get().

Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:16:41 -07:00
Guillaume Nault 61b9a04772 l2tp: fix race in l2tp_recv_common()
Taking a reference on sessions in l2tp_recv_common() is racy; this
has to be done by the callers.

To this end, a new function is required (l2tp_session_get()) to
atomically lookup a session and take a reference on it. Callers then
have to manually drop this reference.

Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:16:41 -07:00
Xin Long afe89962ee sctp: use right in and out stream cnt
Since sctp reconf was added in sctp, the real cnt of in/out stream
have not been c.sinit_max_instreams and c.sinit_num_ostreams any
more.

This patch is to replace them with stream->in/outcnt.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:12:30 -07:00
Tobias Regnery 768bfa2a06 net: dsa: fix build error with devlink build as module
After commit 96567d5dac ("net: dsa: dsa2: Add basic support of devlink")
I see the following link error with CONFIG_NET_DSA=y and CONFIG_NET_DEVLINK=m:

net/built-in.o: In function 'dsa_register_switch':
(.text+0xe226b): undefined reference to `devlink_alloc'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe2284): undefined reference to `devlink_register'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe243e): undefined reference to `devlink_port_register'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe24e1): undefined reference to `devlink_port_register'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe24fa): undefined reference to `devlink_port_type_eth_set'
net/built-in.o: In function 'dsa_dst_unapply.part.8':
dsa2.c:(.text.unlikely+0x345): undefined reference to 'devlink_port_unregister'
dsa2.c:(.text.unlikely+0x36c): undefined reference to 'devlink_port_unregister'
dsa2.c:(.text.unlikely+0x38e): undefined reference to 'devlink_port_unregister'
dsa2.c:(.text.unlikely+0x3f2): undefined reference to 'devlink_unregister'
dsa2.c:(.text.unlikely+0x3fb): undefined reference to 'devlink_free'

Fix this by adding a dependency on MAY_USE_DEVLINK so that CONFIG_NET_DSA
get switched to be build as module when CONFIG_NET_DEVLINK=m.

Fixes: 96567d5dac ("net: dsa: dsa2: Add basic support of devlink")
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 20:10:20 -07:00
OGAWA Hirofumi 85a2566d70 nfc: Send same info for both of NFC_CMD_GET_DEVICE and NFC_EVENT_DEVICE_ADDED
Now, NFC_EVENT_DEVICE_ADDED doesn't send NFC_ATTR_RF_MODE. But
NFC_CMD_GET_DEVICE send.

To get NFC_ATTR_RF_MODE, we have to call NFC_CMD_GET_DEVICE just for
NFC_ATTR_RF_MODE when get NFC_EVENT_DEVICE_ADDED.

This fixes those inconsistent.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-04-01 23:04:29 +02:00
David S. Miller 612307c6be Two fixes:
* don't block netdev queues (indefinitely!) if mac80211
    manages traffic queueing itself
  * check wiphy registration before checking for ops
    on resume, to avoid crash
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAljd8ykACgkQa3t4Rpy0
 AB2T3g/+NuWjXWrIElOrbUjUbybobIbwAlQGQk73NYNORUwP1l3IljF+CmfI9jeF
 Azvn+ddm6mYoRJZPQKR9r2U5KZCg363C4BqRr72M6lymq97CBJzjJJG+6WA6nrYS
 PD1uBz1jpgoEKsbQ4skVGvLnrTLX45kj82kPNrY3JZP+i16wTcXEEmUN0gp/cEcv
 DYuacMOjqmKsNqmGn58ms3rwx9oTpsJlkv33YueD2DEDKEPkqgqDppRAJptDANAA
 YqY5ZanP46qfdKN35En3JjVaOpC7UPT+X078XGoL4u9QlNnznm6G3of65D7RGwhp
 YswYcpsBkMv60kY938LZR8yHkMX1mzZn5NK3pzuB5N0+cg0dtdTLpK5hiyhuwLg9
 EljLrCgNCQXneMCoLYBLnhh70+9OkVjMJeQxsN6OYZGCgHo6/bXeLAzxtJ9KLB8e
 mUclQi+rxMuAiatEiKyLGhbBgge7a93OgAc9l6it2ODJKFDOnE36AR2VBlKPfjgy
 aO3B9IAW/ojBh7nCFgESfRqts48oA5HANia1TRS8a0pLzPOB/CojDmXnwHKj5Baw
 32xukZPKeiFw3K/jZJUkoRX9NmgwE1fdUqMzSIPCXqNt1MyrPOoLr0Dp/karO1Qt
 TmIDq1eMWf0w2oNRHkRjDqay3Fr1dmVXjGOs2m8p/NAYPkigp/0=
 =c49j
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2017-03-31' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Two fixes:
 * don't block netdev queues (indefinitely!) if mac80211
   manages traffic queueing itself
 * check wiphy registration before checking for ops
   on resume, to avoid crash
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 13:31:15 -07:00
Alexei Starovoitov 1cf1cae963 bpf: introduce BPF_PROG_TEST_RUN command
development and testing of networking bpf programs is quite cumbersome.
Despite availability of user space bpf interpreters the kernel is
the ultimate authority and execution environment.
Current test frameworks for TC include creation of netns, veth,
qdiscs and use of various packet generators just to test functionality
of a bpf program. XDP testing is even more complicated, since
qemu needs to be started with gro/gso disabled and precise queue
configuration, transferring of xdp program from host into guest,
attaching to virtio/eth0 and generating traffic from the host
while capturing the results from the guest.

Moreover analyzing performance bottlenecks in XDP program is
impossible in virtio environment, since cost of running the program
is tiny comparing to the overhead of virtio packet processing,
so performance testing can only be done on physical nic
with another server generating traffic.

Furthermore ongoing changes to user space control plane of production
applications cannot be run on the test servers leaving bpf programs
stubbed out for testing.

Last but not least, the upstream llvm changes are validated by the bpf
backend testsuite which has no ability to test the code generated.

To improve this situation introduce BPF_PROG_TEST_RUN command
to test and performance benchmark bpf programs.

Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 12:45:57 -07:00
Vivien Didelot 40ef2c9339 net: dsa: add cross-chip bridging operations
Introduce crosschip_bridge_{join,leave} operations in the dsa_switch_ops
structure, which can be used by switches supporting interconnection.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 12:22:57 -07:00
Yi-Hung Wei 6f56f6186c openvswitch: Fix ovs_flow_key_update()
ovs_flow_key_update() is called when the flow key is invalid, and it is
used to update and revalidate the flow key. Commit 329f45bc4f
("openvswitch: add mac_proto field to the flow key") introduces mac_proto
field to flow key and use it to determine whether the flow key is valid.
However, the commit does not update the code path in ovs_flow_key_update()
to revalidate the flow key which may cause BUG_ON() on execute_recirc().
This patch addresses the aforementioned issue.

Fixes: 329f45bc4f ("openvswitch: add mac_proto field to the flow key")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 12:16:46 -07:00
Linus Torvalds 09c8b3d1d6 The restriction of NFSv4 to TCP went overboard and also broke the
backchannel; fix.  Also some minor refinements to the nfsd
 version-setting interface that we'd like to get fixed before release.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJY3weFAAoJECebzXlCjuG+fQYQAKh7QVB/vDZwvjWZ7OGSBhoe
 LJfS1mzzf52MfQk6qNEM9to86qu+ywmH4XD2rvgMdif6Kc0qF1xBzzVEnp527lVW
 nt1nLq/oIXeRl7a+5p+FxTnqHn0OBH+iCnxcVZI8tvHgptpR+6TwEzR36k4n4esN
 kvNrrv9dFIoBGx0nSBScHTC3zNArU+C9oZTTHAuPJICNBHsOMqYF7sSAXPg9NFR+
 HnowZfGWWXpSnWZ9DaM14Zb/dX0B9Pexv1MsgfiKYS9Beh7g4JN48+CHtWyVliwd
 M/LVBpT2ZwFM/NzvJ2exVIqm/hwj4El2Sjy67XHwQzDvGjnUn/fz55SfXfzZSMyD
 PMj+IeHKuT3jipNui1AXAlzYEz8gPvuKQ0vQ0vVuNX4Ln28KydGwvVTkUNltDnfq
 E7L7RI6mj03OY1j1p6zeK6UJeueZq1gmjfq1NVTPO+TgQFPhh8A50NsDEVcaiwNN
 W8uX7Qa39y79BT+4OFuYL05AbuqxKR+nAmCbVSLy9Kq4sc/6/YErBZtXXAghzPPl
 4Es4tzlAd8skkxWlcVeCeUqPfcDCqHL6xKqa62m5wUuYXmqPtIMyAyVutF4lWpH/
 dAV5Lcjz7HeQnpCgFZXXtIQW5OIIfFa08s0f7fuFm+uxz+nM/x8gg99tuwWP6OsT
 Za7EtdB84M1ZGgjO3JhY
 =oi0+
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "The restriction of NFSv4 to TCP went overboard and also broke the
  backchannel; fix.

  Also some minor refinements to the nfsd version-setting interface that
  we'd like to get fixed before release"

* tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux:
  svcrdma: set XPT_CONG_CTRL flag for bc xprt
  NFSD: fix nfsd_reset_versions for NFSv4.
  NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
  NFSD: further refinement of content of /proc/fs/nfsd/versions
  nfsd: map the ENOKEY to nfserr_perm for avoiding warning
  SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt
2017-04-01 10:43:37 -07:00
Vidyullatha Kanchanapally a3caf7440d cfg80211: Add support for FILS shared key authentication offload
Enhance nl80211 and cfg80211 connect request and response APIs to
support FILS shared key authentication offload. The new nl80211
attributes can be used to provide additional information to the driver
to establish a FILS connection. Also enhance the set/del PMKSA to allow
support for adding and deleting PMKSA based on FILS cache identifier.

Add a new feature flag that drivers can use to advertize support for
FILS shared key authentication and association in station mode when
using their own SME.

Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-31 08:32:23 +02:00
Vidyullatha Kanchanapally 5349a0f7bf cfg80211: Use a structure to pass connect response params
Currently the connect event from driver takes all the connection
response parameters as arguments. With support for new features these
response parameters can grow. Use a structure to pass these parameters
rather than passing them as function arguments.

Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[add to documentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-31 08:31:26 +02:00
Paolo Abeni 6c7c98bad4 sock: avoid dirtying sk_stamp, if possible
sock_recv_ts_and_drops() unconditionally set sk->sk_stamp for
every packet, even if the SOCK_TIMESTAMP flag is not set in the
related socket.
If selinux is enabled, this cause a cache miss for every packet
since sk->sk_stamp and sk->sk_security share the same cacheline.
With this change sk_stamp is set only if the SOCK_TIMESTAMP
flag is set, and is cleared for the first packet, so that the user
perceived behavior is unchanged.

This gives up to 5% speed-up under udp-flood with small packets.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 20:05:24 -07:00
Gao Feng 1935299d9c net: tcp: Refine the __tcp_select_window
1. Move the "window = tp->rcv_wnd;" into the condition block without
tp->rx_opt.rcv_wscale.
Because it is unnecessary when enable wscale;

2. Use the macro ALIGN instead of two statements.
The two statements are used to make window align to 1<<wscale.
Use the ALIGN is more clearer.

3. Use the rounddown to make codes clearer.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 15:41:32 -07:00
Xin Long 3dbcc105d5 sctp: alloc stream info when initializing asoc
When sending a msg without asoc established, sctp will send INIT packet
first and then enqueue chunks.

Before receiving INIT_ACK, stream info is not yet alloced. But enqueuing
chunks needs to access stream info, like out stream state and out stream
cnt.

This patch is to fix it by allocing out stream info when initializing an
asoc, allocing in stream and re-allocing out stream when processing init.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 11:08:47 -07:00
Colin Ian King ed8bfd5c1c VSOCK: remove unnecessary ternary operator on return value
Rather than assign the positive errno values to ret and then
checking if it is positive and flip the sign, just return the
errno value.

Detected by CoverityScan, CID#986649 ("Logically Dead Code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 11:07:08 -07:00
Florian Westphal 282ccf6efb drivers: add explicit interrupt.h includes
These files all use functions declared in interrupt.h, but currently rely
on implicit inclusion of this file (via netns/xfrm.h).

That won't work anymore when the flow cache is removed so include that
header where needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 11:05:34 -07:00
Andrey Konovalov bcc5364bdc net/packet: fix overflow in check for tp_reserve
When calculating po->tp_hdrlen + po->tp_reserve the result can overflow.

Fix by checking that tp_reserve <= INT_MAX on assign.

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 11:04:00 -07:00
Andrey Konovalov 8f8d28e4d6 net/packet: fix overflow in check for tp_frame_nr
When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow.

Add a check that tp_block_size * tp_block_nr <= UINT_MAX.

Since frames_per_block <= tp_block_size, the expression would
never overflow.

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 11:03:59 -07:00
Andrey Konovalov 2b6867c2ce net/packet: fix overflow in check for priv area size
Subtracting tp_sizeof_priv from tp_block_size and casting to int
to check whether one is less then the other doesn't always work
(both of them are unsigned ints).

Compare them as is instead.

Also cast tp_sizeof_priv to u64 before using BLK_PLUS_PRIV, as
it can overflow inside BLK_PLUS_PRIV otherwise.

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 11:03:59 -07:00
Arushi Singhal e241137699 ipvs: remove unused variable
This patch uses the following coccinelle script to remove
a variable that was simply used to store the return
value of a function call before returning it:

@@
identifier len,f;
@@

-int len;
 ... when != len
     when strict
-len =
+return
        f(...);
-return len;

Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
2017-03-30 14:54:47 +02:00
Varsha Rao 848850a3e9 netfilter: ipvs: Replace kzalloc with kcalloc.
Replace kzalloc with kcalloc. As kcalloc is preferred for allocating an
array instead of kzalloc. This patch fixes the checkpatch issue.

Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
2017-03-30 14:44:14 +02:00
Florian Westphal 41390895e5 netfilter: ipvs: don't check for presence of nat extension
Check for the NAT status bits, they are set once conntrack needs NAT in source or
reply direction, this is slightly faster than nfct_nat() as that has to check the
extension area.

Signed-off-by: Florian Westphal <fw@strlen.de>
2017-03-30 14:41:42 +02:00
David S. Miller 8f1f7eeb22 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains a rather large update with Netfilter
fixes, specifically targeted to incorrect RCU usage in several spots and
the userspace conntrack helper infrastructure (nfnetlink_cthelper),
more specifically they are:

1) expect_class_max is incorrect set via cthelper, as in kernel semantics
   mandate that this represents the array of expectation classes minus 1.
   Patch from Liping Zhang.

2) Expectation policy updates via cthelper are currently broken for several
   reasons: This code allows illegal changes in the policy such as changing
   the number of expeciation classes, it is leaking the updated policy and
   such update occurs with no RCU protection at all. Fix this by adding a
   new nfnl_cthelper_update_policy() that describes what is really legal on
   the update path.

3) Fix several memory leaks in cthelper, from Jeffy Chen.

4) synchronize_rcu() is missing in the removal path of several modules,
   this may lead to races since CPU may still be running on code that has
   just gone. Also from Liping Zhang.

5) Don't use the helper hashtable from cthelper, it is not safe to walk
   over those bits without the helper mutex. Fix this by introducing a
   new independent list for userspace helpers. From Liping Zhang.

6) nf_ct_extend_unregister() needs synchronize_rcu() to make sure no
   packets are walking on any conntrack extension that is gone after
   module removal, again from Liping.

7) nf_nat_snmp may crash if we fail to unregister the helper due to
   accidental leftover code, from Gao Feng.

8) Fix leak in nfnetlink_queue with secctx support, from Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 14:35:25 -07:00
Erik Hugne 66bc1e8d5d tipc: allow rdm/dgram socketpairs
for socketpairs using connectionless transport, we cache
the respective node local TIPC portid to use in subsequent
calls to send() in the socket's private data.

Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 14:10:11 -07:00
Erik Hugne 70b03759e9 tipc: add support for stream/seqpacket socketpairs
sockets A and B are connected back-to-back, similar to what
AF_UNIX does.

Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 14:10:11 -07:00
Linus Torvalds 52b9c81680 Merge branch 'apw' (xfrm_user fixes)
Merge xfrm_user validation fixes from Andy Whitcroft:
 "Two patches we are applying to Ubuntu for XFRM_MSG_NEWAE validation
  issue reported by ZDI.

  The first of these is the primary fix, and the second is for a more
  theoretical issue that Kees pointed out when reviewing the first"

* emailed patches from Andy Whitcroft <apw@canonical.com>:
  xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder
  xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window
2017-03-29 13:26:22 -07:00
David Ahern e944e97afc net: mpls: Update lfib_nlmsg_size to skip deleted nexthops
A recent commit skips nexthops in a route if the device has been
deleted. Update lfib_nlmsg_size accordingly.

Reported-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 10:42:38 -07:00
Guillaume Nault e91793bb61 l2tp: purge socket queues in the .destruct() callback
The Rx path may grab the socket right before pppol2tp_release(), but
nothing guarantees that it will enqueue packets before
skb_queue_purge(). Therefore, the socket can be destroyed without its
queues fully purged.

Fix this by purging queues in pppol2tp_session_destruct() where we're
guaranteed nothing is still referencing the socket.

Fixes: 9e9cb6221a ("l2tp: fix userspace reception on plain L2TP sockets")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 09:26:28 -07:00
Guillaume Nault 94d7ee0baa l2tp: hold tunnel socket when handling control frames in l2tp_ip and l2tp_ip6
The code following l2tp_tunnel_find() expects that a new reference is
held on sk. Either sk_receive_skb() or the discard_put error path will
drop a reference from the tunnel's socket.

This issue exists in both l2tp_ip and l2tp_ip6.

Fixes: a3c18422a4 ("l2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv()")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 09:26:28 -07:00
Andy Whitcroft f843ee6dd0 xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder
Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues.  To ensure we are correctly ensuring that the two ESN
structures are the same size compare both the overall size as reported
by xfrm_replay_state_esn_len() and the internal length are the same.

CVE-2017-7184
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-29 08:40:15 -07:00
Andy Whitcroft 677e806da4 xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window
When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer.  However later it is possible to update this replay_esn via a
XFRM_MSG_NEWAE call.  There we again validate the size of the supplied
buffer matches the existing state and if so inject the contents.  We do
not at this point check that the replay_window is within the allocated
memory.  This leads to out-of-bounds reads and writes triggered by
netlink packets.  This leads to memory corruption and the potential for
priviledge escalation.

We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len().  This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn.  It however does not check the replay_window
remains within that buffer.  Add validation of the contained
replay_window.

CVE-2017-7184
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-29 08:40:06 -07:00
Johannes Berg 7d65f82954 mac80211: unconditionally start new netdev queues with iTXQ support
When internal mac80211 TXQs aren't supported, netdev queues must
always started out started even when driver queues are stopped
while the interface is added. This is necessary because with the
internal TXQ support netdev queues are never stopped and packet
scheduling/dropping is done in mac80211.

Cc: stable@vger.kernel.org # 4.9+
Fixes: 80a83cfc43 ("mac80211: skip netdev queue control with software queuing")
Reported-and-tested-by: Sven Eckelmann <sven.eckelmann@openmesh.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-29 14:20:40 +02:00
Liping Zhang 77c1c03c5b netfilter: nfnetlink_queue: fix secctx memory leak
We must call security_release_secctx to free the memory returned by
security_secid_to_secctx, otherwise memory may be leaked forever.

Fixes: ef493bd930 ("netfilter: nfnetlink_queue: add security context information")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-29 12:20:50 +02:00
Masashi Honma ed92a9b5d4 mac80211: mesh: drop new node with weak power
On some practical cases, it is useful to drop new node in the distance.
Because mesh metric is calculated with hop count and without RSSI
information, a node far from local peer and near to destination node
could be used as best path.

For example, the nodes are located in linear. Distance of 0 - 1 and
1 - 2 and 2 - 3 is 20meters. 0 to 3 signal is very weak.

    0 --- 1 --- 2 --- 3

Though most robust path from 0 to 3 is 0 -> 1 -> 2 -> 3,
unfortunately, node 0 could recognize node 3 as neighbor. Then node 3
could be next of node 0. This patch aims to avoid such a case.

[Johannes:]
Dropping the node entirely isn't ideal, but at least with encryption
there will be a limit on # of keys the hardware can deal with, and
there might also be a limit on the number of stations it supports.

Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-29 10:29:09 +02:00
Arend Van Spriel b3ef5520c1 cfg80211: check rdev resume callback only for registered wiphy
We got the following use-after-free KASAN report:

 BUG: KASAN: use-after-free in wiphy_resume+0x591/0x5a0 [cfg80211]
	 at addr ffff8803fc244090
 Read of size 8 by task kworker/u16:24/2587
 CPU: 6 PID: 2587 Comm: kworker/u16:24 Tainted: G    B 4.9.13-debug+
 Hardware name: Dell Inc. XPS 15 9550/0N7TVV, BIOS 1.2.19 12/22/2016
 Workqueue: events_unbound async_run_entry_fn
  ffff880425d4f9d8 ffffffffaeedb541 ffff88042b80ef00 ffff8803fc244088
  ffff880425d4fa00 ffffffffae84d7a1 ffff880425d4fa98 ffff8803fc244080
  ffff88042b80ef00 ffff880425d4fa88 ffffffffae84da3a ffffffffc141f7d9
 Call Trace:
  [<ffffffffaeedb541>] dump_stack+0x85/0xc4
  [<ffffffffae84d7a1>] kasan_object_err+0x21/0x70
  [<ffffffffae84da3a>] kasan_report_error+0x1fa/0x500
  [<ffffffffc141f7d9>] ? cfg80211_bss_age+0x39/0xc0 [cfg80211]
  [<ffffffffc141f83a>] ? cfg80211_bss_age+0x9a/0xc0 [cfg80211]
  [<ffffffffae48d46d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffffc13fb1c0>] ? wiphy_suspend+0xc70/0xc70 [cfg80211]
  [<ffffffffae84def1>] __asan_report_load8_noabort+0x61/0x70
  [<ffffffffc13fb100>] ? wiphy_suspend+0xbb0/0xc70 [cfg80211]
  [<ffffffffc13fb751>] ? wiphy_resume+0x591/0x5a0 [cfg80211]
  [<ffffffffc13fb751>] wiphy_resume+0x591/0x5a0 [cfg80211]
  [<ffffffffc13fb1c0>] ? wiphy_suspend+0xc70/0xc70 [cfg80211]
  [<ffffffffaf3b206e>] dpm_run_callback+0x6e/0x4f0
  [<ffffffffaf3b31b2>] device_resume+0x1c2/0x670
  [<ffffffffaf3b367d>] async_resume+0x1d/0x50
  [<ffffffffae3ee84e>] async_run_entry_fn+0xfe/0x610
  [<ffffffffae3d0666>] process_one_work+0x716/0x1a50
  [<ffffffffae3d05c9>] ? process_one_work+0x679/0x1a50
  [<ffffffffafdd7b6d>] ? _raw_spin_unlock_irq+0x3d/0x60
  [<ffffffffae3cff50>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
  [<ffffffffae3d1a80>] worker_thread+0xe0/0x1460
  [<ffffffffae3d19a0>] ? process_one_work+0x1a50/0x1a50
  [<ffffffffae3e54c2>] kthread+0x222/0x2e0
  [<ffffffffae3e52a0>] ? kthread_park+0x80/0x80
  [<ffffffffae3e52a0>] ? kthread_park+0x80/0x80
  [<ffffffffae3e52a0>] ? kthread_park+0x80/0x80
  [<ffffffffafdd86aa>] ret_from_fork+0x2a/0x40
 Object at ffff8803fc244088, in cache kmalloc-1024 size: 1024
 Allocated:
 PID = 71
  save_stack_trace+0x1b/0x20
  save_stack+0x46/0xd0
  kasan_kmalloc+0xad/0xe0
  kasan_slab_alloc+0x12/0x20
  __kmalloc_track_caller+0x134/0x360
  kmemdup+0x20/0x50
  brcmf_cfg80211_attach+0x10b/0x3a90 [brcmfmac]
  brcmf_bus_start+0x19a/0x9a0 [brcmfmac]
  brcmf_pcie_setup+0x1f1a/0x3680 [brcmfmac]
  brcmf_fw_request_nvram_done+0x44c/0x11b0 [brcmfmac]
  request_firmware_work_func+0x135/0x280
  process_one_work+0x716/0x1a50
  worker_thread+0xe0/0x1460
  kthread+0x222/0x2e0
  ret_from_fork+0x2a/0x40
 Freed:
 PID = 2568
  save_stack_trace+0x1b/0x20
  save_stack+0x46/0xd0
  kasan_slab_free+0x71/0xb0
  kfree+0xe8/0x2e0
  brcmf_cfg80211_detach+0x62/0xf0 [brcmfmac]
  brcmf_detach+0x14a/0x2b0 [brcmfmac]
  brcmf_pcie_remove+0x140/0x5d0 [brcmfmac]
  brcmf_pcie_pm_leave_D3+0x198/0x2e0 [brcmfmac]
  pci_pm_resume+0x186/0x220
  dpm_run_callback+0x6e/0x4f0
  device_resume+0x1c2/0x670
  async_resume+0x1d/0x50
  async_run_entry_fn+0xfe/0x610
  process_one_work+0x716/0x1a50
  worker_thread+0xe0/0x1460
  kthread+0x222/0x2e0
  ret_from_fork+0x2a/0x40
 Memory state around the buggy address:
  ffff8803fc243f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8803fc244000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff8803fc244080: fc fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                          ^
  ffff8803fc244100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8803fc244180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

What is happening is that brcmf_pcie_resume() detects a device that
is no longer responsive and it decides to unbind resulting in a
wiphy_unregister() and wiphy_free() call. Now the wiphy instance
remains allocated, because PM needs to call wiphy_resume() for it.
However, brcmfmac already does a kfree() for the struct
cfg80211_registered_device::ops field. Change the checks in
wiphy_resume() to only access the struct cfg80211_registered_device::ops
if the wiphy instance is still registered at this time.

Cc: stable@vger.kernel.org # 4.10.x, 4.9.x
Reported-by: Daniel J Blueman <daniel@quora.org>
Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Franky Lin <franky.lin@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-29 09:11:29 +02:00
Andrew Lunn 96567d5dac net: dsa: dsa2: Add basic support of devlink
Register the switch and its ports with devlink.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 22:46:04 -07:00
Andrew Lunn c6e970a04b net: break include loop netdevice.h, dsa.h, devlink.h
There is an include loop between netdevice.h, dsa.h, devlink.h because
of NETDEV_ALIGN, making it impossible to use devlink structures in
dsa.h.

Break this loop by taking dsa.h out of netdevice.h, add a forward
declaration of dsa_switch_tree and netdev_set_default_ethtool_ops()
function, which is what netdevice.h requires.

No longer having dsa.h in netdevice.h means the includes in dsa.h no
longer get included. This breaks a few other files which depend on
these includes. Add these directly in the affected file.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 22:46:04 -07:00
David Ahern 1182e4d0b4 net: mpls: Send netconf messages on device register and unregister
Send netconf notifications for MPLS when the device registers and
unregisters.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 22:32:43 -07:00
David Ahern 823566aee0 net:mpls: Refactor mpls_netconf_notify_devconf to take event
Refactor mpls_netconf_notify_devconf to take the event as an input arg.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 22:32:43 -07:00
David Ahern 2345217026 net: ipv6: Add support for RTM_DELNETCONF
Send RTM_DELNETCONF notifications when a device is deleted. The message only
needs the device index, so modify inet6_netconf_fill_devconf to skip devconf
references if it is NULL.

Allows a userspace cache to remove entries as devices are deleted.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 22:32:42 -07:00
David Ahern 85b3daada4 net: ipv6: Refactor inet6_netconf_notify_devconf to take event
Refactor inet6_netconf_notify_devconf to take the event as an input arg.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 22:32:42 -07:00
David Ahern b5c9641d3d net: devinet: Add support for RTM_DELNETCONF
Send RTM_DELNETCONF notifications when a device is deleted. The message only
needs the device index, so modify inet_netconf_fill_devconf to skip devconf
references if it is NULL.

Allows a userspace cache to remove entries as devices are deleted.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 22:32:42 -07:00
David Ahern 3b0228656d net: devinet: Refactor inet_netconf_notify_devconf to take event
Refactor inet_netconf_notify_devconf to take the event as an input arg.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 22:32:42 -07:00
Vivien Didelot 4333d619f9 net: dsa: fix copyright holder
I do not hold the copyright of the DSA core and drivers source files,
since these changes have been written as an initiative of my day job.
Fix this.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 22:04:51 -07:00
Vlad Yasevich 382ed72480 ipv6: add support for NETDEV_RESEND_IGMP event
This patch adds support for NETDEV_RESEND_IGMP event similar
to how it works for IPv4.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 22:02:21 -07:00
Jarno Rajahalme b768b16de5 openvswitch: Fix refcount leak on force commit.
The reference count held for skb needs to be released when the skb's
nfct pointer is cleared regardless of if nf_ct_delete() is called or
not.

Failing to release the skb's reference cound led to deferred conntrack
cleanup spinning forever within nf_conntrack_cleanup_net_list() when
cleaning up a network namespace:

   kworker/u16:0-19025 [004] 45981067.173642: sched_switch: kworker/u16:0:19025 [120] R ==> rcu_preempt:7 [120]
   kworker/u16:0-19025 [004] 45981067.173651: kernel_stack: <stack trace>
=> ___preempt_schedule (ffffffffa001ed36)
=> _raw_spin_unlock_bh (ffffffffa0713290)
=> nf_ct_iterate_cleanup (ffffffffc00a4454)
=> nf_conntrack_cleanup_net_list (ffffffffc00a5e1e)
=> nf_conntrack_pernet_exit (ffffffffc00a63dd)
=> ops_exit_list.isra.1 (ffffffffa06075f3)
=> cleanup_net (ffffffffa0607df0)
=> process_one_work (ffffffffa0084c31)
=> worker_thread (ffffffffa008592b)
=> kthread (ffffffffa008bee2)
=> ret_from_fork (ffffffffa071b67c)

Fixes: dd41d33f0b ("openvswitch: Add force commit.")
Reported-by: Yang Song <yangsong@vmware.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 21:56:11 -07:00
Chuck Lever 23abec20aa svcrdma: set XPT_CONG_CTRL flag for bc xprt
Same change as Kinglong Mee's fix for the TCP backchannel service.

Fixes: 5283b03ee5 ("nfs/nfsd/sunrpc: enforce transport...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-03-28 21:25:55 -04:00
Ying Xue 7efea60dcf tipc: adjust the policy of holding subscription kref
When a new subscription object is inserted into name_seq->subscriptions
list, it's under name_seq->lock protection; when a subscription is
deleted from the list, it's also under the same lock protection;
similarly, when accessing a subscription by going through subscriptions
list, the entire process is also protected by the name_seq->lock.

Therefore, if subscription refcount is increased before it's inserted
into subscriptions list, and its refcount is decreased after it's
deleted from the list, it will be unnecessary to hold refcount at all
before accessing subscription object which is obtained by going through
subscriptions list under name_seq->lock protection.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 18:03:33 -07:00
Ying Xue 139bb36f75 tipc: advance the time of deleting subscription from subscriber->subscrp_list
After a subscription object is created, it's inserted into its
subscriber subscrp_list list under subscriber lock protection,
similarly, before it's destroyed, it should be first removed from
its subscriber->subscrp_list. Since the subscription list is
accessed with subscriber lock, all the subscriptions are valid
during the lock duration. Hence in tipc_subscrb_subscrp_delete(), we
remove subscription get/put and the extra subscriber unlock/lock.

After this change, the subscriptions refcount cleanup is very simple
and does not access any lock.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 18:03:32 -07:00
Bjorn Andersson 5052de8def soc: qcom: smd: Transition client drivers from smd to rpmsg
By moving these client drivers to use RPMSG instead of the direct SMD
API we can reuse them ontop of the newly added GLINK wire-protocol
support found in the 820 and 835 Qualcomm platforms.

As the new (RPMSG-based) and old SMD implementations are mutually
exclusive we have to change all client drivers in one commit, to make
sure we have a working system before and after this transition.

Acked-by: Andy Gross <andy.gross@linaro.org>
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 17:58:07 -07:00
Xin Long f9ba3501d5 sctp: change to save MSG_MORE flag into assoc
David Laight noticed the support for MSG_MORE with datamsg->force_delay
didn't really work as we expected, as the first msg with MSG_MORE set
would always block the following chunks' dequeuing.

This Patch is to rewrite it by saving the MSG_MORE flag into assoc as
David Laight suggested.

asoc->force_delay is used to save MSG_MORE flag before a msg is sent.
All chunks in queue would not be sent out if asoc->force_delay is set
by the msg with MSG_MORE flag, until a new msg without MSG_MORE flag
clears asoc->force_delay.

Note that this change would not affect the flush is generated by other
triggers, like asoc->state != ESTABLISHED, queue size > pmtu etc.

v1->v2:
  Not clear asoc->force_delay after sending the msg with MSG_MORE flag.

Fixes: 4ea0c32f5f ("sctp: add support for MSG_MORE")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Laight <david.laight@aculab.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 17:56:15 -07:00
Arkadi Sharshevsky 1555d204e7 devlink: Support for pipeline debug (dpipe)
The pipeline debug is used to export the pipeline abstractions for the
main objects - tables, headers and entries. The only support for set is
for changing the counter parameter on specific table.

The basic structures:

Header - can represent a real protocol header information or internal
         metadata. Generic protocol headers like IPv4 can be shared
         between drivers. Each driver can add local headers.

Field - part of a header. Can represent protocol field or specific ASIC
        metadata field. Hardware special metadata fields can be mapped
        to different resources, for example switch ASIC ports can have
        internal number which from the systems point of view is mapped
        to netdeivce ifindex.

Match - represent specific match rule. Can describe match on specific
        field or header. The header index should be specified as well
        in order to support several header instances of the same type
        (tunneling).

Action - represents specific action rule. Actions can describe operations
         on specific field values for example like set, increment, etc.
         And header operation like add and delete.

Value - represents value which can be associated with specific match or
        action.

Table - represents a hardware block which can be described with match/
        action behavior. The match/action can be done on the packets
        data or on the internal metadata that it gathered along the
        packets traversal throw the pipeline which is vendor specific
        and should be exported in order to provide understanding of
        ASICs behavior.

Entry - represents single record in a specific table. The entry is
        identified by specific combination of values for match/action.

Prior to accessing the tables/entries the drivers provide the header/
field data base which is used by driver to user-space. The data base
is split between the shared headers and unique headers.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 17:11:54 -07:00
Mark Rutland ffefb6f4d6 net: ipconfig: fix ic_close_devs() use-after-free
Our chosen ic_dev may be anywhere in our list of ic_devs, and we may
free it before attempting to close others. When we compare d->dev and
ic_dev->dev, we're potentially dereferencing memory returned to the
allocator. This causes KASAN to scream for each subsequent ic_dev we
check.

As there's a 1-1 mapping between ic_devs and netdevs, we can instead
compare d and ic_dev directly, which implicitly handles the !ic_dev
case, and avoids the use-after-free. The ic_dev pointer may be stale,
but we will not dereference it.

Original splat:

[    6.487446] ==================================================================
[    6.494693] BUG: KASAN: use-after-free in ic_close_devs+0xc4/0x154 at addr ffff800367efa708
[    6.503013] Read of size 8 by task swapper/0/1
[    6.507452] CPU: 5 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc3-00002-gda42158 #8
[    6.514993] Hardware name: AppliedMicro Mustang/Mustang, BIOS 3.05.05-beta_rc Jan 27 2016
[    6.523138] Call trace:
[    6.525590] [<ffff200008094778>] dump_backtrace+0x0/0x570
[    6.530976] [<ffff200008094d08>] show_stack+0x20/0x30
[    6.536017] [<ffff200008bee928>] dump_stack+0x120/0x188
[    6.541231] [<ffff20000856d5e4>] kasan_object_err+0x24/0xa0
[    6.546790] [<ffff20000856d924>] kasan_report_error+0x244/0x738
[    6.552695] [<ffff20000856dfec>] __asan_report_load8_noabort+0x54/0x80
[    6.559204] [<ffff20000aae86ac>] ic_close_devs+0xc4/0x154
[    6.564590] [<ffff20000aaedbac>] ip_auto_config+0x2ed4/0x2f1c
[    6.570321] [<ffff200008084b04>] do_one_initcall+0xcc/0x370
[    6.575882] [<ffff20000aa31de8>] kernel_init_freeable+0x5f8/0x6c4
[    6.581959] [<ffff20000a16df00>] kernel_init+0x18/0x190
[    6.587171] [<ffff200008084710>] ret_from_fork+0x10/0x40
[    6.592468] Object at ffff800367efa700, in cache kmalloc-128 size: 128
[    6.598969] Allocated:
[    6.601324] PID = 1
[    6.603427]  save_stack_trace_tsk+0x0/0x418
[    6.607603]  save_stack_trace+0x20/0x30
[    6.611430]  kasan_kmalloc+0xd8/0x188
[    6.615087]  ip_auto_config+0x8c4/0x2f1c
[    6.619002]  do_one_initcall+0xcc/0x370
[    6.622832]  kernel_init_freeable+0x5f8/0x6c4
[    6.627178]  kernel_init+0x18/0x190
[    6.630660]  ret_from_fork+0x10/0x40
[    6.634223] Freed:
[    6.636233] PID = 1
[    6.638334]  save_stack_trace_tsk+0x0/0x418
[    6.642510]  save_stack_trace+0x20/0x30
[    6.646337]  kasan_slab_free+0x88/0x178
[    6.650167]  kfree+0xb8/0x478
[    6.653131]  ic_close_devs+0x130/0x154
[    6.656875]  ip_auto_config+0x2ed4/0x2f1c
[    6.660875]  do_one_initcall+0xcc/0x370
[    6.664705]  kernel_init_freeable+0x5f8/0x6c4
[    6.669051]  kernel_init+0x18/0x190
[    6.672534]  ret_from_fork+0x10/0x40
[    6.676098] Memory state around the buggy address:
[    6.680880]  ffff800367efa600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    6.688078]  ffff800367efa680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[    6.695276] >ffff800367efa700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    6.702469]                       ^
[    6.705952]  ffff800367efa780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[    6.713149]  ffff800367efa800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    6.720343] ==================================================================
[    6.727536] Disabling lock debugging due to kernel taint

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27 21:06:53 -07:00
David Lebrun 402a5bc462 ipv6: sr: select DST_CACHE by default
When CONFIG_IPV6_SEG6_LWTUNNEL is selected, automatically select DST_CACHE.
This allows to remove multiple ifdefs.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27 16:05:06 -07:00
David Ahern 4ea8efadf5 net: mpls: Delete route when all nexthops have been deleted
When all devices for all nexthops in a route have been deleted, the
route is effectively dead, so remove it.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27 14:09:50 -07:00
David Ahern c00e51ddbf net: mpls: Don't show nexthop if device has been deleted
If the device for a nexthop in a multipath route is deleted, the nexthop
is effectively removed from the route. Currently, a route dump still
returns the nexhop though without the device set:

$ ip -f mpls ro ls
100
	nexthopvia inet 10.11.1.2  dev br0
	nexthopvia inet 10.100.3.1  dev eth3
$ ip li del br0
$ ip -f mpls ro ls
100
	nexthopvia inet 10.11.1.2  dev * dead linkdown
	nexthopvia inet 10.100.3.1  dev eth3

Since the nexthop is effectively deleted, drop the hop from the route
dump.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27 14:09:50 -07:00
Gao Feng 75c689dca9 netfilter: nf_nat_snmp: Fix panic when snmp_trap_helper fails to register
In the commit 93557f53e1 ("netfilter: nf_conntrack: nf_conntrack snmp
helper"), the snmp_helper is replaced by nf_nat_snmp_hook. So the
snmp_helper is never registered. But it still tries to unregister the
snmp_helper, it could cause the panic.

Now remove the useless snmp_helper and the unregister call in the
error handler.

Fixes: 93557f53e1 ("netfilter: nf_conntrack: nf_conntrack snmp helper")
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-27 13:49:13 +02:00
Liping Zhang 9c3f379492 netfilter: nf_ct_ext: fix possible panic after nf_ct_extend_unregister
If one cpu is doing nf_ct_extend_unregister while another cpu is doing
__nf_ct_ext_add_length, then we may hit BUG_ON(t == NULL). Moreover,
there's no synchronize_rcu invocation after set nf_ct_ext_types[id] to
NULL, so it's possible that we may access invalid pointer.

But actually, most of the ct extends are built-in, so the problem listed
above will not happen. However, there are two exceptions: NF_CT_EXT_NAT
and NF_CT_EXT_SYNPROXY.

For _EXT_NAT, the panic will not happen, since adding the nat extend and
unregistering the nat extend are located in the same file(nf_nat_core.c),
this means that after the nat module is removed, we cannot add the nat
extend too.

For _EXT_SYNPROXY, synproxy extend may be added by init_conntrack, while
synproxy extend unregister will be done by synproxy_core_exit. So after
nf_synproxy_core.ko is removed, we may still try to add the synproxy
extend, then kernel panic may happen.

I know it's very hard to reproduce this issue, but I can play a tricky
game to make it happen very easily :)

Step 1. Enable SYNPROXY for tcp dport 1234 at FORWARD hook:
  # iptables -I FORWARD -p tcp --dport 1234 -j SYNPROXY
Step 2. Queue the syn packet to the userspace at raw table OUTPUT hook.
        Also note, in the userspace we only add a 20s' delay, then
        reinject the syn packet to the kernel:
  # iptables -t raw -I OUTPUT -p tcp --syn -j NFQUEUE --queue-num 1
Step 3. Using "nc 2.2.2.2 1234" to connect the server.
Step 4. Now remove the nf_synproxy_core.ko quickly:
  # iptables -F FORWARD
  # rmmod ipt_SYNPROXY
  # rmmod nf_synproxy_core
Step 5. After 20s' delay, the syn packet is reinjected to the kernel.

Now you will see the panic like this:
  kernel BUG at net/netfilter/nf_conntrack_extend.c:91!
  Call Trace:
   ? __nf_ct_ext_add_length+0x53/0x3c0 [nf_conntrack]
   init_conntrack+0x12b/0x600 [nf_conntrack]
   nf_conntrack_in+0x4cc/0x580 [nf_conntrack]
   ipv4_conntrack_local+0x48/0x50 [nf_conntrack_ipv4]
   nf_reinject+0x104/0x270
   nfqnl_recv_verdict+0x3e1/0x5f9 [nfnetlink_queue]
   ? nfqnl_recv_verdict+0x5/0x5f9 [nfnetlink_queue]
   ? nla_parse+0xa0/0x100
   nfnetlink_rcv_msg+0x175/0x6a9 [nfnetlink]
   [...]

One possible solution is to make NF_CT_EXT_SYNPROXY extend built-in, i.e.
introduce nf_conntrack_synproxy.c and only do ct extend register and
unregister in it, similar to nf_conntrack_timeout.c.

But having such a obscure restriction of nf_ct_extend_unregister is not a
good idea, so we should invoke synchronize_rcu after set nf_ct_ext_types
to NULL, and check the NULL pointer when do __nf_ct_ext_add_length. Then
it will be easier if we add new ct extend in the future.

Last, we use kfree_rcu to free nf_ct_ext, so rcu_barrier() is unnecessary
anymore, remove it too.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-27 13:47:29 +02:00
Liping Zhang 83d90219a5 netfilter: nfnl_cthelper: fix a race when walk the nf_ct_helper_hash table
The nf_ct_helper_hash table is protected by nf_ct_helper_mutex, while
nfct_helper operation is protected by nfnl_lock(NFNL_SUBSYS_CTHELPER).
So it's possible that one CPU is walking the nf_ct_helper_hash for
cthelper add/get/del, another cpu is doing nf_conntrack_helpers_unregister
at the same time. This is dangrous, and may cause use after free error.

Note, delete operation will flush all cthelpers added via nfnetlink, so
using rcu to do protect is not easy.

Now introduce a dummy list to record all the cthelpers added via
nfnetlink, then we can walk the dummy list instead of walking the
nf_ct_helper_hash. Also, keep nfnl_cthelper_dump_table unchanged, it
may be invoked without nfnl_lock(NFNL_SUBSYS_CTHELPER) held.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-27 13:47:29 +02:00
Liping Zhang 3b7dabf029 netfilter: invoke synchronize_rcu after set the _hook_ to NULL
Otherwise, another CPU may access the invalid pointer. For example:
    CPU0                CPU1
     -              rcu_read_lock();
     -              pfunc = _hook_;
  _hook_ = NULL;          -
  mod unload              -
     -                 pfunc(); // invalid, panic
     -             rcu_read_unlock();

So we must call synchronize_rcu() to wait the rcu reader to finish.

Also note, in nf_nat_snmp_basic_fini, synchronize_rcu() will be invoked
by later nf_conntrack_helper_unregister, but I'm inclined to add a
explicit synchronize_rcu after set the nf_nat_snmp_hook to NULL. Depend
on such obscure assumptions is not a good idea.

Last, in nfnetlink_cttimeout, we use kfree_rcu to free the time object,
so in cttimeout_exit, invoking rcu_barrier() is not necessary at all,
remove it too.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-27 13:47:28 +02:00
Linus Lüssing e2d9ba4355 batman-adv: restructure rebroadcast counter into forw_packet API
This patch refactors the num_packets counter of a forw_packet in the
following three ways:

1) Removed dual-use of forw_packet::num_packets:
   -> now for aggregation purposes only
2) Using forw_packet::skb::cb::num_bcasts instead:
   -> for easier access in aggregation code later
3) make access to num_bcasts private to batadv_forw_packet_*()

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
[sven@narfation.org: Change num_bcasts to unsigned]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-26 12:46:44 +02:00
Linus Lüssing 99ba18ef02 batman-adv: privatize forw_packet skb assignment
An skb is assigned to a forw_packet only once, shortly after the
forw_packet allocation.

With this patch the assignment is moved into the this allocation
function.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-26 12:46:44 +02:00
Eric Dumazet 43a6684519 ping: implement proper locking
We got a report of yet another bug in ping

http://www.openwall.com/lists/oss-security/2017/03/24/6

->disconnect() is not called with socket lock held.

Fix this by acquiring ping rwlock earlier.

Thanks to Daniel, Alexander and Andrey for letting us know this problem.

Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Daniel Jiang <danieljiang0415@gmail.com>
Reported-by: Solar Designer <solar@openwall.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 20:50:28 -07:00
Sridhar Samudrala 6d4339028b net: Introduce SO_INCOMING_NAPI_ID
This socket option returns the NAPI ID associated with the queue on which
the last frame is received. This information can be used by the apps to
split the incoming flows among the threads based on the Rx queue on which
they are received.

If the NAPI ID actually represents a sender_cpu then the value is ignored
and 0 is returned.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 20:49:31 -07:00
Sridhar Samudrala 7db6b048da net: Commonize busy polling code to focus on napi_id instead of socket
Move the core functionality in sk_busy_loop() to napi_busy_loop() and
make it independent of sk.

This enables re-using this function in epoll busy loop implementation.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 20:49:31 -07:00
Alexander Duyck 37056719bb net: Track start of busy loop instead of when it should end
This patch flips the logic we were using to determine if the busy polling
has timed out.  The main motivation for this is that we will need to
support two different possible timeout values in the future and by
recording the start time rather than when we would want to end we can focus
on making the end_time specific to the task be it epoll or socket based
polling.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 20:49:31 -07:00
Alexander Duyck 2b5cd0dfa3 net: Change return type of sk_busy_loop from bool to void
checking the return value of sk_busy_loop. As there are only a few
consumers of that data, and the data being checked for can be replaced
with a check for !skb_queue_empty() we might as well just pull the code
out of sk_busy_loop and place it in the spots that actually need it.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 20:49:30 -07:00
Alexander Duyck e5907459ce tcp: Record Rx hash and NAPI ID in tcp_child_process
While working on some recent busy poll changes we found that child sockets
were being instantiated without NAPI ID being set.  In our first attempt to
fix it, it was suggested that we should just pull programming the NAPI ID
into the function itself since all callers will need to have it set.

In addition to the NAPI ID change I have dropped the code that was
populating the Rx hash since it was actually being populated in
tcp_get_cookie_sock.

Reported-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 20:49:30 -07:00
Alexander Duyck 545cd5e5ec net: Busy polling should ignore sender CPUs
This patch is a cleanup/fix for NAPI IDs following the changes that made it
so that sender_cpu and napi_id were doing a better job of sharing the same
location in the sk_buff.

One issue I found is that we weren't validating the napi_id as being valid
before we started trying to setup the busy polling.  This change corrects
that by using the MIN_NAPI_ID value that is now used in both allocating the
NAPI IDs, as well as validating them.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 20:49:30 -07:00
Florian Westphal 28ee1b746f secure_seq: downgrade to per-host timestamp offsets
Unfortunately too many devices (not under our control) use tcp_tw_recycle=1,
which depends on timestamps being identical of the same saddr.

Although tcp_tw_recycle got removed in net-next we can't make
such end hosts disappear so downgrade to per-host timestamp offsets.

Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Reported-by: Yvan Vanrossomme <yvan@vanrossomme.net>
Fixes: 95a22caee3 ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 19:27:44 -07:00
Alexander Duyck 95f2552113 net: Do not allow negative values for busy_read and busy_poll sysctl interfaces
This change basically codifies what I think was already the limitations on
the busy_poll and busy_read sysctl interfaces.  We weren't checking the
lower bounds and as such could input negative values. The behavior when
that was used was dependent on the architecture. In order to prevent any
issues with that I am just disabling support for values less than 0 since
this way we don't have to worry about any odd behaviors.

By limiting the sysctl values this way it also makes it consistent with how
we handle the SO_BUSY_POLL socket option since the value appears to be
reported as a signed integer value and negative values are rejected.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 15:02:13 -07:00
David Lebrun af4a2209b1 ipv6: sr: use dst_cache in seg6_input
We already use dst_cache in seg6_output, when handling locally generated
packets. We extend it in seg6_input, to also handle forwarded packets, and avoid
unnecessary fib lookups.

Performances for SRH encapsulation before the patch:
Result: OK: 5656067(c5655678+d388) usec, 5000000 (1000byte,0frags)
  884006pps 7072Mb/sec (7072048000bps) errors: 0

Performances after the patch:
Result: OK: 4774543(c4774084+d459) usec, 5000000 (1000byte,0frags)
  1047220pps 8377Mb/sec (8377760000bps) errors: 0

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 14:47:32 -07:00
David Lebrun 19d5a26f5e ipv6: sr: expand skb head only if necessary
To insert or encapsulate a packet with an SRH, we need a large enough skb
headroom. Currently, we are using pskb_expand_head to inconditionally increase
the size of the headroom by the amount needed by the SRH (and IPv6 header).
If this reallocation is performed by another CPU than the one that initially
allocated the skb, then when the initial CPU kfree the skb, it will enter the
__slab_free slowpath, impacting performances.

This patch replaces pskb_expand_head with skb_cow_head, that will reallocate the
skb head only if the headroom is not large enough.

Performances for SRH encapsulation before the patch:
Result: OK: 7348320(c7347271+d1048) usec, 5000000 (1000byte,0frags)
  680427pps 5443Mb/sec (5443416000bps) errors: 0

Performances after the patch:
Result: OK: 5656067(c5655678+d388) usec, 5000000 (1000byte,0frags)
  884006pps 7072Mb/sec (7072048000bps) errors: 0

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 14:47:32 -07:00
Geliang Tang 3b1af93cf1 net_sched: use setup_deferrable_timer
Use setup_deferrable_timer() instead of init_timer_deferrable() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 14:42:52 -07:00
Linus Torvalds 59d9cb91d0 A fix for a writeback deadlock caused by a GFP_KERNEL allocation on the
reclaim path, tagged for stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJY1QoTAAoJEEp/3jgCEfOLtnoH/0uj3Y2QUVHDo0Je9HkQhzVC
 Mp2a/WzN1YO/wdYRdxs6LbVqMIukPJ5kKbKO3WIgpJZR9lxSM9k5Y0YC98YVBHuB
 v1uuTPe60fmdJRYnoD4/SfwXiIiAF5/dzPp80SI4xQ9zN24dS1wBREkH2eXUeoKL
 FCEoHUv7cJju1dGNbcGpv4MV75b0e+HHAxPuDG4fiUdT79Vp+wP7exx0lMkS9ub8
 i5T1GU5gDDBR+SKhpqMIvNgR7s3zdInUs476tk6jz/o049BwybkLIuDhx0+x93/g
 1Y1KXHYPCT3yZeL5kdJxM6Dul6wDYOsr2/wIPHzY0ym9RXqCEy/PFowJnL+7eWk=
 =cTM6
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.11-rc4' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A fix for a writeback deadlock caused by a GFP_KERNEL allocation on
  the reclaim path, tagged for stable"

* tag 'ceph-for-4.11-rc4' of git://github.com/ceph/ceph-client:
  libceph: force GFP_NOIO for socket allocations
2017-03-24 14:35:39 -07:00
David Ahern 6a18c31232 net: mpls: Fix setting ttl_propagate for rt2
Fix copy and paste error setting rt_ttl_propagate.

Fixes: 5b441ac878 ("mpls: allow TTL propagation to IP packets to be configured")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 13:29:17 -07:00
Alexey Dobriyan e013fb7c4c net: make in_aton() 32-bit internally
Converting IPv4 address doesn't need 64-bit arithmetic.

Space savings: 10 bytes!

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
	function                          old     new   delta
	in_aton                            96      86     -10

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 13:27:19 -07:00
subashab@codeaurora.org dddb64bcb3 net: Add sysctl to toggle early demux for tcp and udp
Certain system process significant unconnected UDP workload.
It would be preferrable to disable UDP early demux for those systems
and enable it for TCP only.

By disabling UDP demux, we see these slight gains on an ARM64 system-
782 -> 788Mbps unconnected single stream UDPv4
633 -> 654Mbps unconnected UDPv4 different sources

The performance impact can change based on CPU architecure and cache
sizes. There will not much difference seen if entire UDP hash table
is in cache.

Both sysctls are enabled by default to preserve existing behavior.

v1->v2: Change function pointer instead of adding conditional as
suggested by Stephen.

v2->v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.

v3->v4: Store and use read once result instead of querying pointer
again incorrectly.

v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Tom Herbert <tom@herbertland.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 13:17:07 -07:00
WANG Cong a80db69e47 kcm: return immediately after copy_from_user() failure
There is no reason to continue after a copy_from_user()
failure.

Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 13:13:53 -07:00
Jiri Pirko 5952fde10c net: sched: choke: remove dead filter classify code
sch_choke is classless qdisc so it does not define cl_ops. Therefore
filter_list cannot be ever changed, being NULL all the time.
Reason is this check in tc_ctl_tfilter:

	/* Is it classful? */
	cops = q->ops->cl_ops;
	if (!cops)
		return -EINVAL;

So remove this dead code.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 12:47:10 -07:00
Nikolay Aleksandrov eb100e0e24 net: bridge: allow to add externally learned entries from user-space
The NTF_EXT_LEARNED flag was added for switchdev and externally learned
entries, but it can also be used for entries learned via a software
in user-space which requires dynamic entries that do not expire.
One such case that we have is with quagga and evpn which need dynamic
entries but also require to age them themselves.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 12:30:21 -07:00
Nikolay Aleksandrov 7e26bf45e4 net: bridge: allow SW learn to take over HW fdb entries
Allow to take over an entry which was previously learned via HW when it
shows up from a SW port. This is analogous to how HW takes over SW learned
entries already.

Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 12:30:21 -07:00
Alexey Dobriyan d7f6946630 xfrm: use "unsigned int" in __xfrm6_pref_hash()
x86_64 is zero-extending arch so "unsigned int" is preferred over "int"
for address calculations.

Space savings:

	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-58 (-58)
	function                                     old     new   delta
	xfrm_hash_resize                            2752    2743      -9
	policy_hash_bysel                            985     973     -12
	policy_hash_direct                          1036     999     -37

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-03-24 07:03:12 +01:00
Alexey Dobriyan 1560875600 xfrm: remove unused struct xfrm_mgr::id
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-03-24 07:03:12 +01:00
Eric Dumazet 48481c8fa1 net: neigh: guard against NULL solicit() method
Dmitry posted a nice reproducer of a bug triggering in neigh_probe()
when dereferencing a NULL neigh->ops->solicit method.

This can happen for arp_direct_ops/ndisc_direct_ops and similar,
which can be used for NUD_NOARP neighbours (created when dev->header_ops
is NULL). Admin can then force changing nud_state to some other state
that would fire neigh timer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-23 21:28:13 -07:00
Davide Caratti add641e7de sched: act_csum: don't mangle TCP and UDP GSO packets
after act_csum computes the checksum on skbs carrying GSO TCP/UDP packets,
subsequent segmentation fails because skb_needs_check(skb, true) returns
true. Because of that, skb_warn_bad_offload() is invoked and the following
message is displayed:

WARNING: CPU: 3 PID: 28 at net/core/dev.c:2553 skb_warn_bad_offload+0xf0/0xfd
<...>

  [<ffffffff8171f486>] skb_warn_bad_offload+0xf0/0xfd
  [<ffffffff8161304c>] __skb_gso_segment+0xec/0x110
  [<ffffffff8161340d>] validate_xmit_skb+0x12d/0x2b0
  [<ffffffff816135d2>] validate_xmit_skb_list+0x42/0x70
  [<ffffffff8163c560>] sch_direct_xmit+0xd0/0x1b0
  [<ffffffff8163c760>] __qdisc_run+0x120/0x270
  [<ffffffff81613b3d>] __dev_queue_xmit+0x23d/0x690
  [<ffffffff81613fa0>] dev_queue_xmit+0x10/0x20

Since GSO is able to compute checksum on individual segments of such skbs,
we can simply skip mangling the packet.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-23 17:40:33 -07:00
Chenbo Feng 6acc5c2910 Add a eBPF helper function to retrieve socket uid
Returns the owner uid of the socket inside a sk_buff. This is useful to
perform per-UID accounting of network traffic or per-UID packet
filtering. The socket need to be a fullsock otherwise overflowuid is
returned.

Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-23 17:01:02 -07:00
Chenbo Feng 91b8270f2a Add a helper function to get socket cookie in eBPF
Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
with a known socket. Generates a new cookie if one was not yet set.If
the socket pointer inside sk_buff is NULL, 0 is returned. The helper
function coud be useful in monitoring per socket networking traffic
statistics and provide a unique socket identifier per namespace.

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-23 17:01:02 -07:00
David S. Miller 16ae1f2236 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/genet/bcmmii.c
	drivers/net/hyperv/netvsc.c
	kernel/bpf/hashtab.c

Almost entirely overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-23 16:41:27 -07:00
Linus Torvalds f341d9f08a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Several netfilter fixes from Pablo and the crew:
      - Handle fragmented packets properly in netfilter conntrack, from
        Florian Westphal.
      - Fix SCTP ICMP packet handling, from Ying Xue.
      - Fix big-endian bug in nftables, from Liping Zhang.
      - Fix alignment of fake conntrack entry, from Steven Rostedt.

 2) Fix feature flags setting in fjes driver, from Taku Izumi.

 3) Openvswitch ipv6 tunnel source address not set properly, from Or
    Gerlitz.

 4) Fix jumbo MTU handling in amd-xgbe driver, from Thomas Lendacky.

 5) sk->sk_frag.page not released properly in some cases, from Eric
    Dumazet.

 6) Fix RTNL deadlocks in nl80211, from Johannes Berg.

 7) Fix erroneous RTNL lockdep splat in crypto, from Herbert Xu.

 8) Cure improper inflight handling during AF_UNIX GC, from Andrey
    Ulanov.

 9) sch_dsmark doesn't write to packet headers properly, from Eric
    Dumazet.

10) Fix SCM_TIMESTAMPING_OPT_STATS handling in TCP, from Soheil Hassas
    Yeganeh.

11) Add some IDs for Motorola qmi_wwan chips, from Tony Lindgren.

12) Fix nametbl deadlock in tipc, from Ying Xue.

13) GRO and LRO packets not counted correctly in mlx5 driver, from Gal
    Pressman.

14) Fix reset of internal PHYs in bcmgenet, from Doug Berger.

15) Fix hashmap allocation handling, from Alexei Starovoitov.

16) nl_fib_input() needs stronger netlink message length checking, from
    Eric Dumazet.

17) Fix double-free of sk->sk_filter during sock clone, from Daniel
    Borkmann.

18) Fix RX checksum offloading in aquantia driver, from Pavel Belous.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
  net:ethernet:aquantia: Fix for RX checksum offload.
  amd-xgbe: Fix the ECC-related bit position definitions
  sfc: cleanup a condition in efx_udp_tunnel_del()
  Bluetooth: btqcomsmd: fix compile-test dependency
  inet: frag: release spinlock before calling icmp_send()
  tcp: initialize icsk_ack.lrcvtime at session start time
  genetlink: fix counting regression on ctrl_dumpfamily()
  socket, bpf: fix sk_filter use after free in sk_clone_lock
  ipv4: provide stronger user input validation in nl_fib_input()
  bpf: fix hashmap extra_elems logic
  enic: update enic maintainers
  net: bcmgenet: remove bcmgenet_internal_phy_setup()
  ipv6: make sure to initialize sockc.tsflags before first use
  fjes: Do not load fjes driver if extended socket device is not power on.
  fjes: Do not load fjes driver if system does not have extended socket device.
  net/mlx5e: Count LRO packets correctly
  net/mlx5e: Count GSO packets correctly
  net/mlx5: Increase number of max QPs in default profile
  net/mlx5e: Avoid supporting udp tunnel port ndo for VF reps
  net/mlx5e: Use the proper UAPI values when offloading TC vlan actions
  ...
2017-03-23 11:29:49 -07:00
Ilya Dryomov 633ee407b9 libceph: force GFP_NOIO for socket allocations
sock_alloc_inode() allocates socket+inode and socket_wq with
GFP_KERNEL, which is not allowed on the writeback path:

    Workqueue: ceph-msgr con_work [libceph]
    ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
    0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
    ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
    Call Trace:
    [<ffffffff816dd629>] schedule+0x29/0x70
    [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
    [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
    [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
    [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
    [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
    [<ffffffff81086335>] flush_work+0x165/0x250
    [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
    [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
    [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
    [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
    [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
    [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
    [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
    [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
    [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
    [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
    [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
    [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
    [<ffffffff811c0c18>] super_cache_scan+0x178/0x180
    [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
    [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
    [<ffffffff8115af70>] shrink_slab+0x100/0x140
    [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
    [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
    [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
    [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
    [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
    [<ffffffff811a0ac5>] new_slab+0x2c5/0x390
    [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
    [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
    [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
    [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
    [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
    [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
    [<ffffffff811d8566>] alloc_inode+0x26/0xa0
    [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
    [<ffffffff815b933e>] sock_alloc+0x1e/0x80
    [<ffffffff815ba855>] __sock_create+0x95/0x220
    [<ffffffff815baa04>] sock_create_kern+0x24/0x30
    [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
    [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
    [<ffffffff81084c19>] process_one_work+0x159/0x4f0
    [<ffffffff8108561b>] worker_thread+0x11b/0x530
    [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
    [<ffffffff8108b6f9>] kthread+0xc9/0xe0
    [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
    [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
    [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90

Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.

Cc: stable@vger.kernel.org # 3.10+, needs backporting
Link: http://tracker.ceph.com/issues/19309
Reported-by: Sergey Jerusalimov <wintchester@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-03-23 12:03:36 +01:00
Eric Dumazet ec4fbd6475 inet: frag: release spinlock before calling icmp_send()
Dmitry reported a lockdep splat [1] (false positive) that we can fix
by releasing the spinlock before calling icmp_send() from ip_expire()

This is a false positive because sending an ICMP message can not
possibly re-enter the IP frag engine.

[1]
[ INFO: possible circular locking dependency detected ]
4.10.0+ #29 Not tainted
-------------------------------------------------------
modprobe/12392 is trying to acquire lock:
 (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] spin_lock
include/linux/spinlock.h:299 [inline]
 (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] __netif_tx_lock
include/linux/netdevice.h:3486 [inline]
 (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>]
sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180

but task is already holding lock:
 (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock
include/linux/spinlock.h:299 [inline]
 (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>]
ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&(&q->lock)->rlock){+.-...}:
       validate_chain kernel/locking/lockdep.c:2267 [inline]
       __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
       lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
       __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
       _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
       spin_lock include/linux/spinlock.h:299 [inline]
       ip_defrag+0x3a2/0x4130 net/ipv4/ip_fragment.c:669
       ip_check_defrag+0x4e3/0x8b0 net/ipv4/ip_fragment.c:713
       packet_rcv_fanout+0x282/0x800 net/packet/af_packet.c:1459
       deliver_skb net/core/dev.c:1834 [inline]
       dev_queue_xmit_nit+0x294/0xa90 net/core/dev.c:1890
       xmit_one net/core/dev.c:2903 [inline]
       dev_hard_start_xmit+0x16b/0xab0 net/core/dev.c:2923
       sch_direct_xmit+0x31f/0x6d0 net/sched/sch_generic.c:182
       __dev_xmit_skb net/core/dev.c:3092 [inline]
       __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
       neigh_resolve_output+0x6b9/0xb10 net/core/neighbour.c:1308
       neigh_output include/net/neighbour.h:478 [inline]
       ip_finish_output2+0x8b8/0x15a0 net/ipv4/ip_output.c:228
       ip_do_fragment+0x1d93/0x2720 net/ipv4/ip_output.c:672
       ip_fragment.constprop.54+0x145/0x200 net/ipv4/ip_output.c:545
       ip_finish_output+0x82d/0xe10 net/ipv4/ip_output.c:314
       NF_HOOK_COND include/linux/netfilter.h:246 [inline]
       ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
       dst_output include/net/dst.h:486 [inline]
       ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
       ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
       ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
       raw_sendmsg+0x26de/0x3a00 net/ipv4/raw.c:655
       inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       ___sys_sendmsg+0x4a3/0x9f0 net/socket.c:1985
       __sys_sendmmsg+0x25c/0x750 net/socket.c:2075
       SYSC_sendmmsg net/socket.c:2106 [inline]
       SyS_sendmmsg+0x35/0x60 net/socket.c:2101
       do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281
       return_from_SYSCALL_64+0x0/0x7a

-> #0 (_xmit_ETHER#2){+.-...}:
       check_prev_add kernel/locking/lockdep.c:1830 [inline]
       check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940
       validate_chain kernel/locking/lockdep.c:2267 [inline]
       __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
       lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
       __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
       _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
       spin_lock include/linux/spinlock.h:299 [inline]
       __netif_tx_lock include/linux/netdevice.h:3486 [inline]
       sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180
       __dev_xmit_skb net/core/dev.c:3092 [inline]
       __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
       neigh_hh_output include/net/neighbour.h:468 [inline]
       neigh_output include/net/neighbour.h:476 [inline]
       ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228
       ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316
       NF_HOOK_COND include/linux/netfilter.h:246 [inline]
       ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
       dst_output include/net/dst.h:486 [inline]
       ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
       ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
       ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
       icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394
       icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754
       ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239
       call_timer_fn+0x241/0x820 kernel/time/timer.c:1268
       expire_timers kernel/time/timer.c:1307 [inline]
       __run_timers+0x960/0xcf0 kernel/time/timer.c:1601
       run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
       __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
       invoke_softirq kernel/softirq.c:364 [inline]
       irq_exit+0x1cc/0x200 kernel/softirq.c:405
       exiting_irq arch/x86/include/asm/apic.h:657 [inline]
       smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
       apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707
       __read_once_size include/linux/compiler.h:254 [inline]
       atomic_read arch/x86/include/asm/atomic.h:26 [inline]
       rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline]
       __rcu_is_watching kernel/rcu/tree.c:1133 [inline]
       rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147
       rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293
       radix_tree_deref_slot include/linux/radix-tree.h:238 [inline]
       filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335
       do_fault_around mm/memory.c:3231 [inline]
       do_read_fault mm/memory.c:3265 [inline]
       do_fault+0xbd5/0x2080 mm/memory.c:3370
       handle_pte_fault mm/memory.c:3600 [inline]
       __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714
       handle_mm_fault+0x1e2/0x480 mm/memory.c:3751
       __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397
       do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460
       page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&q->lock)->rlock);
                               lock(_xmit_ETHER#2);
                               lock(&(&q->lock)->rlock);
  lock(_xmit_ETHER#2);

 *** DEADLOCK ***

10 locks held by modprobe/12392:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81329758>]
__do_page_fault+0x2b8/0xb60 arch/x86/mm/fault.c:1336
 #1:  (rcu_read_lock){......}, at: [<ffffffff8188cab6>]
filemap_map_pages+0x1e6/0x1570 mm/filemap.c:2324
 #2:  (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>]
spin_lock include/linux/spinlock.h:299 [inline]
 #2:  (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>]
pte_alloc_one_map mm/memory.c:2944 [inline]
 #2:  (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>]
alloc_set_pte+0x13b8/0x1b90 mm/memory.c:3072
 #3:  (((&q->timer))){+.-...}, at: [<ffffffff81627e72>]
lockdep_copy_map include/linux/lockdep.h:175 [inline]
 #3:  (((&q->timer))){+.-...}, at: [<ffffffff81627e72>]
call_timer_fn+0x1c2/0x820 kernel/time/timer.c:1258
 #4:  (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock
include/linux/spinlock.h:299 [inline]
 #4:  (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>]
ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201
 #5:  (rcu_read_lock){......}, at: [<ffffffff8389a633>]
ip_expire+0x1b3/0x6c0 net/ipv4/ip_fragment.c:216
 #6:  (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] spin_trylock
include/linux/spinlock.h:309 [inline]
 #6:  (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_xmit_lock
net/ipv4/icmp.c:219 [inline]
 #6:  (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>]
icmp_send+0x803/0x1c80 net/ipv4/icmp.c:681
 #7:  (rcu_read_lock_bh){......}, at: [<ffffffff838ab9a1>]
ip_finish_output2+0x2c1/0x15a0 net/ipv4/ip_output.c:198
 #8:  (rcu_read_lock_bh){......}, at: [<ffffffff836d1dee>]
__dev_queue_xmit+0x23e/0x1e60 net/core/dev.c:3324
 #9:  (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at:
[<ffffffff836d3a27>] dev_queue_xmit+0x17/0x20 net/core/dev.c:3423

stack backtrace:
CPU: 0 PID: 12392 Comm: modprobe Not tainted 4.10.0+ #29
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x2ee/0x3ef lib/dump_stack.c:52
 print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1204
 check_prev_add kernel/locking/lockdep.c:1830 [inline]
 check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940
 validate_chain kernel/locking/lockdep.c:2267 [inline]
 __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:299 [inline]
 __netif_tx_lock include/linux/netdevice.h:3486 [inline]
 sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180
 __dev_xmit_skb net/core/dev.c:3092 [inline]
 __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
 neigh_hh_output include/net/neighbour.h:468 [inline]
 neigh_output include/net/neighbour.h:476 [inline]
 ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228
 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316
 NF_HOOK_COND include/linux/netfilter.h:246 [inline]
 ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
 dst_output include/net/dst.h:486 [inline]
 ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394
 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754
 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239
 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268
 expire_timers kernel/time/timer.c:1307 [inline]
 __run_timers+0x960/0xcf0 kernel/time/timer.c:1601
 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:657 [inline]
 smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707
RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline]
RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline]
RIP: 0010:rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline]
RIP: 0010:__rcu_is_watching kernel/rcu/tree.c:1133 [inline]
RIP: 0010:rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147
RSP: 0000:ffff8801c391f120 EFLAGS: 00000a03 ORIG_RAX: ffffffffffffff10
RAX: dffffc0000000000 RBX: ffff8801c391f148 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000055edd4374000 RDI: ffff8801dbe1ae0c
RBP: ffff8801c391f1a0 R08: 0000000000000002 R09: 0000000000000000
R10: dffffc0000000000 R11: 0000000000000002 R12: 1ffff10038723e25
R13: ffff8801dbe1ae00 R14: ffff8801c391f680 R15: dffffc0000000000
 </IRQ>
 rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293
 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline]
 filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335
 do_fault_around mm/memory.c:3231 [inline]
 do_read_fault mm/memory.c:3265 [inline]
 do_fault+0xbd5/0x2080 mm/memory.c:3370
 handle_pte_fault mm/memory.c:3600 [inline]
 __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714
 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751
 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397
 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460
 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011
RIP: 0033:0x7f83172f2786
RSP: 002b:00007fffe859ae80 EFLAGS: 00010293
RAX: 000055edd4373040 RBX: 00007f83175111c8 RCX: 000055edd4373238
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f8317510970
RBP: 00007fffe859afd0 R08: 0000000000000009 R09: 0000000000000000
R10: 0000000000000064 R11: 0000000000000000 R12: 000055edd4373040
R13: 0000000000000000 R14: 00007fffe859afe8 R15: 0000000000000000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 15:40:45 -07:00
Eric Dumazet 15bb7745e9 tcp: initialize icsk_ack.lrcvtime at session start time
icsk_ack.lrcvtime has a 0 value at socket creation time.

tcpi_last_data_recv can have bogus value if no payload is ever received.

This patch initializes icsk_ack.lrcvtime for active sessions
in tcp_finish_connect(), and for passive sessions in
tcp_create_openreq_child()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 15:39:42 -07:00
Stanislaw Gruszka 1d2a6a5e4b genetlink: fix counting regression on ctrl_dumpfamily()
Commit 2ae0f17df1 ("genetlink: use idr to track families") replaced

	if (++n < fams_to_skip)
		continue;
into:

	if (n++ < fams_to_skip)
		continue;

This subtle change cause that on retry ctrl_dumpfamily() call we omit
one family that failed to do ctrl_fill_info() on previous call, because
cb->args[0] = n number counts also family that failed to do
ctrl_fill_info().

Patch fixes the problem and avoid confusion in the future just decrease
n counter when ctrl_fill_info() fail.

User visible problem caused by this bug is failure to get access to
some genetlink family i.e. nl80211. However problem is reproducible
only if number of registered genetlink families is big enough to
cause second call of ctrl_dumpfamily().

Cc: Xose Vazquez Perez <xose.vazquez@gmail.com>
Cc: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Fixes: 2ae0f17df1 ("genetlink: use idr to track families")
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 15:38:43 -07:00
Daniel Borkmann a97e50cc4c socket, bpf: fix sk_filter use after free in sk_clone_lock
In sk_clone_lock(), we create a new socket and inherit most of the
parent's members via sock_copy() which memcpy()'s various sections.
Now, in case the parent socket had a BPF socket filter attached,
then newsk->sk_filter points to the same instance as the original
sk->sk_filter.

sk_filter_charge() is then called on the newsk->sk_filter to take a
reference and should that fail due to hitting max optmem, we bail
out and release the newsk instance.

The issue is that commit 278571baca ("net: filter: simplify socket
charging") wrongly combined the dismantle path with the failure path
of xfrm_sk_clone_policy(). This means, even when charging failed, we
call sk_free_unlock_clone() on the newsk, which then still points to
the same sk_filter as the original sk.

Thus, sk_free_unlock_clone() calls into __sk_destruct() eventually
where it tests for present sk_filter and calls sk_filter_uncharge()
on it, which potentially lets sk_omem_alloc wrap around and releases
the eBPF prog and sk_filter structure from the (still intact) parent.

Fix it by making sure that when sk_filter_charge() failed, we reset
newsk->sk_filter back to NULL before passing to sk_free_unlock_clone(),
so that we don't mess with the parents sk_filter.

Only if xfrm_sk_clone_policy() fails, we did reach the point where
either the parent's filter was NULL and as a result newsk's as well
or where we previously had a successful sk_filter_charge(), thus for
that case, we do need sk_filter_uncharge() to release the prior taken
reference on sk_filter.

Fixes: 278571baca ("net: filter: simplify socket charging")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 15:37:04 -07:00
Joel Scherpelz bbea124bc9 net: ipv6: Add sysctl for minimum prefix len acceptable in RIOs.
This commit adds a new sysctl accept_ra_rt_info_min_plen that
defines the minimum acceptable prefix length of Route Information
Options. The new sysctl is intended to be used together with
accept_ra_rt_info_max_plen to configure a range of acceptable
prefix lengths. It is useful to prevent misconfigurations from
unintentionally blackholing too much of the IPv6 address space
(e.g., home routers announcing RIOs for fc00::/7, which is
incorrect).

Signed-off-by: Joel Scherpelz <jscherpelz@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 14:20:54 -07:00
Eric Dumazet c64c0b3cac ipv4: provide stronger user input validation in nl_fib_input()
Alexander reported a KMSAN splat caused by reads of uninitialized
field (tb_id_in) from user provided struct fib_result_nl

It turns out nl_fib_input() sanity tests on user input is a bit
wrong :

User can pretend nlh->nlmsg_len is big enough, but provide
at sendmsg() time a too small buffer.

Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 14:15:49 -07:00
David Ahern a7678c70ef rtnetlink: Add dump all for netconf
Use the rtnl_dump_all to dump all netconf handlers that have been
registered. Allows userspace to send a dump request for PF_UNSPEC
and get all families.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 12:45:17 -07:00
Alexander Potapenko d515684d78 ipv6: make sure to initialize sockc.tsflags before first use
In the case udp_sk(sk)->pending is AF_INET6, udpv6_sendmsg() would
jump to do_append_data, skipping the initialization of sockc.tsflags.
Fix the problem by moving sockc.tsflags initialization earlier.

The bug was detected with KMSAN.

Fixes: c14ac9451c ("sock: enable timestamping using control messages")
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 12:40:22 -07:00
Reshetova, Elena 4c355cdfbb net: convert sk_filter.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 12:06:08 -07:00
Ying Xue 557d054c01 tipc: fix nametbl deadlock at tipc_nametbl_unsubscribe
Until now, tipc_nametbl_unsubscribe() is called at subscriptions
reference count cleanup. Usually the subscriptions cleanup is
called at subscription timeout or at subscription cancel or at
subscriber delete.

We have ignored the possibility of this being called from other
locations, which causes deadlock as we try to grab the
tn->nametbl_lock while holding it already.

   CPU1:                             CPU2:
----------                     ----------------
tipc_nametbl_publish
spin_lock_bh(&tn->nametbl_lock)
tipc_nametbl_insert_publ
tipc_nameseq_insert_publ
tipc_subscrp_report_overlap
tipc_subscrp_get
tipc_subscrp_send_event
                             tipc_close_conn
                             tipc_subscrb_release_cb
                             tipc_subscrb_delete
                             tipc_subscrp_put
tipc_subscrp_put
tipc_subscrp_kref_release
tipc_nametbl_unsubscribe
spin_lock_bh(&tn->nametbl_lock)
<<grab nametbl_lock again>>

   CPU1:                              CPU2:
----------                     ----------------
tipc_nametbl_stop
spin_lock_bh(&tn->nametbl_lock)
tipc_purge_publications
tipc_nameseq_remove_publ
tipc_subscrp_report_overlap
tipc_subscrp_get
tipc_subscrp_send_event
                             tipc_close_conn
                             tipc_subscrb_release_cb
                             tipc_subscrb_delete
                             tipc_subscrp_put
tipc_subscrp_put
tipc_subscrp_kref_release
tipc_nametbl_unsubscribe
spin_lock_bh(&tn->nametbl_lock)
<<grab nametbl_lock again>>

In this commit, we advance the calling of tipc_nametbl_unsubscribe()
from the refcount cleanup to the intended callers.

Fixes: d094c4d5f5 ("tipc: add subscription refcount to avoid invalid delete")
Reported-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 11:59:16 -07:00
Gao Feng cfc62d878f net: tcp: Permit user set TCP_MAXSEG to default value
When user_mss is zero, it means use the default value. But the current
codes don't permit user set TCP_MAXSEG to the default value.
It would return the -EINVAL when val is zero.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 11:45:13 -07:00
andy zhou bef7f7567a Openvswitch: Refactor sample and recirc actions implementation
Added clone_execute() that both the sample and the recirc
action implementation can use.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 11:28:35 -07:00
andy zhou 798c166173 openvswitch: Optimize sample action for the clone use cases
With the introduction of open flow 'clone' action, the OVS user space
can now translate the 'clone' action into kernel datapath 'sample'
action, with 100% probability, to ensure that the clone semantics,
which is that the packet seen by the clone action is the same as the
packet seen by the action after clone, is faithfully carried out
in the datapath.

While the sample action in the datpath has the matching semantics,
its implementation is only optimized for its original use.
Specifically, there are two limitation: First, there is a 3 level of
nesting restriction, enforced at the flow downloading time. This
limit turns out to be too restrictive for the 'clone' use case.
Second, the implementation avoid recursive call only if the sample
action list has a single userspace action.

The main optimization implemented in this series removes the static
nesting limit check, instead, implement the run time recursion limit
check, and recursion avoidance similar to that of the 'recirc' action.
This optimization solve both #1 and #2 issues above.

One related optimization attempts to avoid copying flow key as
long as the actions enclosed does not change the flow key. The
detection is performed only once at the flow downloading time.

Another related optimization is to rewrite the action list
at flow downloading time in order to save the fast path from parsing
the sample action list in its original form repeatedly.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 11:28:35 -07:00
andy zhou 4572ef52a0 openvswitch: Refactor recirc key allocation.
The logic of allocating and copy key for each 'exec_actions_level'
was specific to execute_recirc(). However, future patches will reuse
as well.  Refactor the logic into its own function clone_key().

Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 11:28:35 -07:00
andy zhou 47c697aa2d openvswitch: Deferred fifo API change.
add_deferred_actions() API currently requires actions to be passed in
as a fully encoded netlink message. So far both 'sample' and 'recirc'
actions happens to carry actions as fully encoded netlink messages.
However, this requirement is more restrictive than necessary, future
patch will need to pass in action lists that are not fully encoded
by themselves.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 11:28:34 -07:00
Josh Hunt a2d133b1d4 sock: introduce SO_MEMINFO getsockopt
Allows reading of SK_MEMINFO_VARS via socket option. This way an
application can get all meminfo related information in single socket
option call instead of multiple calls.

Adds helper function, sk_get_meminfo(), and uses that for both
getsockopt and sock_diag_put_meminfo().

Suggested by Eric Dumazet.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 11:18:58 -07:00
Xin Long 581947787e sctp: remove useless err from sctp_association_init
This patch is to remove the unnecessary temporary variable 'err' from
sctp_association_init.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 10:57:52 -07:00
Roopa Prabhu 7b8f7a402d neighbour: fix nlmsg_pid in notifications
neigh notifications today carry pid 0 for nlmsg_pid
in all cases. This patch fixes it to carry calling process
pid when available. Applications (eg. quagga) rely on
nlmsg_pid to ignore notifications generated by their own
netlink operations. This patch follows the routing subsystem
which already sets this correctly.

Reported-by: Vivek Venkatraman <vivek@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 10:48:49 -07:00
Tejun Heo a05d4fd917 cgroup, net_cls: iterate the fds of only the tasks which are being migrated
The net_cls controller controls the classid field of each socket which
is associated with the cgroup.  Because the classid is per-socket
attribute, when a task migrates to another cgroup or the configured
classid of the cgroup changes, the controller needs to walk all
sockets and update the classid value, which was implemented by
3b13758f51 ("cgroups: Allow dynamically changing net_classid").

While the approach is not scalable, migrating tasks which have a lot
of fds attached to them is rare and the cost is born by the ones
initiating the operations.  However, for simplicity, both the
migration and classid config change paths call update_classid() which
scans all fds of all tasks in the target css.  This is an overkill for
the migration path which only needs to cover a much smaller subset of
tasks which are actually getting migrated in.

On cgroup v1, this can lead to unexpected scalability issues when one
tries to migrate a task or process into a net_cls cgroup which already
contains a lot of fds.  Even if the migration traget doesn't have many
to get scanned, update_classid() ends up scanning all fds in the
target cgroup which can be extremely numerous.

Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
even worse.  Before bfc2cf6f61 ("cgroup: call subsys->*attach() only
for subsystems which are actually affected by migration"), cgroup core
would call the ->css_attach callback even for controllers which don't
see actual migration to a different css.

As net_cls is always disabled but still mounted on cgroup v2, whenever
a process is migrated on the cgroup v2 hierarchy, net_cls sees
identity migration from root to root and cgroup core used to call
->css_attach callback for those.  The net_cls ->css_attach ends up
calling update_classid() on the root net_cls css to which all
processes on the system belong to as the controller isn't used.  This
makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
which is horrible and easily leads to noticeable stalls triggering RCU
stall warnings and so on.

The worst symptom is already fixed in upstream by bfc2cf6f61
("cgroup: call subsys->*attach() only for subsystems which are
actually affected by migration"); however, backporting that commit is
too invasive and we want to avoid other cases too.

This patch updates net_cls's cgrp_attach() to iterate fds of only the
processes which are actually getting migrated.  This removes the
surprising migration cost which is dependent on the total number of
fds in the target cgroup.  As this leaves write_classid() the only
user of update_classid(), open-code the helper into write_classid().

Reported-by: David Goode <dgoode@fb.com>
Fixes: 3b13758f51 ("cgroups: Allow dynamically changing net_classid")
Cc: stable@vger.kernel.org # v4.4+
Cc: Nina Schiff <ninasc@fb.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 10:32:46 -07:00
Jeffy Chen f83bf8da11 netfilter: nfnl_cthelper: Fix memory leak
We have memory leaks of nf_conntrack_helper & expect_policy.

Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-22 12:45:32 +01:00
Pablo Neira Ayuso 2c42225755 netfilter: nfnl_cthelper: fix runtime expectation policy updates
We only allow runtime updates of expectation policies for timeout and
maximum number of expectations, otherwise reject the update.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Liping Zhang <zlpnobody@gmail.com>
2017-03-22 12:20:16 +01:00
Andreas Pape a3a5129e12 batman-adv: handle race condition for claims between gateways
Consider the following situation which has been found in a test setup:
Gateway B has claimed client C and gateway A has the same backbone
network as B. C sends a broad- or multicast to B and directly after
this packet decides to send another packet to A due to a better TQ
value. B will forward the broad-/multicast into the backbone as it is
the responsible gw and after that A will claim C as it has been
chosen by C as the best gateway. If it now happens that A claims C
before it has received the broad-/multicast forwarded by B (due to
backbone topology or due to some delay in B when forwarding the
packet) we get a critical situation: in the current code A will
immediately unclaim C when receiving the multicast due to the
roaming client scenario although the position of C has not changed
in the mesh. If this happens the multi-/broadcast forwarded by B
will be sent back into the mesh by A and we have looping packets
until one of the gateways claims C again.
In order to prevent this, unclaiming of a client due to the roaming
client scenario is only done after a certain time is expired after
the last claim of the client. 100 ms are used here, which should be
slow enough for big backbones and slow gateways but fast enough not
to break the roaming client use case.

Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Andreas Pape <apape@phoenixcontact.com>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-22 10:30:53 +01:00
Andreas Pape 4dd72f7360 batman-adv: changed debug messages for easier bla debugging
Some of the bla debug messages are extended and additional messages are
added for easier bla debugging. Some debug messages introduced with the
dat changes in prior patches of this patch series have been changed to
be more compliant to other existing debug messages.

Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Andreas Pape <apape@phoenixcontact.com>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-22 10:30:53 +01:00
Andreas Pape 9e794b6bf4 batman-adv: drop unicast packets from other backbone gw
Additional dropping of unicast packets received from another backbone gw if
the same backbone network before being forwarded to the same backbone again
is necessary. It was observed in a test setup that in rare cases these
frames lead to looping unicast traffic backbone->mesh->backbone.

Signed-off-by: Andreas Pape <apape@phoenixcontact.com>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-22 10:30:53 +01:00
Andreas Pape 9aa5cd79b5 batman-adv: prevent duplication of ARP replies when DAT is used
If none of the backbone gateways in a bla setup has already knowledge of
the mac address searched for in an incoming ARP request from the backbone
an address resolution via the DHT of DAT is started. The gateway can send
several ARP requests to different DHT nodes and therefore can get several
replies. This patch assures that not all of the possible ARP replies are
returned to the backbone by checking the local DAT cache of the gateway.
If there is an entry in the local cache the gateway has already learned
the requested address and there is no need to forward the additional reply
to the backbone.
Furthermore it is checked if this gateway has claimed the source of the ARP
reply and only forwards it to the backbone if it has claimed the source or
if there is no claim at all.

Signed-off-by: Andreas Pape <apape@phoenixcontact.com>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-22 10:30:53 +01:00
Andreas Pape 00311de5fb batman-adv: prevent multiple ARP replies sent by gateways if dat enabled
If dat is enabled it must be made sure that only the backbone gw which has
claimed the remote destination for the ARP request answers the ARP request
directly if the MAC address is known due to the local dat table. This
prevents multiple ARP replies in a common backbone if more than one
gateway already knows the remote mac searched for in the ARP request.

Signed-off-by: Andreas Pape <apape@phoenixcontact.com>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-22 10:30:53 +01:00
Soheil Hassas Yeganeh 4ef1b28694 tcp: mark skbs with SCM_TIMESTAMPING_OPT_STATS
SOF_TIMESTAMPING_OPT_STATS can be enabled and disabled
while packets are collected on the error queue.
So, checking SOF_TIMESTAMPING_OPT_STATS in sk->sk_tsflags
is not enough to safely assume that the skb contains
OPT_STATS data.

Add a bit in sock_exterr_skb to indicate whether the
skb contains opt_stats data.

Fixes: 1c885808e4 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Reported-by: JongHwan Kim <zzoru007@gmail.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 18:44:17 -07:00
Soheil Hassas Yeganeh 8605330aac tcp: fix SCM_TIMESTAMPING_OPT_STATS for normal skbs
__sock_recv_timestamp can be called for both normal skbs (for
receive timestamps) and for skbs on the error queue (for transmit
timestamps).

Commit 1c885808e4
(tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING)
assumes any skb passed to __sock_recv_timestamp are from
the error queue, containing OPT_STATS in the content of the skb.
This results in accessing invalid memory or generating junk
data.

To fix this, set skb->pkt_type to PACKET_OUTGOING for packets
on the error queue. This is safe because on the receive path
on local sockets skb->pkt_type is never set to PACKET_OUTGOING.
With that, copy OPT_STATS from a packet, only if its pkt_type
is PACKET_OUTGOING.

Fixes: 1c885808e4 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Reported-by: JongHwan Kim <zzoru007@gmail.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 18:44:17 -07:00
Xin Long 23bb09cfbe sctp: out_qlen should be updated when pruning unsent queue
This patch is to fix the issue that sctp_prsctp_prune_sent forgot
to update q->out_qlen when removing a chunk from unsent queue.

Fixes: 8dbdf1f5b0 ("sctp: implement prsctp PRIO policy")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 18:32:36 -07:00
Xin Long 486a43db2e sctp: remove temporary variable confirm from sctp_packet_transmit
Commit c86a773c78 ("sctp: add dst_pending_confirm flag") introduced
a temporary variable "confirm" in sctp_packet_transmit.

But it broke the rule that longer lines should be above shorter ones.
Besides, this variable is not necessary, so this patch is to just
remove it and use tp->dst_pending_confirm directly.

Fixes: c86a773c78 ("sctp: add dst_pending_confirm flag")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 18:31:02 -07:00
Eric Dumazet aea92fb2e0 sch_dsmark: fix invalid skb_cow() usage
skb_cow(skb, sizeof(ip header)) is not very helpful in this context.

First we need to use pskb_may_pull() to make sure the ip header
is in skb linear part, then use skb_try_make_writable() to
address clones issues.

Fixes: 4c30719f4f ("[PKT_SCHED] dsmark: handle cloned and non-linear skb's")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 17:21:27 -07:00
Nikolay Aleksandrov bf4e0a3db9 net: ipv4: add support for ECMP hash policy choice
This patch adds support for ECMP hash policy choice via a new sysctl
called fib_multipath_hash_policy and also adds support for L4 hashes.
The current values for fib_multipath_hash_policy are:
 0 - layer 3 (default)
 1 - layer 4
If there's an skb hash already set and it matches the chosen policy then it
will be used instead of being calculated (currently only for L4).
In L3 mode we always calculate the hash due to the ICMP error special
case, the flow dissector's field consistentification should handle the
address order thus we can remove the address reversals.
If the skb is provided we always use it for the hash calculation,
otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 15:27:19 -07:00
Andrey Vagin 88997e4208 net/8021q: create device with all possible features in wanted_features
wanted_features is a set of features which have to be enabled if a
hardware allows that.

Currently when a vlan device is created, its wanted_features is set to
current features of its base device.

The problem is that the base device can get new features and they are
not propagated to vlan-s of this device.

If we look at bonding devices, they doesn't have this problem and this
patch suggests to fix this issue by the same way how it works for bonding
devices.

We meet this problem, when we try to create a vlan device over a bonding
device. When a system are booting, real devices require time to be
initialized, so bonding devices created without slaves, then vlan
devices are created and only then ethernet devices are added to the
bonding device. As a result we have vlan devices with disabled
scatter-gather.

* create a bonding device
  $ ip link add bond0 type bond
  $ ethtool -k bond0 | grep scatter
  scatter-gather: off
	tx-scatter-gather: off [requested on]
	tx-scatter-gather-fraglist: off [requested on]

* create a vlan device
  $ ip link add link bond0 name bond0.10 type vlan id 10
  $ ethtool -k bond0.10 | grep scatter
  scatter-gather: off
	tx-scatter-gather: off
	tx-scatter-gather-fraglist: off

* Add a slave device to bond0
  $ ip link set dev eth0 master bond0

And now we can see that the bond0 device has got the scatter-gather
feature, but the bond0.10 hasn't got it.
[root@laptop linux-task-diag]# ethtool -k bond0 | grep scatter
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: on
[root@laptop linux-task-diag]# ethtool -k bond0.10 | grep scatter
scatter-gather: off
	tx-scatter-gather: off
	tx-scatter-gather-fraglist: off

With this patch the vlan device will get all new features from the
bonding device.

Here is a call trace how features which are set in this patch reach
dev->wanted_features.

register_netdevice
   vlan_dev_init
	...
	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
		       NETIF_F_FRAGLIST | NETIF_F_GSO_SOFTWARE |
		       NETIF_F_HIGHDMA | NETIF_F_SCTP_CRC |
		       NETIF_F_ALL_FCOE;

	dev->features |= dev->hw_features;
	...
    dev->wanted_features = dev->features & dev->hw_features;
    __netdev_update_features(dev);
        vlan_dev_fix_features
	   ...

Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 15:26:26 -07:00
Andrey Ulanov 7df9c24625 net: unix: properly re-increment inflight counter of GC discarded candidates
Dmitry has reported that a BUG_ON() condition in unix_notinflight()
may be triggered by a simple code that forwards unix socket in an
SCM_RIGHTS message.
That is caused by incorrect unix socket GC implementation in unix_gc().

The GC first collects list of candidates, then (a) decrements their
"children's" inflight counter, (b) checks which inflight counters are
now 0, and then (c) increments all inflight counters back.
(a) and (c) are done by calling scan_children() with inc_inflight or
dec_inflight as the second argument.

Commit 6209344f5a ("net: unix: fix inflight counting bug in garbage
collector") changed scan_children() such that it no longer considers
sockets that do not have UNIX_GC_CANDIDATE flag. It also added a block
of code that that unsets this flag _before_ invoking
scan_children(, dec_iflight, ). This may lead to incorrect inflight
counters for some sockets.

This change fixes this bug by changing order of operations:
UNIX_GC_CANDIDATE is now unset only after all inflight counters are
restored to the original state.

  kernel BUG at net/unix/garbage.c:149!
  RIP: 0010:[<ffffffff8717ebf4>]  [<ffffffff8717ebf4>]
  unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
  Call Trace:
   [<ffffffff8716cfbf>] unix_detach_fds.isra.19+0xff/0x170 net/unix/af_unix.c:1487
   [<ffffffff8716f6a9>] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
   [<ffffffff86a90a01>] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
   [<ffffffff86a9808a>] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
   [<ffffffff86a980ea>] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
   [<ffffffff86a98284>] kfree_skb+0x184/0x570 net/core/skbuff.c:705
   [<ffffffff871789d5>] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
   [<ffffffff87179039>] unix_release+0x49/0x90 net/unix/af_unix.c:836
   [<ffffffff86a694b2>] sock_release+0x92/0x1f0 net/socket.c:570
   [<ffffffff86a6962b>] sock_close+0x1b/0x20 net/socket.c:1017
   [<ffffffff81a76b8e>] __fput+0x34e/0x910 fs/file_table.c:208
   [<ffffffff81a771da>] ____fput+0x1a/0x20 fs/file_table.c:244
   [<ffffffff81483ab0>] task_work_run+0x1a0/0x280 kernel/task_work.c:116
   [<     inline     >] exit_task_work include/linux/task_work.h:21
   [<ffffffff8141287a>] do_exit+0x183a/0x2640 kernel/exit.c:828
   [<ffffffff8141383e>] do_group_exit+0x14e/0x420 kernel/exit.c:931
   [<ffffffff814429d3>] get_signal+0x663/0x1880 kernel/signal.c:2307
   [<ffffffff81239b45>] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
   [<ffffffff8100666a>] exit_to_usermode_loop+0x1ea/0x2d0
  arch/x86/entry/common.c:156
   [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
   [<ffffffff81009693>] syscall_return_slowpath+0x4d3/0x570
  arch/x86/entry/common.c:259
   [<ffffffff881478e6>] entry_SYSCALL_64_fastpath+0xc4/0xc6

Link: https://lkml.org/lkml/2017/3/6/252
Signed-off-by: Andrey Ulanov <andreyu@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 15:25:10 -07:00
Peng Tao 380feae0de vsock: cancel packets when failing to connect
Otherwise we'll leave the packets queued until releasing vsock device.
E.g., if guest is slow to start up, resulting ETIMEDOUT on connect, guest
will get the connect requests from failed host sockets.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 14:41:47 -07:00
Peng Tao 073b4f2c50 vsock: add pkt cancel capability
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 14:41:47 -07:00
Peng Tao 36d277bac8 vsock: track pkt owner vsock
So that we can cancel a queued pkt later if necessary.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 14:41:46 -07:00
Herbert Xu 8a0f5ccfb3 crypto: deadlock between crypto_alg_sem/rtnl_mutex/genl_mutex
On Tue, Mar 14, 2017 at 10:44:10AM +0100, Dmitry Vyukov wrote:
>
> Yes, please.
> Disregarding some reports is not a good way long term.

Please try this patch.

---8<---
Subject: netlink: Annotate nlk cb_mutex by protocol

Currently all occurences of nlk->cb_mutex are annotated by lockdep
as a single class.  This causes a false lcokdep cycle involving
genl and crypto_user.

This patch fixes it by dividing cb_mutex into individual classes
based on the netlink protocol.  As genl and crypto_user do not
use the same netlink protocol this breaks the false dependency
loop.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 14:38:15 -07:00
David S. Miller 41e95736b3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your
net-next tree. A couple of new features for nf_tables, and unsorted
cleanups and incremental updates for the Netfilter tree. More
specifically, they are:

1) Allow to check for TCP option presence via nft_exthdr, patch
   from Phil Sutter.

2) Add symmetric hash support to nft_hash, from Laura Garcia Liebana.

3) Use pr_cont() in ebt_log, from Joe Perches.

4) Remove some dead code in arp_tables reported via static analysis
   tool, from Colin Ian King.

5) Consolidate nf_tables expression validation, from Liping Zhang.

6) Consolidate set lookup via nft_set_lookup().

7) Remove unnecessary rcu read lock side in bridge netfilter, from
   Florian Westphal.

8) Remove unused variable in nf_reject_ipv4, from Tahee Yoo.

9) Pass nft_ctx struct to object initialization indirections, from
   Florian Westphal.

10) Add code to integrate conntrack helper into nf_tables, also from
    Florian.

11) Allow to check if interface index or name exists via
    NFTA_FIB_F_PRESENT, from Phil Sutter.

12) Simplify resolve_normal_ct(), from Florian.

13) Use per-limit spinlock in nft_limit and xt_limit, from Liping Zhang.

14) Use rwlock in nft_set_rbtree set, also from Liping Zhang.

15) One patch to remove a useless printk at netns init path in ipvs,
    and several patches to document IPVS knobs.

16) Use refcount_t for reference counter in the Netfilter/IPVS code,
    from Elena Reshetova.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 14:28:08 -07:00
Liping Zhang ae5c682113 netfilter: nfnl_cthelper: fix incorrect helper->expect_class_max
The helper->expect_class_max must be set to the total number of
expect_policy minus 1, since we will use the statement "if (class >
helper->expect_class_max)" to validate the CTA_EXPECT_CLASS attr in
ctnetlink_alloc_expect.

So for compatibility, set the helper->expect_class_max to the
NFCTH_POLICY_SET_NUM attr's value minus 1.

Also: it's invalid when the NFCTH_POLICY_SET_NUM attr's value is zero.
1. this will result "expect_policy = kzalloc(0, GFP_KERNEL);";
2. we cannot set the helper->expect_class_max to a proper value.

So if nla_get_be32(tb[NFCTH_POLICY_SET_NUM]) is zero, report -EINVAL to
the userspace.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-21 15:47:09 +01:00
Reshetova, Elena 4485a841be netfilter: fix the warning on unused refcount variable
net/netfilter/nfnetlink_acct.c: In function 'nfnl_acct_try_del':
net/netfilter/nfnetlink_acct.c:329:15: warning: unused variable 'refcount' [-Wunused-variable]
unsigned int refcount;
             ^

Fixes: b54ab92b84 ("netfilter: refcounter conversions")
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-20 10:49:12 +01:00
Linus Torvalds 8841b5f0cd NFS client fixes for 4.11
Stable Bugfixes:
 - Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
 - Prevent a double free in async nfs4_exchange_id()
 - Squelch a kbuild sparse complaint for xprtrdma
 
 Other Bugfixes:
 - Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
 - Fix a reference leak that causes kernel warnings
 - Make nfs4_cb_sv_ops static to fix a sparse warning
 - Respect a server's max size in CREATE_SESSION
 - Handle errors from nfs4_pnfs_ds_connect
 - Flexfiles layout shouldn't mark devices as unavailable
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAljMQsUACgkQ18tUv7Cl
 QOt0FA//eSieOojEm9uJIxfydrJY2VkPgqg0xmIxLhcMmXi/d4kO9GpS9YeJJZi4
 r5oClq1afhXVR83JmNDCvIYUNwf5/lluckuzXZEYlC3qUbXjQt/Nn/FHfrqW8qXV
 HJy4PVwV+BHnfU6Y7p14zzucGPrMeWZQJO+7mRpboe1jcizHOMdcw+Aim7pr44y6
 BI3QcLPtQGY4CnPOEkpDNuEWtO7iMME3bRJOJ2lOWz5smG0KAQo80OTHGXIe4OqR
 d+gHhoHZ2LbZwdbs+rsjAIMFsFrgXqZmXYbQCZ9SEsr4ysj3PesHPdGFrKXCZCSM
 0MjlEcznGl6ooPDD9tO5Bi047Xhq2TlUWF+FIVYOdFur+7oIcJcnJB7epoYEQ2d2
 6RMvddeKmEgW5Y77myIb3G6jTnk7S8dMq5aAGSyUmKoVhybfw0PGFMbZ2gDEpaTG
 HweeaPmR7J0e+MZBiShTBH2zulFcI1qG3kowu/oKccU9jGi/uA7vkXOSkaxkSzST
 +vS30JwArNOj7OFqhGZbi1YzoK0ixxxXLD4DaEDKKm4mOt7g1Zmb0QgVnGSx1V6X
 Or4Y4xTKn0vCt3e61O9dsBRApBCEVSBJMgYb9Z+LUSdQIKoUj+sQPMzY3iGTefcx
 r7qiUddBZerQ0CZCsRxXk/otJawGCO9XFuSY4CksvlReTeyl1Tc=
 =JY3W
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "We have a handful of stable fixes to fix kernel warnings and other
  bugs that have been around for a while. We've also found a few other
  reference counting bugs and memory leaks since the initial 4.11 pull.

  Stable Bugfixes:
   - Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
   - Prevent a double free in async nfs4_exchange_id()
   - Squelch a kbuild sparse complaint for xprtrdma

  Other Bugfixes:
   - Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
   - Fix a reference leak that causes kernel warnings
   - Make nfs4_cb_sv_ops static to fix a sparse warning
   - Respect a server's max size in CREATE_SESSION
   - Handle errors from nfs4_pnfs_ds_connect
   - Flexfiles layout shouldn't mark devices as unavailable"

* tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  pNFS/flexfiles: never nfs4_mark_deviceid_unavailable
  pNFS: return status from nfs4_pnfs_ds_connect
  NFSv4.1 respect server's max size in CREATE_SESSION
  NFS prevent double free in async nfs4_exchange_id
  nfs: make nfs4_cb_sv_ops static
  xprtrdma: Squelch kbuild sparse complaint
  NFS: fix the fault nrequests decreasing for nfs_inode COPY
  NFSv4: fix a reference leak caused WARNING messages
  nfs4: fix a typo of NFS_ATTR_FATTR_GROUP_NAME
2017-03-17 14:16:22 -07:00
Chuck Lever eed50879d6 xprtrdma: Squelch kbuild sparse complaint
New complaint from kbuild for 4.9.y:

net/sunrpc/xprtrdma/verbs.c:489:19: sparse: incompatible types in
    comparison expression (different type sizes)

verbs.c:
489	max_sge = min(ia->ri_device->attrs.max_sge, RPCRDMA_MAX_SEND_SGES);

I can't reproduce this running sparse here. Likewise, "make W=1
net/sunrpc/xprtrdma/verbs.o" never indicated any issue.

A little poking suggests that because the range of its values is
small, gcc can make the actual width of RPCRDMA_MAX_SEND_SGES
smaller than the width of an unsigned integer.

Fixes: 16f906d66c ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-17 16:05:21 -04:00
Tobias Klauser 13b0ea0f59 batman-adv: Omit unnecessary memset of netdev private data
The memory for netdev_priv is allocated using kzalloc in alloc_netdev
(or alloc_netdev_mq respectively) so there is no need to set it to 0
again.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-17 12:53:35 +01:00
Sven Eckelmann 4e09991af2 batman-adv: Use __func__ to add function names to messages
The name of the function might change in which these messages are printed.
It is therefore better to let the compiler handle the insertion of the
correct function name.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-17 12:53:35 +01:00
Reshetova, Elena b54ab92b84 netfilter: refcounter conversions
refcount_t type and corresponding API (see include/linux/refcount.h)
should be used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-17 12:49:43 +01:00
Johannes Berg b6ecfd469e cfg80211: preserve wdev ID across netns changes
When a wdev changes network namespace, its wdev ID will get
reassigned since NETDEV_REGISTER is called again, in the new
network namespace. Avoid that by checking if it was already
assigned before, and document why we do that.

Reported-and-tested-by: Arend Van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-17 11:39:55 +01:00
Eric Dumazet db7f00b8db tcp: tcp_get_info() should read tcp_time_stamp later
Commit b369e7fd41 ("tcp: make TCP_INFO more consistent") moved
lock_sock_fast() earlier in tcp_get_info()

This has the minor effect that jiffies value being sampled at the
beginning of tcp_get_info() is more likely to be off by one, and we
report big tcpi_last_data_sent values (like 0xFFFFFFFF).

Since we lock the socket, fetching tcp_time_stamp right before
doing the jiffies_to_msecs() calls is enough to remove these
wrong values.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 21:37:13 -07:00
WANG Cong d12c917691 bridge: resolve a false alarm of lockdep
Andrei reported a false alarm of lockdep at net/bridge/br_fdb.c:109,
this is because in Andrei's case, a spin_bug() was already triggered
before this, therefore the debug_locks is turned off, lockdep_is_held()
is no longer accurate after that. We should use lockdep_assert_held_once()
instead of lockdep_is_held() to respect debug_locks.

Fixes: 410b3d48f5 ("bridge: fdb: add proper lock checks in searching functions")
Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 21:29:20 -07:00
David Howells 4d4a6ac73e rxrpc: Ignore BUSY packets on old calls
If we receive a BUSY packet for a call we think we've just completed, the
packet is handed off to the connection processor to deal with - but the
connection processor doesn't expect a BUSY packet and so flags a protocol
error.

Fix this by simply ignoring the BUSY packet for the moment.

The symptom of this may appear as a system call failing with EPROTO.  This
may be triggered by pressing ctrl-C under some circumstances.

This comes about we abort calls due to interruption by a signal (which we
shouldn't do, but that's going to be a large fix and mostly in fs/afs/).
What happens is that we abort the call and may also abort follow up calls
too (this needs offloading somehoe).  So we see a transmission of something
like the following sequence of packets:

	DATA for call N
	ABORT call N
	DATA for call N+1
	ABORT call N+1

in very quick succession on the same channel.  However, the peer may have
deferred the processing of the ABORT from the call N to a background thread
and thus sees the DATA message from the call N+1 coming in before it has
cleared the channel.  Thus it sends a BUSY packet[*].

[*] Note that some implementations (OpenAFS, for example) mark the BUSY
    packet with one plus the callNumber of the call prior to call N.
    Ordinarily, this would be call N, but there's no requirement for the
    calls on a channel to be numbered strictly sequentially (the number is
    required to increase).

    This is wrong and means that the callNumber in the BUSY packet should
    be ignored (it really ought to be N+1 since that's what it's in
    response to).

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 21:27:57 -07:00
David Ahern 4ee39733fb net: ipv6: set route type for anycast routes
Anycast routes have the RTF_ANYCAST flag set, but when dumping routes
for userspace the route type is not set to RTN_ANYCAST. Make it so.

Fixes: 58c4fb86ea ("[IPV6]: Flag RTF_ANYCAST for anycast routes")
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:40:14 -07:00
Soheil Hassas Yeganeh 4396e46187 tcp: remove tcp_tw_recycle
The tcp_tw_recycle was already broken for connections
behind NAT, since the per-destination timestamp is not
monotonically increasing for multiple machines behind
a single destination address.

After the randomization of TCP timestamp offsets
in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
for each connection), the tcp_tw_recycle is broken for all
types of connections for the same reason: the timestamps
received from a single machine is not monotonically increasing,
anymore.

Remove tcp_tw_recycle, since it is not functional. Also, remove
the PAWSPassive SNMP counter since it is only used for
tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
since the strict argument is only set when tcp_tw_recycle is
enabled.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Lutz Vieweg <lvml@5t9.de>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:33:56 -07:00
Soheil Hassas Yeganeh d82bae12dc tcp: remove per-destination timestamp cache
Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection)
randomizes TCP timestamps per connection. After this commit,
there is no guarantee that the timestamps received from the
same destination are monotonically increasing. As a result,
the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts
in struct tcp_metrics_block) is broken and cannot be relied upon.

Remove the per-destination timestamp cache and all related code
paths.

Note that this cache was already broken for caching timestamps of
multiple machines behind a NAT sharing the same address.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Lutz Vieweg <lvml@5t9.de>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:33:56 -07:00
chun Long be7164cd57 tcp_westwood: fix tcp_westwood_info() style mistakes
replace comma to semi colons in tcp_westwood_info().
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:23:28 -07:00
David Ahern 61733c91c4 net: mpls: Fix nexthop alive tracking on down events
Alive tracking of nexthops can account for a link twice if the carrier
goes down followed by an admin down of the same link rendering multipath
routes useless. This is similar to 79099aab38 for UNREGISTER events and
DOWN events.

Fix by tracking number of alive nexthops in mpls_ifdown similar to the
logic in mpls_ifup. Checking the flags per nexthop once after all events
have been processed is simpler than trying to maintian a running count
through all event combinations.

Also, WRITE_ONCE is used instead of ACCESS_ONCE to set rt_nhn_alive
per a comment from checkpatch:
    WARNING: Prefer WRITE_ONCE(<FOO>, <BAR>) over ACCESS_ONCE(<FOO>) = <BAR>

Fixes: c89359a42e ("mpls: support for dead routes")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:22:18 -07:00
Nik Unger 5080f39e8c netem: apply correct delay when rate throttling
I recently reported on the netem list that iperf network benchmarks
show unexpected results when a bandwidth throttling rate has been
configured for netem. Specifically:

1) The measured link bandwidth *increases* when a higher delay is added
2) The measured link bandwidth appears higher than the specified limit
3) The measured link bandwidth for the same very slow settings varies significantly across
  machines

The issue can be reproduced by using tc to configure netem with a
512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
veth pair between network namespaces, and then using iperf (or any
other network benchmarking tool) to test throughput. Complete detailed
instructions are in the original email chain here:
https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html

There appear to be two underlying bugs causing these effects:

- The first issue causes long delays when the rate is slow and no
  delay is configured (e.g., "rate 512kbit"). This is because SKBs are
  not orphaned when no delay is configured, so orphaning does not
  occur until *after* the rate-induced delay has been applied. For
  this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
  dramatically increases the measured bandwidth.

- The second issue is that rate-induced delays are not correctly
  applied, allowing SKB delays to occur in parallel. The indended
  approach is to compute the delay for an SKB and to add this delay to
  the end of the current queue. However, the code does not detect
  existing SKBs in the queue due to improperly testing sch->q.qlen,
  which is nonzero even when packets exist only in the
  rbtree. Consequently, new SKBs do not wait for the current queue to
  empty. When packet delays vary significantly (e.g., if packet sizes
  are different), then this also causes unintended reordering.

I modified the code to expect a delay (and orphan the SKB) when a rate
is configured. I also added some defensive tests that correctly find
the latest scheduled delivery time, even if it is (unexpectedly) for a
packet in sch->q. I have tested these changes on the latest kernel
(4.11.0-rc1+) and the iperf / ping test results are as expected.

Signed-off-by: Nik Unger <njunger@uwaterloo.ca>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:14:06 -07:00
Sven Eckelmann f7a2bd6544 batman-adv: Convert BATADV_PRINT_VID macro to function
The BATADV_PRINT_VID is not free of of possible side-effects. This can be
avoided when the the macro is converted to a simple inline function.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-16 21:14:53 +01:00
Sven Eckelmann a09c94d07b batman-adv: Fix possible side-effects in _batadv_dbg
An argument of a macro should not be evaluated multiple times. Otherwise
embedded operations in these arguments will be executed multiple times.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-16 21:14:52 +01:00
Sven Eckelmann 1fda4c0ac0 batman-adv: Fix unbalanced braces around else statement
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-16 21:14:51 +01:00
Sven Eckelmann c1bacea053 batman-adv: Reduce preprocessor checks in multicast.c
It is not necessary to disable these code sections in case other kernel
features are disabled. Instead the IS_ENABLED tests can be added directly
in the code and the compiler can remove the unnecessary code parts during
its optimization run.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-16 21:14:50 +01:00
Simon Wunderlich b96a898188 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-16 21:14:50 +01:00
David S. Miller b124f41332 Here are two batman-adv bugfixes:
- Keep fragments equally sized, avoids some problems with too small fragments,
    by Sven Eckelmann
 
  - Initialize gateway class correctly when BATMAN V is compiled in,
    by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAljKiocWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoSYaD/4o3vSRQbiu95rzaPZbZ8cfFRYz
 FYS7b/SLV8Q8uxzCo05jimnYB9qGD8bZZlAPNhNZLqkSsTXc8vqfwPnyB9ytasrp
 A8jefoMlO0A5w3AQxBErzg3wrYr5Gz6lqkgqZTv06Sb6xa7F/OEo9pMjdbCBX15D
 G9y1FZdLZVhdVE8r9yUohbfcnZ4KkfBlViEXnjAcJoV2x4OOVGQjDdBGpaEQnwyk
 wyGEFuIO2oWW6AlQXCbXomTUZ6bAzJ43lPbRDWTVdrVJg2RV65F4MsNDfLpXVBQV
 GvmB33WVTzphQshxvhnjs2uWglnj8KGVcvpIp7GuRfGgpFJh9tbS2B0bITXCVxWU
 +HkNja6QwObwhEvAWfINaTT9RR1kxWAQk6YB9VEq3RJoHufP49felOZ877oAkK8T
 7CwazIe4rOhA+ISLUcUFk5W9B46DusPRyfhzrDZ6EM0P/ppGWpKAioawltLqX2P3
 kBmIW8kB6YRHnTeCHJZTjdPdQkYeIAYVunnhLvRrmTbf0oyVh3+UnwgAVquwJj/5
 WsEqNPsP0B++nsWmT+MPVOKj/JwcmtKfaDY2sor72aN6TZxq7MF9VGdcZiOu9neB
 zILNcb0oIvUahb2SO2u/gH2E/4am+ED5StQmNxHdJf1A+iCYGXEy/yETgeIm51/S
 TaHd1aH3VwmCR0sAKw==
 =zGdm
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20170316' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are two batman-adv bugfixes:

 - Keep fragments equally sized, avoids some problems with too small fragments,
   by Sven Eckelmann

 - Initialize gateway class correctly when BATMAN V is compiled in,
   by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 12:05:38 -07:00
Or Gerlitz a5e6a3b022 net/sched: fq_codel: Avoid set-but-unused variable
The code introduced by commit 2ccccf5fb4 ("net_sched: update
hierarchical backlog too") only sets prev_backlog in fq_codel_dequeue()
but not using that anywhere, remove that setting.

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 12:02:14 -07:00
Or Gerlitz 4dba87b073 net/sched: act_ife: Staticfy find_decode_metaid()
As it's used only on that file.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 12:02:14 -07:00
Kris Murphy 8f3dbfd79e openvswitch: Add missing case OVS_TUNNEL_KEY_ATTR_PAD
Added a case for OVS_TUNNEL_KEY_ATTR_PAD to the switch statement
in ip_tun_from_nlattr in order to prevent the default case
returning an error.

Fixes: b46f6ded90 ("libnl: nla_put_be64(): align on a 64-bit area")
Signed-off-by: Kris Murphy <kriskend@linux.vnet.ibm.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 11:59:46 -07:00
Ido Schimmel 5d7bfd1419 ipv4: fib_rules: Dump FIB rules when registering FIB notifier
In commit c3852ef7f2 ("ipv4: fib: Replay events when registering FIB
notifier") we dumped the FIB tables and replayed the events to the
passed notification block.

However, we merely sent a RULE_ADD notification in case custom rules
were in use. As explained in previous patches, this approach won't work
anymore. Instead, we should notify the caller about all the FIB rules
and let it act accordingly.

Upon registration to the FIB notification chain, replay a RULE_ADD
notification for each programmed FIB rule, custom or not. The integrity
of the dump is ensured by the mechanism introduced in the above
mentioned commit.

Prevent regressions by making sure current listeners correctly sanitize
the notified rules.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 10:18:34 -07:00
Ido Schimmel 6a003a5ff2 ipv4: fib_rules: Add notifier info to FIB rules notifications
Whenever a FIB rule is added or removed, a notification is sent in the
FIB notification chain. However, listeners don't have a way to tell
which rule was added or removed.

This is problematic as we would like to give listeners the ability to
decide which action to execute based on the notified rule. Specifically,
offloading drivers should be able to determine if they support the
reflection of the notified FIB rule and flush their LPM tables in case
they don't.

Do that by adding a notifier info to these notifications and embed the
common FIB rule struct in it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 10:18:33 -07:00
Ido Schimmel 3c71006d15 ipv4: fib_rules: Check if rule is a default rule
Currently, when non-default (custom) FIB rules are used, devices capable
of layer 3 offloading flush their tables and let the kernel do the
forwarding instead.

When these devices' drivers are loaded they register to the FIB
notification chain, which lets them know about the existence of any
custom FIB rules. This is done by sending a RULE_ADD notification based
on the value of 'net->ipv4.fib_has_custom_rules'.

This approach is problematic when VRF offload is taken into account, as
upon the creation of the first VRF netdev, a l3mdev rule is programmed
to direct skbs to the VRF's table.

Instead of merely reading the above value and sending a single RULE_ADD
notification, we should iterate over all the FIB rules and send a
detailed notification for each, thereby allowing offloading drivers to
sanitize the rules they don't support and potentially flush their
tables.

While l3mdev rules are uniquely marked, the default rules are not.
Therefore, when they are being notified they might invoke offloading
drivers to unnecessarily flush their tables.

Solve this by adding an helper to check if a FIB rule is a default rule.
Namely, its selector should match all packets and its action should
point to the local, main or default tables.

As noted by David Ahern, uniquely marking the default rules is
insufficient. When using VRFs, it's common to avoid false hits by moving
the rule for the local table to just before the main table:

Default configuration:
$ ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

Common configuration with VRFs:
$ ip rule show
1000:   from all lookup [l3mdev-table]
32765:  from all lookup local
32766:  from all lookup main
32767:  from all lookup default

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 10:18:33 -07:00
Cong Wang 864e91ca98 ipvs: remove an annoying printk in netns init
At most it is used for debugging purpose, but I don't think
it is even useful for debugging, just remove it.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
2017-03-16 13:33:39 +01:00
Masashi Honma 335d534938 nl80211: Use signed function for a signed variable
The rssi_threshold is defined as s32.

Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-16 10:54:41 +01:00
Ondřej Lysoněk 5c19dfbe96 mac80211: Use setup_timer instead of init_timer for mesh path
Use setup_timer() and setup_deferrable_timer() to set the data and
function timer fields. It makes the code cleaner and will allow for
easier change of the timer struct internals.

Signed-off-by: Ondřej Lysoněk <ondrej.lysonek@seznam.cz>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <linux-wireless@vger.kernel.org>
Cc: <netdev@vger.kernel.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-16 10:54:04 +01:00
Johannes Berg ea90e0dc8c nl80211: fix dumpit error path RTNL deadlocks
Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out
to be perfectly accurate - there are various error paths that miss unlock
of the RTNL.

To fix those, change the locking a bit to not be conditional in all those
nl80211_prepare_*_dump() functions, but make those require the RTNL to
start with, and fix the buggy error paths. This also let me use sparse
(by appropriately overriding the rtnl_lock/rtnl_unlock functions) to
validate the changes.

Cc: stable@vger.kernel.org
Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-16 10:30:03 +01:00
Eric Dumazet 22a0e18eac net: properly release sk_frag.page
I mistakenly added the code to release sk->sk_frag in
sk_common_release() instead of sk_destruct()

TCP sockets using sk->sk_allocation == GFP_ATOMIC do no call
sk_common_release() at close time, thus leaking one (order-3) page.

iSCSI is using such sockets.

Fixes: 5640f76858 ("net: use a per task frag allocator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 15:37:45 -07:00
Vivien Didelot 0f3da6afee net: dsa: check out-of-range ageing time value
If a DSA switch driver cannot program an ageing time value due to it
being out-of-range, switchdev will raise a stack trace before failing.

To fix this, add ageing_time_min and ageing_time_max members to the
dsa_switch in order for the switch drivers to optionally specify their
supported ageing time limits.

The DSA core will now check for provided ageing time limits and return
-ERANGE from the switchdev prepare phase if the value is out-of-range.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 15:34:13 -07:00
Vivien Didelot e893de1ba1 net: dsa: dsa_fastest_ageing_time return unsigned
The ageing time is defined as unsigned int, so make
dsa_fastest_ageing_time return an unsigned int instead of int.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 15:34:13 -07:00
Amritha Nambiar 56f36acd21 mqprio: Modify mqprio to pass user parameters via ndo_setup_tc.
The configurable priority to traffic class mapping and the user specified
queue ranges are used to configure the traffic class, overriding the
hardware defaults when the 'hw' option is set to 0. However, when the 'hw'
option is non-zero, the hardware QOS defaults are used.

This patch makes it so that we can pass the data the user provided to
ndo_setup_tc. This allows us to pull in the queue configuration if the
user requested it as well as any additional hardware offload type
requested by using a value other than 1 for the hw value.

Finally it also provides a means for the device driver to return the level
supported for the offload type via the qopt->hw value. Previously we were
just always assuming the value to be 1, in the future values beyond just 1
may be supported.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 15:20:27 -07:00
Alexander Duyck 2026fecf51 mqprio: Change handling of hw u8 to allow for multiple hardware offload modes
This patch is meant to allow for support of multiple hardware offload type
for a single device. There is currently no bounds checking for the hw
member of the mqprio_qopt structure.  This results in us being able to pass
values from 1 to 255 with all being treated the same.  On retreiving the
value it is returned as 1 for anything 1 or greater being set.

With this change we are currently adding limited bounds checking by
defining an enum and using those values to limit the reported hardware
offloads.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 15:20:27 -07:00
David S. Miller e11607aad5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree, a
rather large batch of fixes targeted to nf_tables, conntrack and bridge
netfilter. More specifically, they are:

1) Don't track fragmented packets if the socket option IP_NODEFRAG is set.
   From Florian Westphal.

2) SCTP protocol tracker assumes that ICMP error messages contain the
   checksum field, what results in packet drops. From Ying Xue.

3) Fix inconsistent handling of AH traffic from nf_tables.

4) Fix new bitmap set representation with big endian. Fix mismatches in
   nf_tables due to incorrect big endian handling too. Both patches
   from Liping Zhang.

5) Bridge netfilter doesn't honor maximum fragment size field, cap to
   largest fragment seen. From Florian Westphal.

6) Fake conntrack entry needs to be aligned to 8 bytes since the 3 LSB
   bits are now used to store the ctinfo. From Steven Rostedt.

7) Fix element comments with the bitmap set type. Revert the flush
   field in the nft_set_iter structure, not required anymore after
   fixing up element comments.

8) Missing error on invalid conntrack direction from nft_ct, also from
   Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 15:13:13 -07:00
Or Gerlitz 3d20f1f7bd net/openvswitch: Set the ipv6 source tunnel key address attribute correctly
When dealing with ipv6 source tunnel key address attribute
(OVS_TUNNEL_KEY_ATTR_IPV6_SRC) we are wrongly setting the tunnel
dst ip, fix that.

Fixes: 6b26ba3a7d ('openvswitch: netlink attributes for IPv6 tunneling')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 15:09:09 -07:00
David S. Miller 101c431492 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/genet/bcmgenet.c
	net/core/sock.c

Conflicts were overlapping changes in bcmgenet and the
lockdep handling of sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 11:59:10 -07:00
Liping Zhang 4494dbc6de netfilter: nft_ct: do cleanup work when NFTA_CT_DIRECTION is invalid
We should jump to invoke __nft_ct_set_destroy() instead of just
return error.

Fixes: edee4f1e92 ("netfilter: nft_ct: add zone id set support")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-15 17:15:54 +01:00
Linus Torvalds ae50dfd616 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
    from Steffen Klassert.

 2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
    Alexei Starovoitov.

 3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from
    Michal Schmidt.

 4) We can get a divide by zero when TCP socket are morphed into
    listening state, fix from Eric Dumazet.

 5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
    skb_complete_tx_timestamp(). From Eric Dumazet.

 6) Use after free in dccp_feat_activate_values(), also from Eric
    Dumazet.

 7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
    Jarod Wilson.

 8) Fix use after free in vrf_xmit(), from David Ahern.

 9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
    Alexey Kodanev.

10) Properly check napi_complete_done() return value in order to decide
    whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
    Lendacky.

11) Fix double free of hwmon device in marvell phy driver, from Andrew
    Lunn.

12) Don't crash on malformed netlink attributes in act_connmark, from
    Etienne Noss.

13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
    from Sabrina Dubroca.

14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
    Florian Westphal.

15) Fix routing redirect races in dccp and tcp, basically the ICMP
    handler can't modify the socket's cached route in it's locked by the
    user at this moment. From Jon Maxwell.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
  qed: Enable iSCSI Out-of-Order
  qed: Correct out-of-bound access in OOO history
  qed: Fix interrupt flags on Rx LL2
  qed: Free previous connections when releasing iSCSI
  qed: Fix mapping leak on LL2 rx flow
  qed: Prevent creation of too-big u32-chains
  qed: Align CIDs according to DORQ requirement
  mlxsw: reg: Fix SPVMLR max record count
  mlxsw: reg: Fix SPVM max record count
  net: Resend IGMP memberships upon peer notification.
  dccp: fix memory leak during tear-down of unsuccessful connection request
  tun: fix premature POLLOUT notification on tun devices
  dccp/tcp: fix routing redirect race
  ucc/hdlc: fix two little issue
  vxlan: fix ovs support
  net: use net->count to check whether a netns is alive or not
  bridge: drop netfilter fake rtable unconditionally
  ipv6: avoid write to a possibly cloned skb
  net: wimax/i2400m: fix NULL-deref at probe
  isdn/gigaset: fix NULL-deref at probe
  ...
2017-03-14 21:31:23 -07:00
Vlad Yasevich 37c343b4f4 net: Resend IGMP memberships upon peer notification.
When we notify peers of potential changes,  it's also good to update
IGMP memberships.  For example, during VM migration, updating IGMP
memberships will redirect existing multicast streams to the VM at the
new location.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-14 11:33:44 -07:00
Roopa Prabhu 942c56ad07 lwtunnel: remove unused but set variable
silences the below warning:
net/core/lwtunnel.c: In function ‘lwtunnel_valid_encap_type_attr’:
net/core/lwtunnel.c:165:17: warning: variable ‘nla’ set but not used
[-Wunused-but-set-variable]

Fixes: 9ed59592e3 ("lwtunnel: fix autoload of lwt modules")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 23:55:45 -07:00
Zhu Yanjun 569f41d187 rds: ib: unmap the scatter/gather list when error
When some errors occur, the scatter/gather list mapped to DMA addresses
should be handled.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 23:20:05 -07:00
Zhu Yanjun edd08f96db rds: ib: add the static type to the function
The function rds_ib_map_fmr is used only in the ib_fmr.c
file. As such, the static type is added to limit it in this file.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 23:20:05 -07:00
Zhu Yanjun ea69c88364 rds: ib: remove redundant ib_dealloc_fmr
The function ib_dealloc_fmr will never be called. As such, it should
be removed.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 23:20:05 -07:00
Zhu Yanjun b418c5276a rds: ib: drop unnecessary rdma_reject
When rdma_accept fails, rdma_reject is called in it. As such, it is
not necessary to execute rdma_reject again.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 23:20:05 -07:00
Hannes Frederic Sowa 72ef9c4125 dccp: fix memory leak during tear-down of unsuccessful connection request
This patch fixes a memory leak, which happens if the connection request
is not fulfilled between parsing the DCCP options and handling the SYN
(because e.g. the backlog is full), because we forgot to free the
list of ack vectors.

Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 22:00:42 -07:00
Jon Maxwell 45caeaa5ac dccp/tcp: fix routing redirect race
As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320610 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 21:55:47 -07:00
Andrey Vagin 91864f5852 net: use net->count to check whether a netns is alive or not
The previous idea was to check whether a net namespace is in
net_exit_list or not. It doesn't work, because net->exit_list is used in
__register_pernet_operations and __unregister_pernet_operations where
all namespaces are added to a temporary list to make cleanup in a error
case, so list_empty(&net->exit_list) always returns false.

Reported-by: Mantas Mikulėnas <grawity@gmail.com>
Fixes: 002d8a1a6c ("net: skip genenerating uevents for network namespaces that are exiting")
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 16:02:27 -07:00
Francois Romieu c55fa3cccb atm: remove an unnecessary loop
Andrey reported this kernel warning:

WARNING: CPU: 0 PID: 4114 at kernel/sched/core.c:7737 __might_sleep+0x149/0x1a0
do not call blocking ops when !TASK_RUNNING; state=1 set at
[<ffffffff813fcb22>] prepare_to_wait+0x182/0x530

The deeply nested alloc_skb is a problem.

Diagnosis: nesting is wrong. It makes zero sense. Fix it and the
implicit task state change problem automagically goes away.

alloc_skb() does not need to be in the "while" loop.

alloc_skb() does not need to be in the {prepare_to_wait/add_wait_queue ...
finish_wait/remove_wait_queue} block.

I claim that:
- alloc_tx() should only perform the "wait_for_decent_tx_drain" part
- alloc_skb() ought to be done directly in vcc_sendmsg
- alloc_skb() failure can be handled gracefully in vcc_sendmsg
- alloc_skb() may use a (m->msg_flags & MSG_DONTWAIT) dependent
  GFP_{KERNEL / ATOMIC} flag

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-and-Tested-by: Chas Williams <3chas3@gmail.com>
Signed-off-by: Chas Williams <3chas3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 15:37:29 -07:00
Robert Shearman a59166e470 mpls: allow TTL propagation from IP packets to be configured
Allow TTL propagation from IP packets to MPLS packets to be
configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which
allows the TTL to be set in the resulting MPLS packet, with the value
of 0 having the semantics of enabling propagation of the TTL from the
IP header (i.e. non-zero values disable propagation).

Also allow the configuration to be overridden globally by reusing the
same sysctl to control whether the TTL is propagated from IP packets
into the MPLS header. If the per-LWT attribute is set then it
overrides the global configuration. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl,
"net.mpls.default_ttl". This is kept separate from the configuration
of whether IP TTL propagation is enabled as it can be used in the
future when non-IP payloads are supported (i.e. where there is no
payload TTL that can be propagated).

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 15:29:22 -07:00
Robert Shearman 5b441ac878 mpls: allow TTL propagation to IP packets to be configured
Provide the ability to control on a per-route basis whether the TTL
value from an MPLS packet is propagated to an IPv4/IPv6 packet when
the last label is popped as per the theoretical model in RFC 3443
through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to
mean disable propagation and 1 to mean enable propagation.

In order to provide the ability to change the behaviour for packets
arriving with IPv4/IPv6 Explicit Null labels and to provide an easy
way for a user to change the behaviour for all existing routes without
having to reprogram them, a global knob is provided. This is done
through the addition of a new per-namespace sysctl,
"net.mpls.ip_ttl_propagate", which defaults to enabled. If the
per-route attribute is set (either enabled or disabled) then it
overrides the global configuration.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 15:29:22 -07:00
Florian Westphal a13b2082ec bridge: drop netfilter fake rtable unconditionally
Andreas reports kernel oops during rmmod of the br_netfilter module.
Hannes debugged the oops down to a NULL rt6info->rt6i_indev.

Problem is that br_netfilter has the nasty concept of adding a fake
rtable to skb->dst; this happens in a br_netfilter prerouting hook.

A second hook (in bridge LOCAL_IN) is supposed to remove these again
before the skb is handed up the stack.

However, on module unload hooks get unregistered which means an
skb could traverse the prerouting hook that attaches the fake_rtable,
while the 'fake rtable remove' hook gets removed from the hooklist
immediately after.

Fixes: 34666d467c ("netfilter: bridge: move br_netfilter out of the core")
Reported-by: Andreas Karis <akaris@redhat.com>
Debugged-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 13:01:10 -07:00
Florian Westphal 79e49503ef ipv6: avoid write to a possibly cloned skb
ip6_fragment, in case skb has a fraglist, checks if the
skb is cloned.  If it is, it will move to the 'slow path' and allocates
new skbs for each fragment.

However, right before entering the slowpath loop, it updates the
nexthdr value of the last ipv6 extension header to NEXTHDR_FRAGMENT,
to account for the fragment header that will be inserted in the new
ipv6-fragment skbs.

In case original skb is cloned this munges nexthdr value of another
skb.  Avoid this by doing the nexthdr update for each of the new fragment
skbs separately.

This was observed with tcpdump on a bridge device where netfilter ipv6
reassembly is active:  tcpdump shows malformed fragment headers as
the l4 header (icmpv6, tcp, etc). is decoded as a fragment header.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Andreas Karis <akaris@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 12:53:35 -07:00
Sabrina Dubroca 67e194007b ipv6: make ECMP route replacement less greedy
Commit 2759647247 ("ipv6: fix ECMP route replacement") introduced a
loop that removes all siblings of an ECMP route that is being
replaced. However, this loop doesn't stop when it has replaced
siblings, and keeps removing other routes with a higher metric.
We also end up triggering the WARN_ON after the loop, because after
this nsiblings < 0.

Instead, stop the loop when we have taken care of all routes with the
same metric as the route being replaced.

  Reproducer:
  ===========
    #!/bin/sh

    ip netns add ns1
    ip netns add ns2
    ip -net ns1 link set lo up

    for x in 0 1 2 ; do
        ip link add veth$x netns ns2 type veth peer name eth$x netns ns1
        ip -net ns1 link set eth$x up
        ip -net ns2 link set veth$x up
    done

    ip -net ns1 -6 r a 2000::/64 nexthop via fe80::0 dev eth0 \
            nexthop via fe80::1 dev eth1 nexthop via fe80::2 dev eth2
    ip -net ns1 -6 r a 2000::/64 via fe80::42 dev eth0 metric 256
    ip -net ns1 -6 r a 2000::/64 via fe80::43 dev eth0 metric 2048

    echo "before replace, 3 routes"
    ip -net ns1 -6 r | grep -v '^fe80\|^ff00'
    echo

    ip -net ns1 -6 r c 2000::/64 nexthop via fe80::4 dev eth0 \
            nexthop via fe80::5 dev eth1 nexthop via fe80::6 dev eth2

    echo "after replace, only 2 routes, metric 2048 is gone"
    ip -net ns1 -6 r | grep -v '^fe80\|^ff00'

Fixes: 2759647247 ("ipv6: fix ECMP route replacement")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 12:16:17 -07:00
Liping Zhang 03e5fd0e9b netfilter: nft_set_rbtree: use per-set rwlock to improve the scalability
Karel Rericha reported that in his test case, ICMP packets going through
boxes had normally about 5ms latency. But when running nft, actually
listing the sets with interval flags, latency would go up to 30-100ms.
This was observed when router throughput is from 600Mbps to 2Gbps.

This is because we use a single global spinlock to protect the whole
rbtree sets, so "dumping sets" will race with the "key lookup" inevitably.
But actually they are all _readers_, so it's ok to convert the spinlock
to rwlock to avoid competition between them. Also use per-set rwlock since
each set is independent.

Reported-by: Karel Rericha <karel@unitednetworks.cz>
Tested-by: Karel Rericha <karel@unitednetworks.cz>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 19:30:43 +01:00
Liping Zhang 2cb4bbd75b netfilter: limit: use per-rule spinlock to improve the scalability
The limit token is independent between each rules, so there's no
need to use a global spinlock.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 19:30:31 +01:00
Florian Westphal fc09e4a75a netfilter: nf_conntrack: reduce resolve_normal_ct args
also mark init_conntrack noinline, in most cases resolve_normal_ct will
find an existing conntrack entry.

text    data     bss     dec     hex filename
16735    5707     176   22618    585a net/netfilter/nf_conntrack_core.o
16687    5707     176   22570    582a net/netfilter/nf_conntrack_core.o

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 19:30:20 +01:00
Pablo Neira Ayuso 04166f48d9 Revert "netfilter: nf_tables: add flush field to struct nft_set_iter"
This reverts commit 1f48ff6c53.

This patch is not required anymore now that we keep a dummy list of
set elements in the bitmap set implementation, so revert this before
we forget this code has no clients.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 17:30:16 +01:00
Phil Sutter 055c4b34b9 netfilter: nft_fib: Support existence check
Instead of the actual interface index or name, set destination register
to just 1 or 0 depending on whether the lookup succeeded or not if
NFTA_FIB_F_PRESENT was set in userspace.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 13:45:36 +01:00
Florian Westphal 1a64edf54f netfilter: nft_ct: add helper set support
this allows to assign connection tracking helpers to
connections via nft objref infrastructure.

The idea is to first specifiy a helper object:

 table ip filter {
    ct helper some-name {
      type "ftp"
      protocol tcp
      l3proto ip
    }
 }

and then assign it via

nft add ... ct helper set "some-name"

helper assignment works for new conntracks only as we cannot expand the
conntrack extension area once it has been committed to the main conntrack
table.

ipv4 and ipv6 protocols are tracked stored separately so
we can also handle families that observe both ipv4 and ipv6 traffic.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 13:42:09 +01:00
Florian Westphal 84fba05511 netfilter: provide nft_ctx in object init function
this is needed by the upcoming ct helper object type --
we'd like to be able use the table family (ip, ip6, inet) to figure
out which helper has to be requested.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 13:42:00 +01:00
Pablo Neira Ayuso e920dde516 netfilter: nft_set_bitmap: keep a list of dummy elements
Element comments may come without any prior set flag, so we have to keep
a list of dummy struct nft_set_ext to keep this information around. This
is only useful for set dumps to userspace. From the packet path, this
set type relies on the bitmap representation. This patch simplifies the
logic since we don't need to allocate the dummy nft_set_ext structure
anymore on the fly at the cost of increasing memory consumption because
of the list of dummy struct nft_set_ext.

Fixes: 665153ff57 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 13:34:21 +01:00
Steven Rostedt (VMware) 170a1fb9c0 netfilter: Force fake conntrack entry to be at least 8 bytes aligned
Since the nfct and nfctinfo have been combined, the nf_conn structure
must be at least 8 bytes aligned, as the 3 LSB bits are used for the
nfctinfo. But there's a fake nf_conn structure to denote untracked
connections, which is created by a PER_CPU construct. This does not
guarantee that it will be 8 bytes aligned and can break the logic in
determining the correct nfctinfo.

I triggered this on a 32bit machine with the following error:

BUG: unable to handle kernel NULL pointer dereference at 00000af4
IP: nf_ct_deliver_cached_events+0x1b/0xfb
*pdpt = 0000000031962001 *pde = 0000000000000000

Oops: 0000 [#1] SMP
[Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport
  OK  ]
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
task: c126ec00 task.stack: c1258000
EIP: nf_ct_deliver_cached_events+0x1b/0xfb
EFLAGS: 00010202 CPU: 0
EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a
ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0
Call Trace:
 <SOFTIRQ>
 ? ipv6_skip_exthdr+0xac/0xcb
 ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6]
 nf_hook_slow+0x22/0xc7
 nf_hook+0x9a/0xad [ipv6]
 ? ip6t_do_table+0x356/0x379 [ip6_tables]
 ? ip6_fragment+0x9e9/0x9e9 [ipv6]
 ip6_output+0xee/0x107 [ipv6]
 ? ip6_fragment+0x9e9/0x9e9 [ipv6]
 dst_output+0x36/0x4d [ipv6]
 NF_HOOK.constprop.37+0xb2/0xba [ipv6]
 ? icmp6_dst_alloc+0x2c/0xfd [ipv6]
 ? local_bh_enable+0x14/0x14 [ipv6]
 mld_sendpack+0x1c5/0x281 [ipv6]
 ? mark_held_locks+0x40/0x5c
 mld_ifc_timer_expire+0x1f6/0x21e [ipv6]
 call_timer_fn+0x135/0x283
 ? detach_if_pending+0x55/0x55
 ? mld_dad_timer_expire+0x3e/0x3e [ipv6]
 __run_timers+0x111/0x14b
 ? mld_dad_timer_expire+0x3e/0x3e [ipv6]
 run_timer_softirq+0x1c/0x36
 __do_softirq+0x185/0x37c
 ? test_ti_thread_flag.constprop.19+0xd/0xd
 do_softirq_own_stack+0x22/0x28
 </SOFTIRQ>
 irq_exit+0x5a/0xa4
 smp_apic_timer_interrupt+0x2a/0x34
 apic_timer_interrupt+0x37/0x3c

By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte
alignment as all cache line sizes are at least 8 bytes or more.

Fixes: a9e419dc7b ("netfilter: merge ctinfo into nfct pointer storage area")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 13:33:58 +01:00
Florian Westphal 4ca60d08cb netfilter: bridge: honor frag_max_size when refragmenting
consider a bridge with mtu 9000, but end host sending smaller
packets to another host with mtu < 9000.

In this case, after reassembly, bridge+defrag would refragment,
and then attempt to send the reassembled packet as long as it
was below 9k.

Instead we have to cap by the largest fragment size seen.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 13:31:53 +01:00
Liping Zhang 10596608c4 netfilter: nf_tables: fix mismatch in big-endian system
Currently, there are two different methods to store an u16 integer to
the u32 data register. For example:
  u32 *dest = &regs->data[priv->dreg];
  1. *dest = 0; *(u16 *) dest = val_u16;
  2. *dest = val_u16;

For method 1, the u16 value will be stored like this, either in
big-endian or little-endian system:
  0          15           31
  +-+-+-+-+-+-+-+-+-+-+-+-+
  |   Value   |     0     |
  +-+-+-+-+-+-+-+-+-+-+-+-+

For method 2, in little-endian system, the u16 value will be the same
as listed above. But in big-endian system, the u16 value will be stored
like this:
  0          15           31
  +-+-+-+-+-+-+-+-+-+-+-+-+
  |     0     |   Value   |
  +-+-+-+-+-+-+-+-+-+-+-+-+

So later we use "memcmp(&regs->data[priv->sreg], data, 2);" to do
compare in nft_cmp, nft_lookup expr ..., method 2 will get the wrong
result in big-endian system, as 0~15 bits will always be zero.

For the similar reason, when loading an u16 value from the u32 data
register, we should use "*(u16 *) sreg;" instead of "(u16)*sreg;",
the 2nd method will get the wrong value in the big-endian system.

So introduce some wrapper functions to store/load an u8 or u16
integer to/from the u32 data register, and use them in the right
place.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 13:30:28 +01:00
Liping Zhang fd89b23a46 netfilter: nft_set_bitmap: fetch the element key based on the set->klen
Currently we just assume the element key as a u32 integer, regardless of
the set key length.

This is incorrect, for example, the tcp port number is only 16 bits.
So when we use the nft_payload expr to get the tcp dport and store
it to dreg, the dport will be stored at 0~15 bits, and 16~31 bits
will be padded with zero.

So the reg->data[dreg] will be looked like as below:
  0          15           31
  +-+-+-+-+-+-+-+-+-+-+-+-+
  | tcp dport |      0    |
  +-+-+-+-+-+-+-+-+-+-+-+-+
But for these big-endian systems, if we treate this register as a u32
integer, the element key will be larger than 65535, so the following
lookup in bitmap set will cause out of bound access.

Another issue is that if we add element with comments in bitmap
set(although the comments will be ignored eventually), the element will
vanish strangely. Because we treate the element key as a u32 integer, so
the comments will become the part of the element key, then the element
key will also be larger than 65535 and out of bound access will happen:
  # nft add element t s { 1 comment test }

Since set->klen is 1 or 2, it's fine to treate the element key as a u8 or
u16 integer.

Fixes: 665153ff57 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13 13:16:42 +01:00
David S. Miller e33cc31630 sch_tbf: Remove bogus semicolon in if() conditional.
Fixes: 49b499718f ("net: sched: make default fifo qdiscs appear in the dump")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 00:00:03 -07:00
Geliang Tang 27303fcf57 drop_monitor: use setup_timer
Use setup_timer() instead of init_timer() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:47:16 -07:00
David Ahern 79099aab38 mpls: Do not decrement alive counter for unregister events
Multipath routes can be rendered usesless when a device in one of the
paths is deleted. For example:

$ ip -f mpls ro ls
100
	nexthop as to 200 via inet 172.16.2.2  dev virt12
	nexthop as to 300 via inet 172.16.3.2  dev br0
101
	nexthop as to 201 via inet6 2000:2::2  dev virt12
	nexthop as to 301 via inet6 2000:3::2  dev br0

$ ip li del br0

When br0 is deleted the other hop is not considered in
mpls_select_multipath because of the alive check -- rt_nhn_alive
is 0.

rt_nhn_alive is decremented once in mpls_ifdown when the device is taken
down (NETDEV_DOWN) and again when it is deleted (NETDEV_UNREGISTER). For
a 2 hop route, deleting one device drops the alive count to 0. Since
devices are taken down before unregistering, the decrement on
NETDEV_UNREGISTER is redundant.

Fixes: c89359a42e ("mpls: support for dead routes")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:45:36 -07:00
David Ahern e37791ec1a mpls: Send route delete notifications when router module is unloaded
When the mpls_router module is unloaded, mpls routes are deleted but
notifications are not sent to userspace leaving userspace caches
out of sync. Add the call to mpls_notify_route in mpls_net_exit as
routes are freed.

Fixes: 0189197f44 ("mpls: Basic routing support")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:39:32 -07:00
Etienne Noss 52491c7607 act_connmark: avoid crashing on malformed nlattrs with null parms
tcf_connmark_init does not check in its configuration if TCA_CONNMARK_PARMS
is set, resulting in a null pointer dereference when trying to access it.

[501099.043007] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[501099.043039] IP: [<ffffffffc10c60fb>] tcf_connmark_init+0x8b/0x180 [act_connmark]
...
[501099.044334] Call Trace:
[501099.044345]  [<ffffffffa47270e8>] ? tcf_action_init_1+0x198/0x1b0
[501099.044363]  [<ffffffffa47271b0>] ? tcf_action_init+0xb0/0x120
[501099.044380]  [<ffffffffa47250a4>] ? tcf_exts_validate+0xc4/0x110
[501099.044398]  [<ffffffffc0f5fa97>] ? u32_set_parms+0xa7/0x270 [cls_u32]
[501099.044417]  [<ffffffffc0f60bf0>] ? u32_change+0x680/0x87b [cls_u32]
[501099.044436]  [<ffffffffa4725d1d>] ? tc_ctl_tfilter+0x4dd/0x8a0
[501099.044454]  [<ffffffffa44a23a1>] ? security_capable+0x41/0x60
[501099.044471]  [<ffffffffa470ca01>] ? rtnetlink_rcv_msg+0xe1/0x220
[501099.044490]  [<ffffffffa470c920>] ? rtnl_newlink+0x870/0x870
[501099.044507]  [<ffffffffa472cc61>] ? netlink_rcv_skb+0xa1/0xc0
[501099.044524]  [<ffffffffa47073f4>] ? rtnetlink_rcv+0x24/0x30
[501099.044541]  [<ffffffffa472c634>] ? netlink_unicast+0x184/0x230
[501099.044558]  [<ffffffffa472c9d8>] ? netlink_sendmsg+0x2f8/0x3b0
[501099.044576]  [<ffffffffa46d8880>] ? sock_sendmsg+0x30/0x40
[501099.044592]  [<ffffffffa46d8e03>] ? SYSC_sendto+0xd3/0x150
[501099.044608]  [<ffffffffa425fda1>] ? __do_page_fault+0x2d1/0x510
[501099.044626]  [<ffffffffa47fbd7b>] ? system_call_fast_compare_end+0xc/0x9b

Fixes: 22a5dc0e5e ("net: sched: Introduce connmark action")
Signed-off-by: Étienne Noss <etienne.noss@wifirst.fr>
Signed-off-by: Victorien Molle <victorien.molle@wifirst.fr>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:32:41 -07:00
Gao Feng 8b57fd1ec1 net: Eliminate duplicated codes by creating one new function in_dev_select_addr
There are two duplicated loops codes which used to select right
address in current codes. Now eliminate these codes by creating
one new function in_dev_select_addr.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:27:09 -07:00
Xin Long c0d8bab6ae sctp: add get and set sockopt for reconf_enable
This patchset is to add SCTP_RECONFIG_SUPPORTED sockopt, it would
set and get asoc reconf_enable value when asoc_id is set, or it
would set and get ep reconf_enalbe value if asoc_id is 0.

It is also to add sysctl interface for users to set the default
value for reconf_enable.

After this patch, stream reconf will work.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:22:24 -07:00
Xin Long 11ae76e67a sctp: implement receiver-side procedures for the Reconf Response Parameter
This patch is to implement Receiver-Side Procedures for the
Re-configuration Response Parameter in rfc6525 section 5.2.7.

sctp_process_strreset_resp would process the response for any
kind of reconf request, and the stream reconf is applied only
when the response result is success.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:22:24 -07:00
Xin Long c5c4ebb3ab sctp: implement receiver-side procedures for the Add Incoming Streams Request Parameter
This patch is to implement Receiver-Side Procedures for the Add Incoming
Streams Request Parameter described in rfc6525 section 5.2.6.

It is also to fix that it shouldn't have add streams when sending addstrm
in request, as the process in peer will handle it by sending a addstrm out
request back.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:22:24 -07:00
Xin Long 50a41591f1 sctp: implement receiver-side procedures for the Add Outgoing Streams Request Parameter
This patch is to add Receiver-Side Procedures for the Add Outgoing
Streams Request Parameter described in section 5.2.5.

It is also to improve sctp_chunk_lookup_strreset_param, so that it
can be used for processing addstrm_out request.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:22:23 -07:00
Xin Long b444153fb5 sctp: add support for generating add stream change event notification
This patch is to add Stream Change Event described in rfc6525
section 6.1.3.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:22:23 -07:00
Xin Long 692787cef6 sctp: implement receiver-side procedures for the SSN/TSN Reset Request Parameter
This patch is to implement Receiver-Side Procedures for the SSN/TSN
Reset Request Parameter described in rfc6525 section 6.2.4.

The process is kind of complicate, it's wonth having some comments
from section 6.2.4 in the codes.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:22:23 -07:00
Xin Long c95129d127 sctp: add support for generating assoc reset event notification
This patch is to add Association Reset Event described in rfc6525
section 6.1.2.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 23:22:23 -07:00
subashab@codeaurora.org 5425077d73 net: ipv6: Add early demux handler for UDP unicast
While running a single stream UDPv6 test, we observed that amount
of CPU spent in NET_RX softirq was much greater than UDPv4 for an
equivalent receive rate. The test here was run on an ARM64 based
Android system. On further analysis with perf, we found that UDPv6
was spending significant time in the statistics netfilter targets
which did socket lookup per packet. These statistics rules perform
a lookup when there is no socket associated with the skb. Since
there are multiple instances of these rules based on UID, there
will be equal number of lookups per skb.

By introducing early demux for UDPv6, we avoid the redundant lookups.
This also helped to improve the performance (800Mbps -> 870Mbps) on a
CPU limited system in a single stream UDPv6 receive test with 1450
byte sized datagrams using iperf.

v1->v2: Use IPv6 cookie to validate dst instead of 0 as suggested
by Eric

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 22:54:17 -07:00
Jiri Kosina 49b499718f net: sched: make default fifo qdiscs appear in the dump
The original reason [1] for having hidden qdiscs (potential scalability
issues in qdisc_match_from_root() with single linked list in case of large
amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
qdisc linked list to hashtable").

This allows us for bringing more clarity and determinism into the dump by
making default pfifo qdiscs visible.

We're not turning this on by default though, at it was deemed [2] too
intrusive / unnecessary change of default behavior towards userspace.
Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
applications to request complete qdisc hierarchy dump, including the
ones that have always been implicit/invisible.

Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
about singletons would require quite some surgery with very little gain
(seeing no qdisc or seeing noop qdisc in the dump is probably setting
the same user expectation).

[1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
[2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 22:53:02 -07:00
Ido Schimmel d05f7a7dd4 ipv4: fib: Remove redundant argument
We always pass the same event type to fib_notify() and
fib_rules_notify(), so we can safely drop this argument.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-10 09:45:09 -08:00
Ido Schimmel c0243892cb ipv4: fib: Move FIB notification code to a separate file
Most of the code concerned with the FIB notification chain currently
resides in fib_trie.c, but this isn't really appropriate, as the FIB
notification chain is also used for FIB rules.

Therefore, it makes sense to move the common FIB notification code to a
separate file and have it export the relevant functions, which can be
invoked by its different users (e.g., fib_trie.c, fib_rules.c).

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-10 09:45:09 -08:00
David Howells 702f2ac87a rxrpc: Wake up the transmitter if Rx window size increases on the peer
The RxRPC ACK packet may contain an extension that includes the peer's
current Rx window size for this call.  We adjust the local Tx window size
to match.  However, the transmitter can stall if the receive window is
reduced to 0 by the peer and then reopened.

This is because the normal way that the transmitter is re-energised is by
dropping something out of our Tx queue and thus making space.  When a
single gap is made, the transmitter is woken up.  However, because there's
nothing in the Tx queue at this point, this doesn't happen.

To fix this, perform a wake_up() any time we see the peer's Rx window size
increasing.

The observable symptom is that calls start failing on ETIMEDOUT and the
following:

	kAFS: SERVER DEAD state=-62

appears in dmesg.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-10 09:34:23 -08:00
David Howells 6fc166d62c rxrpc: rxrpc_kernel_send_data() needs to handle failed call better
If rxrpc_kernel_send_data() is asked to send data through a call that has
already failed (due to a remote abort, received protocol error or network
error), then return the associated error code saved in the call rather than
ESHUTDOWN.

This allows the caller to work out whether to ask for the abort code or not
based on this.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:30:10 -08:00
Alexey Kodanev 4b3b45edba udp: avoid ufo handling on IP payload compression packets
commit c146066ab8 ("ipv4: Don't use ufo handling on later transformed
packets") and commit f89c56ce71 ("ipv6: Don't use ufo handling on
later transformed packets") added a check that 'rt->dst.header_len' isn't
zero in order to skip UFO, but it doesn't include IPcomp in transport mode
where it equals zero.

Packets, after payload compression, may not require further fragmentation,
and if original length exceeds MTU, later compressed packets will be
transmitted incorrectly. This can be reproduced with LTP udp_ipsec.sh test
on veth device with enabled UFO, MTU is 1500 and UDP payload is 2000:

* IPv4 case, offset is wrong + unnecessary fragmentation
    udp_ipsec.sh -p comp -m transport -s 2000 &
    tcpdump -ni ltp_ns_veth2
    ...
    IP (tos 0x0, ttl 64, id 45203, offset 0, flags [+],
      proto Compressed IP (108), length 49)
      10.0.0.2 > 10.0.0.1: IPComp(cpi=0x1000)
    IP (tos 0x0, ttl 64, id 45203, offset 1480, flags [none],
      proto UDP (17), length 21) 10.0.0.2 > 10.0.0.1: ip-proto-17

* IPv6 case, sending small fragments
    udp_ipsec.sh -6 -p comp -m transport -s 2000 &
    tcpdump -ni ltp_ns_veth2
    ...
    IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
      payload length: 37) fd00::2 > fd00::1: IPComp(cpi=0x1000)
    IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
      payload length: 21) fd00::2 > fd00::1: IPComp(cpi=0x1000)

Fix it by checking 'rt->dst.xfrm' pointer to 'xfrm_state' struct, skip UFO
if xfrm is set. So the new check will include both cases: IPcomp and IPsec.

Fixes: c146066ab8 ("ipv4: Don't use ufo handling on later transformed packets")
Fixes: f89c56ce71 ("ipv6: Don't use ufo handling on later transformed packets")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:28:42 -08:00
Alexey Kodanev a30aad50c2 tcp: rename *_sequence_number() to *_seq_and_tsoff()
The functions that are returning tcp sequence number also setup
TS offset value, so rename them to better describe their purpose.

No functional changes in this patch.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:25:34 -08:00
David Howells cdfbabfb2f net: Work around lockdep limitation in sockets that use sockets
Lockdep issues a circular dependency warning when AFS issues an operation
through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

The theory lockdep comes up with is as follows:

 (1) If the pagefault handler decides it needs to read pages from AFS, it
     calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
     creating a call requires the socket lock:

	mmap_sem must be taken before sk_lock-AF_RXRPC

 (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
     binds the underlying UDP socket whilst holding its socket lock.
     inet_bind() takes its own socket lock:

	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

 (3) Reading from a TCP socket into a userspace buffer might cause a fault
     and thus cause the kernel to take the mmap_sem, but the TCP socket is
     locked whilst doing this:

	sk_lock-AF_INET must be taken before mmap_sem

However, lockdep's theory is wrong in this instance because it deals only
with lock classes and not individual locks.  The AF_INET lock in (2) isn't
really equivalent to the AF_INET lock in (3) as the former deals with a
socket entirely internal to the kernel that never sees userspace.  This is
a limitation in the design of lockdep.

Fix the general case by:

 (1) Double up all the locking keys used in sockets so that one set are
     used if the socket is created by userspace and the other set is used
     if the socket is created by the kernel.

 (2) Store the kern parameter passed to sk_alloc() in a variable in the
     sock struct (sk_kern_sock).  This informs sock_lock_init(),
     sock_init_data() and sk_clone_lock() as to the lock keys to be used.

     Note that the child created by sk_clone_lock() inherits the parent's
     kern setting.

 (3) Add a 'kern' parameter to ->accept() that is analogous to the one
     passed in to ->create() that distinguishes whether kernel_accept() or
     sys_accept4() was the caller and can be passed to sk_alloc().

     Note that a lot of accept functions merely dequeue an already
     allocated socket.  I haven't touched these as the new socket already
     exists before we get the parameter.

     Note also that there are a couple of places where I've made the accepted
     socket unconditionally kernel-based:

	irda_accept()
	rds_rcp_accept_one()
	tcp_accept_from_sock()

     because they follow a sock_create_kern() and accept off of that.

Whilst creating this, I noticed that lustre and ocfs don't create sockets
through sock_create_kern() and thus they aren't marked as for-kernel,
though they appear to be internal.  I wonder if these should do that so
that they use the new set of lock keys.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:23:27 -08:00
Alexander Potapenko 9f138fa609 net: initialize msg.msg_flags in recvfrom
KMSAN reports a use of uninitialized memory in put_cmsg() because
msg.msg_flags in recvfrom haven't been initialized properly.
The flag values don't affect the result on this path, but it's still a
good idea to initialize them explicitly.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 17:21:21 -08:00
Jakub Kicinski abb521e36b ethtool: add CRC32 as an RSS hash function
CRC32 engines are usually easily available in hardware and generate
OK spread for RSS hash.  Add CRC32 RSS hash function to ethtool API.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 16:39:58 -08:00
Paolo Abeni 581319c586 net/socket: use per af lockdep classes for sk queues
Currently the sock queue's spin locks get their lockdep
classes by the default init_spin_lock() initializer:
all socket families get - usually, see below - a single
class for rx, another specific class for tx, etc.
This can lead to false positive lockdep splat, as
reported by Andrey.
Moreover there are two separate initialization points
for the sock queues, one in sk_clone_lock() and one
in sock_init_data(), so that e.g. the rx queue lock
can get one of two possible, different classes, depending
on the socket being cloned or not.
This change tries to address the above, setting explicitly
a per address family lockdep class for each queue's
spinlock. Also, move the duplicated initialization code to a
single location.

v1 -> v2:
 - renamed the init helper

rfc -> v1:
 - no changes, tested with several different workload

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 16:36:45 -08:00
Paolo Abeni 294acf1c01 net/tunnel: set inner protocol in network gro hooks
The gso code of several tunnels type (gre and udp tunnels)
takes for granted that the skb->inner_protocol is properly
initialized and drops the packet elsewhere.

On the forwarding path no one is initializing such field,
so gro encapsulated packets are dropped on forward.

Since commit 3872035241 ("gre: Use inner_proto to obtain
inner header protocol"), this can be reproduced when the
encapsulated packets use gre as the tunneling protocol.

The issue happens also with vxlan and geneve tunnels since
commit 8bce6d7d0d ("udp: Generalize skb_udp_segment"), if the
forwarding host's ingress nic has h/w offload for such tunnel
and a vxlan/geneve device is configured on top of it, regardless
of the configured peer address and vni.

To address the issue, this change initialize the inner_protocol
field for encapsulated packets in both ipv4 and ipv6 gro complete
callbacks.

Fixes: 3872035241 ("gre: Use inner_proto to obtain inner header protocol")
Fixes: 8bce6d7d0d ("udp: Generalize skb_udp_segment")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 13:19:52 -08:00
Zhu Yanjun 3b12f73a5c rds: ib: add error handle
In the function rds_ib_setup_qp, the error handle is missing. When some
error occurs, it is possible that memory leak occurs. As such, error
handle is added.

Cc: Joe Jin <joe.jin@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Guanglei Li <guanglei.li@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 13:09:18 -08:00
David Ahern 5be083cedc net: ipv6: Remove redundant RTA_OIF in multipath routes
Dinesh reported that RTA_MULTIPATH nexthops are 8-bytes larger with IPv6
than IPv4. The recent refactoring for multipath support in netlink
messages does discriminate between non-multipath which needs the OIF
and multipath which adds a rtnexthop struct for each hop making the
RTA_OIF attribute redundant. Resolve by adding a flag to the info
function to skip the oif for multipath.

Fixes: beb1afac51 ("net: ipv6: Add support to dump multipath routes
       via RTA_MULTIPATH attribute")
Reported-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 13:04:48 -08:00
Kinglong Mee 5427290d64 SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt
The xprt for backchannel is created separately, not in TCP/UDP code.  It
needs the XPT_CONG_CTRL flag set on it too--otherwise requests on the
NFSv4.1 backchannel are rjected in svc_process_common():

1191         if (versp->vs_need_cong_ctrl &&
1192             !test_bit(XPT_CONG_CTRL, &rqstp->rq_xprt->xpt_flags))
1193                 goto err_bad_vers;

Fixes: 5283b03ee5 ("nfs/nfsd/sunrpc: enforce transport...")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-03-09 15:20:46 -05:00
Jiri Pirko 7c92de8eaa flow_dissector: Move GRE dissection into a separate function
Make the main flow_dissect function a bit smaller and move the GRE
dissection into a separate function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-08 23:08:57 -08:00
Jiri Pirko c5ef188e93 flow_dissector: rename "proto again" goto label
Align with "ip_proto_again" label used in the same function and rename
vague "again" to "proto_again".

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-08 23:08:57 -08:00
Jiri Pirko d5774b93f0 flow_dissector: Fix GRE header error path
Now, when an unexpected element in the GRE header appears, we break so
the l4 ports are processed. But since the ports are processed
unconditionally, there will be certainly random values dissected. Fix
this by just bailing out in such situations.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-08 23:08:57 -08:00
Jiri Pirko 4a5d6c8b14 flow_dissector: Move MPLS dissection into a separate function
Make the main flow_dissect function a bit smaller and move the MPLS
dissection into a separate function. Along with that, do the MPLS header
processing only in case the flow dissection user requires it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-08 23:08:57 -08:00
Jiri Pirko 9bf881ffc5 flow_dissector: Move ARP dissection into a separate function
Make the main flow_dissect function a bit smaller and move the ARP
dissection into a separate function. Along with that, do the ARP header
processing only in case the flow dissection user requires it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-08 23:08:57 -08:00
Taehee Yoo c1183db885 netfilter: nf_reject: remove unused variable
variable oiph is not used.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-08 19:02:17 +01:00
Florian Westphal efc9b8e33b netfilter: bridge: remove unneeded rcu_read_lock
as comment says, the function is always called with rcu read lock held.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-08 19:01:23 +01:00
Ying Xue 8e05ba7f84 netfilter: nf_nat_sctp: fix ICMP packet to be dropped accidently
Regarding RFC 792, the first 64 bits of the original SCTP datagram's
data could be contained in ICMP packet, such as:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |     Code      |          Checksum             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             unused                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Internet Header + 64 bits of Original Data Datagram      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

However, according to RFC 4960, SCTP datagram header is as below:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Source Port Number        |     Destination Port Number   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Verification Tag                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Checksum                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

It means only the first three fields of SCTP header can be carried in
ICMP packet except for Checksum field.

At present in sctp_manip_pkt(), no matter whether the packet is ICMP or
not, it always calculates SCTP packet checksum. However, not only the
calculation of checksum is unnecessary for ICMP, but also it causes
another fatal issue that ICMP packet is dropped. The header size of
SCTP is used to identify whether the writeable length of skb is bigger
than skb->len through skb_make_writable() in sctp_manip_pkt(). But
when it deals with ICMP packet, skb_make_writable() directly returns
false as the writeable length of skb is bigger than skb->len.
Subsequently ICMP is dropped.

Now we correct this misbahavior. When sctp_manip_pkt() handles ICMP
packet, 8 bytes rather than the whole SCTP header size is used to check
if writeable length of skb is overflowed. Meanwhile, as it's meaningless
to calculate checksum when packet is ICMP, the computation of checksum
is ignored as well.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-08 18:04:06 +01:00
Florian Westphal 7b4fdf77a4 netfilter: don't track fragmented packets
Andrey reports syzkaller splat caused by

NF_CT_ASSERT(!ip_is_fragment(ip_hdr(skb)));

in ipv4 nat.  But this assertion (and the comment) are wrong, this function
does see fragments when IP_NODEFRAG setsockopt is used.

As conntrack doesn't track packets without complete l4 header, only the
first fragment is tracked.

Because applying nat to first packet but not the rest makes no sense this
also turns off tracking of all fragments.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-08 18:02:12 +01:00
Johannes Berg e8e4f5280d mac80211: reject/clear user rate mask if not usable
If the user rate mask results in no (basic) rates being usable,
clear it. Also, if we're already operating when it's set, reject
it instead.

Technically, selecting basic rates as the criterion is a bit too
restrictive, but calculating the usable rates over all stations
(e.g. in AP mode) is harder, and all stations must support the
basic rates. Similarly, in client mode, the basic rates will be
used anyway for control frames.

This fixes the "no supported rates (...) in rate_mask ..." warning
that occurs on TX when you've selected a rate mask that's not
compatible with the connection (e.g. an AP that enables only the
rates 36, 48, 54 and you've selected only 6, 9, 12.)

Reported-by: Kirtika Ruchandani <kirtika@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-08 14:20:01 +01:00
David S. Miller 8474c8caac Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2017-03-06

1) Fix lockdep splat on xfrm policy subsystem initialization.
   From Florian Westphal.

2) When using socket policies on IPv4-mapped IPv6 addresses,
   we access the flow informations of the wrong address family
   what leads to an out of bounds access. Fix this by using
   the family we get with the dst_entry, like we do it for the
   standard policy lookup.

3) vti6 can report a PMTU below IPV6_MIN_MTU. Fix this by
   adding a check for that before sending a ICMPV6_PKT_TOOBIG
   message.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 15:00:37 -08:00
WANG Cong 15e668070a ipv6: reorder icmpv6_init() and ip6_mr_init()
Andrey reported the following kernel crash:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 14446 Comm: syz-executor6 Not tainted 4.10.0+ #82
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88001f311700 task.stack: ffff88001f6e8000
RIP: 0010:ip6mr_sk_done+0x15a/0x3d0 net/ipv6/ip6mr.c:1618
RSP: 0018:ffff88001f6ef418 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 1ffff10003edde8c RCX: ffffc900043ee000
RDX: 0000000000000004 RSI: ffffffff83e3b3f8 RDI: 0000000000000020
RBP: ffff88001f6ef508 R08: fffffbfff0dcc5d8 R09: 0000000000000000
R10: ffffffff86e62ec0 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88001f6ef4e0 R15: ffff8800380a0040
FS:  00007f7a52cec700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000061c500 CR3: 000000001f1ae000 CR4: 00000000000006f0
DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
 rawv6_close+0x4c/0x80 net/ipv6/raw.c:1217
 inet_release+0xed/0x1c0 net/ipv4/af_inet.c:425
 inet6_release+0x50/0x70 net/ipv6/af_inet6.c:432
 sock_release+0x8d/0x1e0 net/socket.c:597
 __sock_create+0x39d/0x880 net/socket.c:1226
 sock_create_kern+0x3f/0x50 net/socket.c:1243
 inet_ctl_sock_create+0xbb/0x280 net/ipv4/af_inet.c:1526
 icmpv6_sk_init+0x163/0x500 net/ipv6/icmp.c:954
 ops_init+0x10a/0x550 net/core/net_namespace.c:115
 setup_net+0x261/0x660 net/core/net_namespace.c:291
 copy_net_ns+0x27e/0x540 net/core/net_namespace.c:396
9pnet_virtio: no channels available for device ./file1
 create_new_namespaces+0x437/0x9b0 kernel/nsproxy.c:106
 unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:205
 SYSC_unshare kernel/fork.c:2281 [inline]
 SyS_unshare+0x64e/0x1000 kernel/fork.c:2231
 entry_SYSCALL_64_fastpath+0x1f/0xc2

This is because net->ipv6.mr6_tables is not initialized at that point,
ip6mr_rules_init() is not called yet, therefore on the error path when
we iterator the list, we trigger this oops. Fix this by reordering
ip6mr_rules_init() before icmpv6_sk_init().

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:57:33 -08:00
Eric Dumazet 62f8f4d906 dccp: fix use-after-free in dccp_feat_activate_values
Dmitry reported crashes in DCCP stack [1]

Problem here is that when I got rid of listener spinlock, I missed the
fact that DCCP stores a complex state in struct dccp_request_sock,
while TCP does not.

Since multiple cpus could access it at the same time, we need to add
protection.

[1]
BUG: KASAN: use-after-free in dccp_feat_activate_values+0x967/0xab0
net/dccp/feat.c:1541 at addr ffff88003713be68
Read of size 8 by task syz-executor2/8457
CPU: 2 PID: 8457 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #127
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x292/0x398 lib/dump_stack.c:51
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
 print_address_description mm/kasan/report.c:200 [inline]
 kasan_report_error mm/kasan/report.c:289 [inline]
 kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
 kasan_report mm/kasan/report.c:332 [inline]
 __asan_report_load8_noabort+0x29/0x30 mm/kasan/report.c:332
 dccp_feat_activate_values+0x967/0xab0 net/dccp/feat.c:1541
 dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
 dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
 dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
 dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
 ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
 dst_input include/net/dst.h:507 [inline]
 ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
 __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
 __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
 process_backlog+0xe5/0x6c0 net/core/dev.c:4839
 napi_poll net/core/dev.c:5202 [inline]
 net_rx_action+0xe70/0x1900 net/core/dev.c:5267
 __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
 do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902
 </IRQ>
 do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328
 do_softirq kernel/softirq.c:176 [inline]
 __local_bh_enable_ip+0x1f2/0x200 kernel/softirq.c:181
 local_bh_enable include/linux/bottom_half.h:31 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:971 [inline]
 ip6_finish_output2+0xbb0/0x23d0 net/ipv6/ip6_output.c:123
 ip6_finish_output+0x302/0x960 net/ipv6/ip6_output.c:148
 NF_HOOK_COND include/linux/netfilter.h:246 [inline]
 ip6_output+0x1cb/0x8d0 net/ipv6/ip6_output.c:162
 ip6_xmit+0xcdf/0x20d0 include/net/dst.h:501
 inet6_csk_xmit+0x320/0x5f0 net/ipv6/inet6_connection_sock.c:179
 dccp_transmit_skb+0xb09/0x1120 net/dccp/output.c:141
 dccp_xmit_packet+0x215/0x760 net/dccp/output.c:280
 dccp_write_xmit+0x168/0x1d0 net/dccp/output.c:362
 dccp_sendmsg+0x79c/0xb10 net/dccp/proto.c:796
 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
 sock_sendmsg_nosec net/socket.c:635 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:645
 SYSC_sendto+0x660/0x810 net/socket.c:1687
 SyS_sendto+0x40/0x50 net/socket.c:1655
 entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x4458b9
RSP: 002b:00007f8ceb77bb58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 00000000004458b9
RDX: 0000000000000023 RSI: 0000000020e60000 RDI: 0000000000000017
RBP: 00000000006e1b90 R08: 00000000200f9fe1 R09: 0000000000000020
R10: 0000000000008010 R11: 0000000000000282 R12: 00000000007080a8
R13: 0000000000000000 R14: 00007f8ceb77c9c0 R15: 00007f8ceb77c700
Object at ffff88003713be50, in cache kmalloc-64 size: 64
Allocated:
PID = 8446
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
 kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2738
 kmalloc include/linux/slab.h:490 [inline]
 dccp_feat_entry_new+0x214/0x410 net/dccp/feat.c:467
 dccp_feat_push_change+0x38/0x220 net/dccp/feat.c:487
 __feat_register_sp+0x223/0x2f0 net/dccp/feat.c:741
 dccp_feat_propagate_ccid+0x22b/0x2b0 net/dccp/feat.c:949
 dccp_feat_server_ccid_dependencies+0x1b3/0x250 net/dccp/feat.c:1012
 dccp_make_response+0x1f1/0xc90 net/dccp/output.c:423
 dccp_v6_send_response+0x4ec/0xc20 net/dccp/ipv6.c:217
 dccp_v6_conn_request+0xaba/0x11b0 net/dccp/ipv6.c:377
 dccp_rcv_state_process+0x51e/0x1650 net/dccp/input.c:606
 dccp_v6_do_rcv+0x213/0x350 net/dccp/ipv6.c:632
 sk_backlog_rcv include/net/sock.h:893 [inline]
 __sk_receive_skb+0x36f/0xcc0 net/core/sock.c:479
 dccp_v6_rcv+0xba5/0x1d00 net/dccp/ipv6.c:742
 ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
 dst_input include/net/dst.h:507 [inline]
 ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
 __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
 __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
 process_backlog+0xe5/0x6c0 net/core/dev.c:4839
 napi_poll net/core/dev.c:5202 [inline]
 net_rx_action+0xe70/0x1900 net/core/dev.c:5267
 __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
Freed:
PID = 15
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514 [inline]
 kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
 slab_free_hook mm/slub.c:1355 [inline]
 slab_free_freelist_hook mm/slub.c:1377 [inline]
 slab_free mm/slub.c:2954 [inline]
 kfree+0xe8/0x2b0 mm/slub.c:3874
 dccp_feat_entry_destructor.part.4+0x48/0x60 net/dccp/feat.c:418
 dccp_feat_entry_destructor net/dccp/feat.c:416 [inline]
 dccp_feat_list_pop net/dccp/feat.c:541 [inline]
 dccp_feat_activate_values+0x57f/0xab0 net/dccp/feat.c:1543
 dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
 dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
 dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
 dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
 ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
 dst_input include/net/dst.h:507 [inline]
 ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
 __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
 __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
 process_backlog+0xe5/0x6c0 net/core/dev.c:4839
 napi_poll net/core/dev.c:5202 [inline]
 net_rx_action+0xe70/0x1900 net/core/dev.c:5267
 __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
Memory state around the buggy address:
 ffff88003713bd00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88003713bd80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88003713be00: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb
                                                          ^

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:15:27 -08:00
Alexey Khoroshilov 6c4dc75c25 net/sched: act_skbmod: remove unneeded rcu_read_unlock in tcf_skbmod_dump
Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:13:03 -08:00
Sowmini Varadhan b21dd4506b rds: tcp: Sequence teardown of listen and acceptor sockets to avoid races
Commit a93d01f577 ("RDS: TCP: avoid bad page reference in
rds_tcp_listen_data_ready") added the function
rds_tcp_listen_sock_def_readable()  to handle the case when a
partially set-up acceptor socket drops into rds_tcp_listen_data_ready().
However, if the listen socket (rtn->rds_tcp_listen_sock) is itself going
through a tear-down via rds_tcp_listen_stop(), the (*ready)() will be
null and we would hit a panic  of the form
  BUG: unable to handle kernel NULL pointer dereference at   (null)
  IP:           (null)
   :
  ? rds_tcp_listen_data_ready+0x59/0xb0 [rds_tcp]
  tcp_data_queue+0x39d/0x5b0
  tcp_rcv_established+0x2e5/0x660
  tcp_v4_do_rcv+0x122/0x220
  tcp_v4_rcv+0x8b7/0x980
    :
In the above case, it is not fatal to encounter a NULL value for
ready- we should just drop the packet and let the flush of the
acceptor thread finish gracefully.

In general, the tear-down sequence for listen() and accept() socket
that is ensured by this commit is:
     rtn->rds_tcp_listen_sock = NULL; /* prevent any new accepts */
     In rds_tcp_listen_stop():
         serialize with, and prevent, further callbacks using lock_sock()
         flush rds_wq
         flush acceptor workq
         sock_release(listen socket)

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:09:59 -08:00
Sowmini Varadhan 16c09b1c76 rds: tcp: Reorder initialization sequence in rds_tcp_init to avoid races
Order of initialization in rds_tcp_init needs to be done so
that resources are set up and destroyed in the correct synchronization
sequence with both the data path, as well as netns create/destroy
path. Specifically,

- we must call register_pernet_subsys and get the rds_tcp_netid
  before calling register_netdevice_notifier, otherwise we risk
  the sequence
    1. register_netdevice_notifier sets up netdev notifier callback
    2. rds_tcp_dev_event -> rds_tcp_kill_sock uses netid 0, and finds
       the wrong rtn, resulting in a panic with string that is of the form:

  BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
  IP: rds_tcp_kill_sock+0x3a/0x1d0 [rds_tcp]
         :

- the rds_tcp_incoming_slab kmem_cache must be initialized before the
  datapath starts up. The latter can happen any time after the
  pernet_subsys registration of rds_tcp_net_ops, whose -> init
  function sets up the listen socket. If the rds_tcp_incoming_slab has
  not been set up at that time, a panic of the form below may be
  encountered

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
  IP: kmem_cache_alloc+0x90/0x1c0
     :
  rds_tcp_data_recv+0x1e7/0x370 [rds_tcp]
  tcp_read_sock+0x96/0x1c0
  rds_tcp_recv_path+0x65/0x80 [rds_tcp]
     :

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:09:59 -08:00
Sowmini Varadhan 8edc3affc0 rds: tcp: Take explicit refcounts on struct net
It is incorrect for the rds_connection to piggyback on the
sock_net() refcount for the netns because this gives rise to
a chicken-and-egg problem during rds_conn_destroy. Instead explicitly
take a ref on the net, and hold the netns down till the connection
tear-down is complete.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:09:59 -08:00
Gao Feng 9c28286b1b decnet: Use TCP nagle macro instead of literal number in decnet
Use existing TCP nagle macro TCP_NAGLE_OFF and TCP_NAGLE_CORK instead
of the literal number 1 and 2 in the current decnet codes.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:07:55 -08:00
Eric Dumazet 9ac25fc063 net: fix socket refcounting in skb_complete_tx_timestamp()
TX skbs do not necessarily hold a reference on skb->sk->sk_refcnt
By the time TX completion happens, sk_refcnt might be already 0.

sock_hold()/sock_put() would then corrupt critical state, like
sk_wmem_alloc and lead to leaks or use after free.

Fixes: 62bccb8cdb ("net-timestamp: Make the clone operation stand-alone from phy timestamping")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:06:15 -08:00
Eric Dumazet dd4f10722a net: fix socket refcounting in skb_complete_wifi_ack()
TX skbs do not necessarily hold a reference on skb->sk->sk_refcnt
By the time TX completion happens, sk_refcnt might be already 0.

sock_hold()/sock_put() would then corrupt critical state, like
sk_wmem_alloc.

Fixes: bf7fa551e0 ("mac80211: Resolve sk_refcnt/sk_wmem_alloc issue in wifi ack path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:06:14 -08:00
David Howells 146d8fef9d rxrpc: Call state should be read with READ_ONCE() under some circumstances
The call state may be changed at any time by the data-ready routine in
response to received packets, so if the call state is to be read and acted
upon several times in a function, READ_ONCE() must be used unless the call
state lock is held.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 13:59:06 -08:00
Eric Dumazet 02b2faaf0a tcp: fix various issues for sockets morphing to listen state
Dmitry Vyukov reported a divide by 0 triggered by syzkaller, exploiting
tcp_disconnect() path that was never really considered and/or used
before syzkaller ;)

I was not able to reproduce the bug, but it seems issues here are the
three possible actions that assumed they would never trigger on a
listener.

1) tcp_write_timer_handler
2) tcp_delack_timer_handler
3) MTU reduction

Only IPv6 MTU reduction was properly testing TCP_CLOSE and TCP_LISTEN
 states from tcp_v6_mtu_reduced()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 13:58:33 -08:00
Ilya Dryomov 7cc5e38f2f libceph: osd_request_timeout option
osd_request_timeout specifies how many seconds to wait for a response
from OSDs before returning -ETIMEDOUT from an OSD request.  0 (default)
means no limit.

osd_request_timeout is osdkeepalive-precise -- in-flight requests are
swept through every osdkeepalive seconds.  With ack vs commit behaviour
gone, abort_request() is really simple.

This is based on a patch from Artur Molchanov <artur.molchanov@synesis.ru>.

Tested-by: Artur Molchanov <artur.molchanov@synesis.ru>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-03-07 14:30:38 +01:00
Ilya Dryomov b581a5854e libceph: don't set weight to IN when OSD is destroyed
Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs.  Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.

Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.

Fixes: 930c532869 ("libceph: apply new_state before new_up_client on incrementals")
Link: http://tracker.ceph.com/issues/19122
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-03-07 14:30:38 +01:00
Ilya Dryomov 9afd30dbc8 libceph: fix crush_decode() for older maps
Older (shorter) CRUSH maps too need to be finalized.

Fixes: 66a0e2d579 ("crush: remove mutable part of CRUSH map")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-03-07 14:30:37 +01:00
Johannes Berg b61fbda180 mac80211: remove ieee80211_tx_rate_control.max_rate_idx
As promised a little more than 7 years ago, remove it now
since nothing uses it anymore.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-07 09:42:12 +01:00
Johannes Berg a6289d3fcc mac80211: ignore VHT membership selector when parsing rates
There isn't really much harm in not ignoring, since it doesn't
represent a valid rate, but since we already ignore the HT one
also ignore VHT. Also simplify the code a bit.

Fix a typo in the related comment (pointed out by Arend) while
at it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-07 09:41:59 +01:00
David Forster df789fe752 ipv6: Provide ipv6 version of "disable_policy" sysctl
This provides equivalent functionality to the existing ipv4
"disable_policy" systcl. ie. Allows IPsec processing to be skipped
on terminating packets on a per-interface basis.

Signed-off-by: David Forster <dforster@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-06 17:10:20 -08:00
Pablo Neira Ayuso c7a72e3fdb netfilter: nf_tables: add nft_set_lookup()
This new function consolidates set lookup via either name or ID by
introducing a new nft_set_lookup() function. Replace existing spots
where we can use this too.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06 18:23:23 +01:00
Liping Zhang c56e3956c1 netfilter: nf_tables: validate the expr explicitly after init successfully
When we want to validate the expr's dependency or hooks, we must do two
things to accomplish it. First, write a X_validate callback function
and point ->validate to it. Second, call X_validate in init routine.
This is very common, such as fib, nat, reject expr and so on ...

It is a little ugly, since we will call X_validate in the expr's init
routine, it's better to do it in nf_tables_newexpr. So we can avoid to
do this again and again. After doing this, the second step listed above
is not useful anymore, remove them now.

Patch was tested by nftables/tests/py/nft-test.py and
nftables/tests/shell/run-tests.sh.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06 18:22:12 +01:00
Colin Ian King 74664cf286 netfilter: arp_tables: remove redundant check on ret being non-zero
ret is initialized to zero and if it is set to non-zero in the
xt_entry_foreach loop then we exit via the out_free label. Hence
the check for ret being non-zero is redundant and can be removed.

Detected by CoverityScan, CID#1357132 ("Logically Dead Code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06 18:04:37 +01:00
Joe Perches 13fa745da2 netfilter: Use pr_cont where appropriate
Logging output was changed when simple printks without KERN_CONT
are now emitted on a new line and KERN_CONT is required to continue
lines so use pr_cont.

Miscellanea:

o realign arguments
o use print_hex_dump instead of a local variant

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06 18:00:48 +01:00
Laura Garcia Liebana 3206caded8 netfilter: nft_hash: support of symmetric hash
This patch provides symmetric hash support according to source
ip address and port, and destination ip address and port.

For this purpose, the __skb_get_hash_symmetric() is used to
identify the flow as it uses FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL
flag by default.

The new attribute NFTA_HASH_TYPE has been included to support
different types of hashing functions. Currently supported
NFT_HASH_JENKINS through jhash and NFT_HASH_SYM through symhash.

The main difference between both types are:
 - jhash requires an expression with sreg, symhash doesn't.
 - symhash supports modulus and offset, but not seed.

Examples:

 nft add rule ip nat prerouting ct mark set jhash ip saddr mod 2
 nft add rule ip nat prerouting ct mark set symhash mod 2

By default, jenkins hash will be used if no hash type is
provided for compatibility reasons.

Signed-off-by: Laura Garcia Liebana <laura.garcia@zevenet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06 17:57:42 +01:00
Laura Garcia Liebana 511040eea2 netfilter: nft_hash: rename nft_hash to nft_jhash
This patch renames the local nft_hash structure and functions
to nft_jhash in order to prepare the nft_hash module code to
add new hash functions.

Signed-off-by: Laura Garcia Liebana <laura.garcia@zevenet.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06 17:57:34 +01:00
Phil Sutter 3c1fece881 netfilter: nft_exthdr: Allow checking TCP option presence, too
Honor NFT_EXTHDR_F_PRESENT flag so we check if the TCP option is
present.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-06 17:52:56 +01:00
Vasanthakumar Thiagarajan 8976672736 cfg80211: Share Channel DFS state across wiphys of same DFS domain
Sharing DFS channel state across multiple wiphys (radios) could
be useful with multiple radios on the system. When one radio
completes CAC and markes the channel available another radio
can use this information and start beaconing without really doing
CAC.

Whenever there is a state change in dfs channel associated to
a particular wiphy the the same state change is propagated to
other wiphys having the same DFS reg domain configuration.
Also when a new wiphy is created the dfs channel state of
other existing wiphys of same DFS domain is copied.

Signed-off-by: Vasanthakumar Thiagarajan <vthiagar@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 13:54:20 +01:00
Vasanthakumar Thiagarajan 34373d12f3 cfg80211: Disallow moving out of operating DFS channel in non-ETSI
For non-ETSI regulatory domain, CAC result on DFS channel
may not be valid once moving out of that channel (as done
during remain-on-channel, scannning and off-channel tx).
Running CAC on an operating DFS channel after every off-channel
operation will only add complexity and disturb the current
link. Better do not allow any off-channel switch from a DFS
operating channel in non-ETSI domain.

Signed-off-by: Vasanthakumar Thiagarajan <vthiagar@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 13:54:19 +01:00
Vasanthakumar Thiagarajan b35a51c7dd cfg80211: Make pre-CAC results valid only for ETSI domain
DFS requirement for ETSI domain (section 4.7.1.4 in
ETSI EN 301 893 V1.8.1) is the only one which explicitly
states that once DFS channel is marked as available afer
the CAC, this channel will remain in available state even
moving to a different operating channel. But the same is
not explicitly stated in FCC DFS requirement. Also, Pre-CAC
requriements are not explicitly mentioned in FCC requirement.
Current implementation in keeping DFS channel in available
state is same as described in ETSI domain.

For non-ETSI DFS domain, this patch gives a grace period of 2 seconds
since the completion of successful CAC before moving the channel's
DFS state to 'usable' from 'available' state. The same grace period
is checked against the channel's dfs_state_entered timestamp while
deciding if a DFS channel is available for operation. There is a new
radar event, NL80211_RADAR_PRE_CAC_EXPIRED, reported when DFS channel
is moved from available to usable state after the grace period. Also
make sure the DFS channel state is reset to usable once the beaconing
operation on that channel is brought down (like stop_ap, leave_ibss
and leave_mesh) in non-ETSI domain.

Signed-off-by: Vasanthakumar Thiagarajan <vthiagar@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 13:54:15 +01:00
Ondřej Lysoněk f8f118ceaa mac80211: Use setup_timer instead of init_timer
Use setup_timer() and setup_deferrable_timer() to set the data and
function timer fields. It makes the code cleaner and will allow for
easier change of the timer struct internals.

Signed-off-by: Ondřej Lysoněk <ondrej.lysonek@seznam.cz>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <linux-wireless@vger.kernel.org>
Cc: <netdev@vger.kernel.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 13:23:05 +01:00
Manoharan, Rajkumar fe56c9c17b mac80211: fix mesh fail_avg check
Mesh failure average never be more than 100. Only in case of
fixed path, average will be more than threshold limit (95%).
With recent EWMA changes it may go upto 99 as it is scaled to
100. It make sense to return maximum metric when average is
greater than threshold limit.

Signed-off-by: Rajkumar Manoharan <rmanohar@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 09:21:46 +01:00
Johannes Berg 7f406cd16a mac80211: encode rate type (legacy, HT, VHT) with fewer bits
We don't really need three different bits for each, since the
types are mutually exclusive. Use just two bits for it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 09:21:45 +01:00
Johannes Berg 0c1eca4e2f cfg80211: refactor cfg80211_calculate_bitrate()
This function contains the HT calculations, which makes no
sense - split that out into a separate function. As a side
effect, this makes the 60G flag independent from HT_MCS so
remove the MCS one from wil6210 (also deleting a duplicate
assignment.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 09:21:44 +01:00
Johannes Berg a858958b68 mac80211: remove local pointer from rate_ctrl_ref
This pointer really isn't needed, so remove it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 09:21:43 +01:00
Johannes Berg 2fb51c3581 ieee80211: rename CCFS1/CCFS2 to CCFS0/CCFS1
This matches the spec, and otherwise things are really
confusing with the next patch adding CCFS2.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 09:21:43 +01:00
Arkadiusz Miskiewicz 68506e9af1 mac80211: Print text for disassociation reason
When disassociation happens only numeric reason is printed
in ieee80211_rx_mgmt_disassoc(). Add text variant, too.

Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 09:21:42 +01:00
Johannes Berg d4f2997867 cfg80211: combine two nested ifs into a single condition
Combine two instances of having two nested if statements
into a single one with a combined condition to reduce the
indentation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 09:21:40 +01:00
Andrew Zaborowski 2c3c5f8c0c mac80211: Add set_cqm_rssi_range_config
Support .set_cqm_rssi_range_config if the beacons are available for
processing in mac80211.  There's no reason that this couldn't be
offloaded by mac80211-based drivers but there's no driver method for
that added in this patch.

Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 09:21:38 +01:00
Andrew Zaborowski 4a4b816950 cfg80211: Accept multiple RSSI thresholds for CQM
Change the SET CQM command's RSSI threshold attribute to accept any
number of thresholds as a sorted array.  The API should be backwards
compatible so that if one s32 threshold value is passed, the old
mechanism is enabled.  The netlink event generated is the same in both
cases.

cfg80211 handles an arbitrary number of RSSI thresholds but drivers have
to provide a method (set_cqm_rssi_range_config) that configures a range
set by a high and a low value.  Drivers have to call back when the RSSI
goes out of that range and there's no additional event for each time the
range is reconfigured as there was with the current one-threshold API.

This method doesn't have a hysteresis parameter because there's no
benefit to the cfg80211 code from having the hysteresis be handled by
hardware/driver in terms of the number of wakeups.  At the same time
it would likely be less consistent between drivers if offloaded or
done in the drivers.

Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 09:21:38 +01:00
Manoharan, Rajkumar 3eb0928fc3 mac80211: use DECLARE_EWMA for mesh_fail_avg
As moving average is not considering fractional part, it will
get stuck at the same level after certain state. For example,
with current values, it can get stuck at 96. Fortunately the
current threshold 95%, but if it were increased to 96 or more
mesh paths would never be deactivated. Fix failure average
movement by using EWMA helpers, which does take into account
fractional parts.

Signed-off-by: Rajkumar Manoharan <rmanohar@qca.qualcomm.com>
[johannes: pick a larger EWMA factor for more precision with
 the limited range that we will feed into it, adjust to new API]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-06 09:21:21 +01:00
Linus Torvalds 8d70eeb84a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix double-free in batman-adv, from Sven Eckelmann.

 2) Fix packet stats for fast-RX path, from Joannes Berg.

 3) Netfilter's ip_route_me_harder() doesn't handle request sockets
    properly, fix from Florian Westphal.

 4) Fix sendmsg deadlock in rxrpc, from David Howells.

 5) Add missing RCU locking to transport hashtable scan, from Xin Long.

 6) Fix potential packet loss in mlxsw driver, from Ido Schimmel.

 7) Fix race in NAPI handling between poll handlers and busy polling,
    from Eric Dumazet.

 8) TX path in vxlan and geneve need proper RCU locking, from Jakub
    Kicinski.

 9) SYN processing in DCCP and TCP need to disable BH, from Eric
    Dumazet.

10) Properly handle net_enable_timestamp() being invoked from IRQ
    context, also from Eric Dumazet.

11) Fix crash on device-tree systems in xgene driver, from Alban Bedel.

12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de
    Melo.

13) Fix use-after-free in netvsc driver, from Dexuan Cui.

14) Fix max MTU setting in bonding driver, from WANG Cong.

15) xen-netback hash table can be allocated from softirq context, so use
    GFP_ATOMIC. From Anoob Soman.

16) Fix MAC address change bug in bgmac driver, from Hari Vyas.

17) strparser needs to destroy strp_wq on module exit, from WANG Cong.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
  strparser: destroy workqueue on module exit
  sfc: fix IPID endianness in TSOv2
  sfc: avoid max() in array size
  rds: remove unnecessary returned value check
  rxrpc: Fix potential NULL-pointer exception
  nfp: correct DMA direction in XDP DMA sync
  nfp: don't tell FW about the reserved buffer space
  net: ethernet: bgmac: mac address change bug
  net: ethernet: bgmac: init sequence bug
  xen-netback: don't vfree() queues under spinlock
  xen-netback: keep a local pointer for vif in backend_disconnect()
  netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails
  netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups
  netfilter: nf_conntrack_sip: fix wrong memory initialisation
  can: flexcan: fix typo in comment
  can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer
  can: gs_usb: fix coding style
  can: gs_usb: Don't use stack memory for USB transfers
  ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines
  ixgbe: update the rss key on h/w, when ethtool ask for it
  ...
2017-03-04 17:31:39 -08:00
Sven Eckelmann 1a9070ec91 batman-adv: Initialize gw sel_class via batadv_algo
The gateway selection class variable is shared between different algorithm
versions. But the interpretation of the content is algorithm specific. The
initialization is therefore also algorithm specific.

But this was implemented incorrectly and the initialization for BATMAN_V
always overwrote the value previously written for BATMAN_IV. This could
only be avoided when BATMAN_V was disabled during compile time.

Using a special batadv_algo hook for this initialization avoids this
problem.

Fixes: 50164d8f50 ("batman-adv: B.A.T.M.A.N. V - implement GW selection logic")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-04 17:32:06 +01:00
Sven Eckelmann 1c2bcc766b batman-adv: Keep fragments equally sized
The batman-adv fragmentation packets have the design problem that they
cannot be refragmented and cannot handle padding by the underlying link.
The latter often leads to problems when networks are incorrectly configured
and don't use a common MTU.

The sender could for example fragment a 1271 byte frame (plus external
ethernet header (14) and batadv unicast header (10)) to fit in a 1280 bytes
large MTU of the underlying link (max. 1294 byte frames). This would create
a 1294 bytes large frame (fragment 2) and a 55 bytes large frame
(fragment 1). The extra 54 bytes are the fragment header (20) added to each
fragment and the external ethernet header (14) for the second fragment.

Let us assume that the next hop is then not able to transport 1294 bytes to
its next hop. The 1294 byte large frame will be dropped but the 55 bytes
large fragment will still be forwarded to its destination.

Or let us assume that the underlying hardware requires that each frame has
a minimum size (e.g. 60 bytes). Then it will pad the 55 bytes frame to 60
bytes. The receiver of the 60 bytes frame will no longer be able to
correctly assemble the two frames together because it is not aware that 5
bytes of the 60 bytes frame are padding and don't belong to the reassembled
frame.

This can partly be avoided by splitting frames more equally. In this
example, the 675 and 674 bytes large fragment frames could both potentially
reach its destination without being too large or too small.

Reported-by: Martin Weinelt <martin@darmstadt.freifunk.net>
Fixes: ee75ed8887 ("batman-adv: Fragment and send skbs larger than mtu")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-03-04 17:31:57 +01:00
Linus Torvalds 0710f3ff91 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc final vfs updates from Al Viro:
 "A few unrelated patches that got beating in -next.

  Everything else will have to go into the next window ;-/"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  hfs: fix hfs_readdir()
  selftest for default_file_splice_read() infoleak
  9p: constify ->d_name handling
2017-03-03 21:44:35 -08:00
WANG Cong f78ef7cd9a strparser: destroy workqueue on module exit
Fixes: 43a0c6751a ("strparser: Stream parser for messages")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-03 20:43:26 -08:00
David S. Miller 20b83643ab Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Missing check for full sock in ip_route_me_harder(), from
   Florian Westphal.

2) Incorrect sip helper structure initilization that breaks it when
   several ports are used, from Christophe Leroy.

3) Fix incorrect assumption when looking up for matching with adjacent
   intervals in the nft_set_rbtree.

4) Fix broken netlink event error reporting in nf_tables that results
   in misleading ESRCH errors propagated to userspace listeners.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-03 20:40:06 -08:00
Linus Torvalds 1827adb11a Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull sched.h split-up from Ingo Molnar:
 "The point of these changes is to significantly reduce the
  <linux/sched.h> header footprint, to speed up the kernel build and to
  have a cleaner header structure.

  After these changes the new <linux/sched.h>'s typical preprocessed
  size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
  lines), which is around 40% faster to build on typical configs.

  Not much changed from the last version (-v2) posted three weeks ago: I
  eliminated quirks, backmerged fixes plus I rebased it to an upstream
  SHA1 from yesterday that includes most changes queued up in -next plus
  all sched.h changes that were pending from Andrew.

  I've re-tested the series both on x86 and on cross-arch defconfigs,
  and did a bisectability test at a number of random points.

  I tried to test as many build configurations as possible, but some
  build breakage is probably still left - but it should be mostly
  limited to architectures that have no cross-compiler binaries
  available on kernel.org, and non-default configurations"

* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
  sched/headers: Clean up <linux/sched.h>
  sched/headers: Remove #ifdefs from <linux/sched.h>
  sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
  sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
  sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
  sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
  sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
  sched/core: Remove unused prefetch_stack()
  sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
  sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
  sched/headers: Remove <linux/signal.h> from <linux/sched.h>
  sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
  sched/headers: Remove the runqueue_is_locked() prototype
  sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
  sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
  sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
  ...
2017-03-03 10:16:38 -08:00
Zhu Yanjun a8d63a53b3 rds: remove unnecessary returned value check
The function rds_trans_register always returns 0. As such, it is not
necessary to check the returned value.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-03 09:51:22 -08:00
David Howells 37411cad63 rxrpc: Fix potential NULL-pointer exception
Fix a potential NULL-pointer exception in rxrpc_do_sendmsg().  The call
state check that I added should have gone into the else-body of the
if-statement where we actually have a call to check.

Found by CoverityScan CID#1414316 ("Dereference after null check").

Fixes: 540b1c48c3 ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-03 09:48:00 -08:00
Pablo Neira Ayuso 25e94a997b netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails
The underlying nlmsg_multicast() already sets sk->sk_err for us to
notify socket overruns, so we should not do anything with this return
value. So we just call nfnetlink_set_err() if:

1) We fail to allocate the netlink message.

or

2) We don't have enough space in the netlink message to place attributes,
   which means that we likely need to allocate a larger message.

Before this patch, the internal ESRCH netlink error code was propagated
to userspace, which is quite misleading. Netlink semantics mandate that
listeners just hit ENOBUFS if the socket buffer overruns.

Reported-by: Alexander Alemayhu <alexander@alemayhu.com>
Tested-by: Alexander Alemayhu <alexander@alemayhu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-03 13:48:34 +01:00
Pablo Neira Ayuso f9121355eb netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups
In case of adjacent ranges, we may indeed see either the high part of
the range in first place or the low part of it. Remove this incorrect
assumption, let's make sure we annotate the low part of the interval in
case of we have adjacent interva intervals so we hit a matching in
lookups.

Reported-by: Simon Hanisch <hanisch@wh2.tu-dresden.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-03 13:48:32 +01:00
Christophe Leroy da2f27e9e6 netfilter: nf_conntrack_sip: fix wrong memory initialisation
In commit 82de0be686 ("netfilter: Add helper array
register/unregister functions"),
struct nf_conntrack_helper sip[MAX_PORTS][4] was changed to
sip[MAX_PORTS * 4], so the memory init should have been changed to
memset(&sip[4 * i], 0, 4 * sizeof(sip[i]));

But as the sip[] table is allocated in the BSS, it is already set to 0

Fixes: 82de0be686 ("netfilter: Add helper array register/unregister functions")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-03 13:48:31 +01:00
Ingo Molnar c3edc4010e sched/headers: Move task_struct::signal and task_struct::sighand types and accessors into <linux/sched/signal.h>
task_struct::signal and task_struct::sighand are pointers, which would normally make it
straightforward to not define those types in sched.h.

That is not so, because the types are accompanied by a myriad of APIs (macros and inline
functions) that dereference them.

Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>.

With this change sched.h does not know about 'struct signal' and 'struct sighand' anymore,
trying to put accessors into sched.h as a test fails the following way:

  ./include/linux/sched.h: In function ‘test_signal_types’:
  ./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’
                    ^

This reduces the size and complexity of sched.h significantly.

Update all headers and .c code that relied on getting the signal handling
functionality from <linux/sched.h> to include <linux/sched/signal.h>.

The list of affected files in the preparatory patch was partly generated by
grepping for the APIs, and partly by doing coverage build testing, both
all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of
cross-architecture builds.

Nevertheless some (trivial) build breakage is still expected related to rare
Kconfig combinations and in-flight patches to various kernel code, but most
of it should be handled by this patch.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:43:37 +01:00
Linus Torvalds 69fd110eb6 Merge branch 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs sendmsg updates from Al Viro:
 "More sendmsg work.

  This is a fairly separate isolated stuff (there's a continuation
  around lustre, but that one was too late to soak in -next), thus the
  separate pull request"

* 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ncpfs: switch to sock_sendmsg()
  ncpfs: don't mess with manually advancing iovec on send
  ncpfs: sendmsg does *not* bugger iovec these days
  ceph_tcp_sendpage(): use ITER_BVEC sendmsg
  afs_send_pages(): use ITER_BVEC
  rds: remove dead code
  ceph: switch to sock_recvmsg()
  usbip_recv(): switch to sock_recvmsg()
  iscsi_target: deal with short writes on the tx side
  [nbd] pass iov_iter to nbd_xmit()
  [nbd] switch sock_xmit() to sock_{send,recv}msg()
  [drbd] use sock_sendmsg()
2017-03-02 15:16:38 -08:00
David S. Miller 35576ee17f This contains just the average.h change in order to get it
into the tree before adding new users through -next trees.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAli3yxoACgkQa3t4Rpy0
 AB13Eg//Sru1b80k889hCYM14hViPAlGD2z2K9dLP5XujKuzLP1UYlB8xk0usqWm
 ZyPOUOt50OIgdE4jSPT/P79teArZLOQ/nUpuR/I8JgVLo3TiuOSGD8SfN0dWXDkc
 2/ywYiRed9PwZfRpTvrgyyB3LPp1gvvOmwpLWVf2ndinfL4SHN31V1tBuIJxYAhm
 sFhxhHvy+inpIdfLpZaF/7CydT8oG/k+G3xiR8C1xUCTYEIztVq4ynMkMvA9SEch
 dx7MHLMFyl/mHXGW1JSE2nQ97tdGyqFGBrY4wHHIW105Q10Qta0p9IutMqfzokE1
 Shkes/JFUaJqU+QRXGkomA+BjcT+mvHODwY6rt73o6lEV24EU1FCI4pR+tPgGCL2
 ub893LOIot5xORL2KZwgliMnreGYfdkRPKxAZPgruTm7TENCLQ8c/5R5ZgPhrePT
 +pAtqGAOvSX8WkT2JpWdxcvl16dsREBDIlDFPx8MkCD05+6hDjWsm69RW50d2sNM
 oVpSJG87ZHw6nV4L6i+7lHQSPsZDOS6Y+IpfY1MZmoAI/v+6eFPuoiM/SRLCmDKS
 EX22iw7xe6gOfVMFAwVqxycvF0g5LoO3Bx7oGW0k+o+3X3ZgreuPGGN0C+GKYa7H
 Km8axU1ZdvyB23pY3dlbuqVQFFrsIUlIhcGaWk4tjU00C5F32sg=
 =Bge/
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2017-03-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
This contains just the average.h change in order to get it
into the tree before adding new users through -next trees.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02 14:39:17 -08:00
WANG Cong 9d6acb3bc9 ipv6: ignore null_entry in inet6_rtm_getroute() too
Like commit 1f17e2f2c8 ("net: ipv6: ignore null_entry on route dumps"),
we need to ignore null entry in inet6_rtm_getroute() too.

Return -ENETUNREACH here to sync with IPv4 behavior, as suggested by David.

Fixes: a1a22c1206 ("net: ipv6: Keep nexthop of multipath route on admin down")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02 14:35:31 -08:00
Wei Wang 7db92362d2 tcp: fix potential double free issue for fastopen_req
tp->fastopen_req could potentially be double freed if a malicious
user does the following:
1. Enable TCP_FASTOPEN_CONNECT sockopt and do a connect() on the socket.
2. Call connect() with AF_UNSPEC to disconnect the socket.
3. Make this socket a listening socket by calling listen().
4. Accept incoming connections and generate child sockets. All child
   sockets will get a copy of the pointer of fastopen_req.
5. Call close() on all sockets. fastopen_req will get freed multiple
   times.

Fixes: 19f6d3f3c8 ("net/tcp-fastopen: Add new API support")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02 14:05:41 -08:00
Linus Torvalds 54d7989f47 virtio, vhost: optimizations, fixes
Looks like a quiet cycle for vhost/virtio, just a couple of minor
 tweaks. Most notable is automatic interrupt affinity for blk and scsi.
 Hopefully other devices are not far behind.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYt1rRAAoJECgfDbjSjVRpEZsIALSHevdXWtRHBZUb0ZkqPLQb
 /x2Vn49CcALS1p7iSuP9L027MPeaLKyr0NBT9hptBChp/4b9lnZWyyAo6vYQrzfx
 Ia/hLBYsK4ml6lEwbyfLwqkF2cmYCrZhBSVAILifn84lTPoN7CT0PlYDfA+OCaNR
 geo75qF8KR+AUO0aqchwMRL3RV3OxZKxQr2AR6LttCuhiBgnV3Xqxffg/M3x6ONM
 0ffFFdodm6slem3hIEiGUMwKj4NKQhcOleV+y0fVBzWfLQG9210pZbQyRBRikIL0
 7IsaarpaUr7OrLAZFMGF6nJnyRAaRrt6WknTHZkyvyggrePrGcmGgPm4jrODwY4=
 =2zwv
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull vhost updates from Michael Tsirkin:
 "virtio, vhost: optimizations, fixes

  Looks like a quiet cycle for vhost/virtio, just a couple of minor
  tweaks. Most notable is automatic interrupt affinity for blk and scsi.
  Hopefully other devices are not far behind"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio-console: avoid DMA from stack
  vhost: introduce O(1) vq metadata cache
  virtio_scsi: use virtio IRQ affinity
  virtio_blk: use virtio IRQ affinity
  blk-mq: provide a default queue mapping for virtio device
  virtio: provide a method to get the IRQ affinity mask for a virtqueue
  virtio: allow drivers to request IRQ affinity when creating VQs
  virtio_pci: simplify MSI-X setup
  virtio_pci: don't duplicate the msix_enable flag in struct pci_dev
  virtio_pci: use shared interrupts for virtqueues
  virtio_pci: remove struct virtio_pci_vq_info
  vhost: try avoiding avail index access when getting descriptor
  virtio_mmio: expose header to userspace
2017-03-02 13:53:13 -08:00
Linus Torvalds 0f221a3102 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem fixes from James Morris:
 "Two fixes for the security subsystem:

   - keys: split both rcu_dereference_key() and user_key_payload() into
     versions which can be called with or without holding the key
     semaphore.

   - SELinux: fix Android init(8) breakage due to new cgroup security
     labeling support when using older policy"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  selinux: wrap cgroup seclabel support with its own policy capability
  KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload()
2017-03-02 13:22:18 -08:00
Arnaldo Carvalho de Melo 94352d4509 net: Introduce sk_clone_lock() error path routine
When handling problems in cloning a socket with the sk_clone_locked()
function we need to perform several steps that were open coded in it and
its callers, so introduce a routine to avoid this duplication:
sk_free_unlock_clone().

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/net-ui6laqkotycunhtmqryl9bfx@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02 13:19:33 -08:00
Arnaldo Carvalho de Melo d5afb6f9b6 dccp: Unlock sock before calling sk_free()
The code where sk_clone() came from created a new socket and locked it,
but then, on the error path didn't unlock it.

This problem stayed there for a long while, till b0691c8ee7 ("net:
Unlock sock before calling sk_free()") fixed it, but unfortunately the
callers of sk_clone() (now sk_clone_locked()) were not audited and the
one in dccp_create_openreq_child() remained.

Now in the age of the syskaller fuzzer, this was finally uncovered, as
reported by Dmitry:

 ---- 8< ----

I've got the following report while running syzkaller fuzzer on
86292b33d4 ("Merge branch 'akpm' (patches from Andrew)")

  [ BUG: held lock freed! ]
  4.10.0+ #234 Not tainted
  -------------------------
  syz-executor6/6898 is freeing memory
  ffff88006286cac0-ffff88006286d3b7, with a lock still held there!
   (slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>] spin_lock
  include/linux/spinlock.h:299 [inline]
   (slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>]
  sk_clone_lock+0x3d9/0x12c0 net/core/sock.c:1504
  5 locks held by syz-executor6/6898:
   #0:  (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff839a34b4>] lock_sock
  include/net/sock.h:1460 [inline]
   #0:  (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff839a34b4>]
  inet_stream_connect+0x44/0xa0 net/ipv4/af_inet.c:681
   #1:  (rcu_read_lock){......}, at: [<ffffffff83bc1c2a>]
  inet6_csk_xmit+0x12a/0x5d0 net/ipv6/inet6_connection_sock.c:126
   #2:  (rcu_read_lock){......}, at: [<ffffffff8369b424>] __skb_unlink
  include/linux/skbuff.h:1767 [inline]
   #2:  (rcu_read_lock){......}, at: [<ffffffff8369b424>] __skb_dequeue
  include/linux/skbuff.h:1783 [inline]
   #2:  (rcu_read_lock){......}, at: [<ffffffff8369b424>]
  process_backlog+0x264/0x730 net/core/dev.c:4835
   #3:  (rcu_read_lock){......}, at: [<ffffffff83aeb5c0>]
  ip6_input_finish+0x0/0x1700 net/ipv6/ip6_input.c:59
   #4:  (slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>] spin_lock
  include/linux/spinlock.h:299 [inline]
   #4:  (slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>]
  sk_clone_lock+0x3d9/0x12c0 net/core/sock.c:1504

Fix it just like was done by b0691c8ee7 ("net: Unlock sock before calling
sk_free()").

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170301153510.GE15145@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02 13:19:33 -08:00
David S. Miller 6ab2b999e7 Here are two batman-adv bugfixes:
- fix a potential double free when fragment merges fail,
    by Sven Eckelmann
 
  - fix failing tranmission of the 16th (last) fragment if that exists,
    by Linus Lüssing
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAli27gkWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeofqQD/0Z/wUItUuS1pODZdcHLrhvc9q0
 0O2L0Uujzm+IHRXkv+3ZatedM3vnqq03w2WIdOf3BAPbkvY+zXO4TRQziE8PKy1p
 aMdC4A/jeGg1c7+PSx+mxhUbRsdP8cdkO3A5AgQxYjbXBlH59595thM8p6CUnWZ0
 M4YPaI7dd3XXWYvfaQ1fBcqwhy6z9uiisv5HF99jxkaFEM2ApK8LOhbmsfJbS13M
 aPgpq/Hjde/RrDGNElmmkWYWdsGAJMnHHVCbX0e3yehJdDZeXciak4BGO0Y2HUHX
 y7M8zjmYIkha2AnmO/3rl1PdOuX/5i43Haf31ojbXx4wK4RbPG6n2NoIngRhND3E
 PRP5t3pzZq/N4nAd9Aj+NSiJadxcrnz26sX0stmVIkbAnEUvsG1yNYUP0squL0bn
 G4EjUafyKonVbayMA90lFKvXujrm3rr0q7AcgpcuJJWWRMe0oHEjbaaIz2jh722S
 S0yeoKbmaXa2Skxfe68Ptajb+ODSpsL758vRhXS/ZTFWV/3iE8wPRRKil/mkeyL/
 pqFF+qxDjRI9S/Hku1A8cegjeBAfBtCxV7A35RP1MCNjv2iltGtBNLLUqiJ9i/C8
 REvrAIgaIIsZb01yi2mLVCNg7PEg/0lD8sulqH5Dkv3amSBZr0EsmulBZMoDdgwS
 7YLi2mqa5eXfGheNOQ==
 =xy9Y
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20170301' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are two batman-adv bugfixes:

 - fix a potential double free when fragment merges fail,
   by Sven Eckelmann

 - fix failing tranmission of the 16th (last) fragment if that exists,
   by Linus Lüssing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02 13:16:08 -08:00
Peter Downs f1304f7ba3 openvswitch: actions: fixed a brace coding style warning
Fixed a brace coding style warning reported by checkpatch.pl

Signed-off-by: Peter Downs <padowns@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02 13:14:44 -08:00
WANG Cong e3330039ea ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()
Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
-> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
because ip6_null_entry is returned in this path since ip6_null_entry
is kinda default for a ipv6 route table root node. Quote from
David Ahern:

 ip6_null_entry is the root of all ipv6 fib tables making it integrated
 into the table ...

We should ignore any attempt of trying to delete it, like we do in
__ip6_del_rt() path and several others.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Fixes: 0ae8133586 ("net: ipv6: Allow shorthand delete of all nexthops in multipath route")
Cc: David Ahern <dsa@cumulusnetworks.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02 12:43:47 -08:00
Ingo Molnar f719ff9bce sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h>
But first update the code that uses these facilities with the
new header.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:38 +01:00
Ingo Molnar b2d0910310 sched/headers: Prepare to use <linux/rcuupdate.h> instead of <linux/rculist.h> in <linux/sched.h>
We don't actually need the full rculist.h header in sched.h anymore,
we will be able to include the smaller rcupdate.h header instead.

But first update code that relied on the implicit header inclusion.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:38 +01:00
Ingo Molnar 5b3cc15aff sched/headers: Prepare to move the memalloc_noio_*() APIs to <linux/sched/mm.h>
Update the .c files that depend on these APIs.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:33 +01:00
Ingo Molnar 174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Ingo Molnar 5b825c3af1 sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:31 +01:00
Ingo Molnar 8703e8a465 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h>
We are going to split <linux/sched/user.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/user.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Ingo Molnar 3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Ingo Molnar 4f17722c72 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/loadavg.h>
We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Ingo Molnar e601757102 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.

Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Johannes Berg eb1e011a14 average: change to declare precision, not factor
Declaring the factor is counter-intuitive, and people are prone
to using small(-ish) values even when that makes no sense.

Change the DECLARE_EWMA() macro to take the fractional precision,
in bits, rather than a factor, and update all users.

While at it, add some more documentation.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-02 08:32:46 +01:00
Eric Dumazet 48cac18ecf ipv6: orphan skbs in reassembly unit
Andrey reported a use-after-free in IPv6 stack.

Issue here is that we free the socket while it still has skb
in TX path and in some queues.

It happens here because IPv6 reassembly unit messes skb->truesize,
breaking skb_set_owner_w() badly.

We fixed a similar issue for IPV4 in commit 8282f27449 ("inet: frag:
Always orphan skbs inside ip_defrag()")
Acked-by: Joe Stringer <joe@ovn.org>

==================================================================
BUG: KASAN: use-after-free in sock_wfree+0x118/0x120
Read of size 8 at addr ffff880062da0060 by task a.out/4140

page:ffffea00018b6800 count:1 mapcount:0 mapping:          (null)
index:0x0 compound_mapcount: 0
flags: 0x100000000008100(slab|head)
raw: 0100000000008100 0000000000000000 0000000000000000 0000000180130013
raw: dead000000000100 dead000000000200 ffff88006741f140 0000000000000000
page dumped because: kasan: bad access detected

CPU: 0 PID: 4140 Comm: a.out Not tainted 4.10.0-rc3+ #59
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:15
 dump_stack+0x292/0x398 lib/dump_stack.c:51
 describe_address mm/kasan/report.c:262
 kasan_report_error+0x121/0x560 mm/kasan/report.c:370
 kasan_report mm/kasan/report.c:392
 __asan_report_load8_noabort+0x3e/0x40 mm/kasan/report.c:413
 sock_flag ./arch/x86/include/asm/bitops.h:324
 sock_wfree+0x118/0x120 net/core/sock.c:1631
 skb_release_head_state+0xfc/0x250 net/core/skbuff.c:655
 skb_release_all+0x15/0x60 net/core/skbuff.c:668
 __kfree_skb+0x15/0x20 net/core/skbuff.c:684
 kfree_skb+0x16e/0x4e0 net/core/skbuff.c:705
 inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304
 inet_frag_put ./include/net/inet_frag.h:133
 nf_ct_frag6_gather+0x1125/0x38b0 net/ipv6/netfilter/nf_conntrack_reasm.c:617
 ipv6_defrag+0x21b/0x350 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
 nf_hook_entry_hookfn ./include/linux/netfilter.h:102
 nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310
 nf_hook ./include/linux/netfilter.h:212
 __ip6_local_out+0x52c/0xaf0 net/ipv6/output_core.c:160
 ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170
 ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1722
 ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1742
 rawv6_push_pending_frames net/ipv6/raw.c:613
 rawv6_sendmsg+0x2cff/0x4130 net/ipv6/raw.c:927
 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
 sock_sendmsg_nosec net/socket.c:635
 sock_sendmsg+0xca/0x110 net/socket.c:645
 sock_write_iter+0x326/0x620 net/socket.c:848
 new_sync_write fs/read_write.c:499
 __vfs_write+0x483/0x760 fs/read_write.c:512
 vfs_write+0x187/0x530 fs/read_write.c:560
 SYSC_write fs/read_write.c:607
 SyS_write+0xfb/0x230 fs/read_write.c:599
 entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:203
RIP: 0033:0x7ff26e6f5b79
RSP: 002b:00007ff268e0ed98 EFLAGS: 00000206 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007ff268e0f9c0 RCX: 00007ff26e6f5b79
RDX: 0000000000000010 RSI: 0000000020f50fe1 RDI: 0000000000000003
RBP: 00007ff26ebc1220 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
R13: 00007ff268e0f9c0 R14: 00007ff26efec040 R15: 0000000000000003

The buggy address belongs to the object at ffff880062da0000
 which belongs to the cache RAWv6 of size 1504
The buggy address ffff880062da0060 is located 96 bytes inside
 of 1504-byte region [ffff880062da0000, ffff880062da05e0)

Freed by task 4113:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514
 kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
 slab_free_hook mm/slub.c:1352
 slab_free_freelist_hook mm/slub.c:1374
 slab_free mm/slub.c:2951
 kmem_cache_free+0xb2/0x2c0 mm/slub.c:2973
 sk_prot_free net/core/sock.c:1377
 __sk_destruct+0x49c/0x6e0 net/core/sock.c:1452
 sk_destruct+0x47/0x80 net/core/sock.c:1460
 __sk_free+0x57/0x230 net/core/sock.c:1468
 sk_free+0x23/0x30 net/core/sock.c:1479
 sock_put ./include/net/sock.h:1638
 sk_common_release+0x31e/0x4e0 net/core/sock.c:2782
 rawv6_close+0x54/0x80 net/ipv6/raw.c:1214
 inet_release+0xed/0x1c0 net/ipv4/af_inet.c:425
 inet6_release+0x50/0x70 net/ipv6/af_inet6.c:431
 sock_release+0x8d/0x1e0 net/socket.c:599
 sock_close+0x16/0x20 net/socket.c:1063
 __fput+0x332/0x7f0 fs/file_table.c:208
 ____fput+0x15/0x20 fs/file_table.c:244
 task_work_run+0x19b/0x270 kernel/task_work.c:116
 exit_task_work ./include/linux/task_work.h:21
 do_exit+0x186b/0x2800 kernel/exit.c:839
 do_group_exit+0x149/0x420 kernel/exit.c:943
 SYSC_exit_group kernel/exit.c:954
 SyS_exit_group+0x1d/0x20 kernel/exit.c:952
 entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:203

Allocated by task 4115:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:544
 slab_post_alloc_hook mm/slab.h:432
 slab_alloc_node mm/slub.c:2708
 slab_alloc mm/slub.c:2716
 kmem_cache_alloc+0x1af/0x250 mm/slub.c:2721
 sk_prot_alloc+0x65/0x2a0 net/core/sock.c:1334
 sk_alloc+0x105/0x1010 net/core/sock.c:1396
 inet6_create+0x44d/0x1150 net/ipv6/af_inet6.c:183
 __sock_create+0x4f6/0x880 net/socket.c:1199
 sock_create net/socket.c:1239
 SYSC_socket net/socket.c:1269
 SyS_socket+0xf9/0x230 net/socket.c:1249
 entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:203

Memory state around the buggy address:
 ffff880062d9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff880062d9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff880062da0000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                       ^
 ffff880062da0080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff880062da0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 20:55:57 -08:00
Eric Dumazet 13baa00ad0 net: net_enable_timestamp() can be called from irq contexts
It is now very clear that silly TCP listeners might play with
enabling/disabling timestamping while new children are added
to their accept queue.

Meaning net_enable_timestamp() can be called from BH context
while current state of the static key is not enabled.

Lets play safe and allow all contexts.

The work queue is scheduled only under the problematic cases,
which are the static key enable/disable transition, to not slow down
critical paths.

This extends and improves what we did in commit 5fa8bbda38 ("net: use
a work queue to defer net_disable_timestamp() work")

Fixes: b90e5794c5 ("net: dont call jump_label_dec from irq context")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 20:55:57 -08:00
Alexander Potapenko 540e2894f7 net: don't call strlen() on the user buffer in packet_bind_spkt()
KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
uninitialized memory in packet_bind_spkt():
Acked-by: Eric Dumazet <edumazet@google.com>

==================================================================
BUG: KMSAN: use of unitialized memory
CPU: 0 PID: 1074 Comm: packet Not tainted 4.8.0-rc6+ #1891
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
 0000000000000000 ffff88006b6dfc08 ffffffff82559ae8 ffff88006b6dfb48
 ffffffff818a7c91 ffffffff85b9c870 0000000000000092 ffffffff85b9c550
 0000000000000000 0000000000000092 00000000ec400911 0000000000000002
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff82559ae8>] dump_stack+0x238/0x290 lib/dump_stack.c:51
 [<ffffffff818a6626>] kmsan_report+0x276/0x2e0 mm/kmsan/kmsan.c:1003
 [<ffffffff818a783b>] __msan_warning+0x5b/0xb0
mm/kmsan/kmsan_instr.c:424
 [<     inline     >] strlen lib/string.c:484
 [<ffffffff8259b58d>] strlcpy+0x9d/0x200 lib/string.c:144
 [<ffffffff84b2eca4>] packet_bind_spkt+0x144/0x230
net/packet/af_packet.c:3132
 [<ffffffff84242e4d>] SYSC_bind+0x40d/0x5f0 net/socket.c:1370
 [<ffffffff84242a22>] SyS_bind+0x82/0xa0 net/socket.c:1356
 [<ffffffff8515991b>] entry_SYSCALL_64_fastpath+0x13/0x8f
arch/x86/entry/entry_64.o:?
chained origin: 00000000eba00911
 [<ffffffff810bb787>] save_stack_trace+0x27/0x50
arch/x86/kernel/stacktrace.c:67
 [<     inline     >] kmsan_save_stack_with_flags mm/kmsan/kmsan.c:322
 [<     inline     >] kmsan_save_stack mm/kmsan/kmsan.c:334
 [<ffffffff818a59f8>] kmsan_internal_chain_origin+0x118/0x1e0
mm/kmsan/kmsan.c:527
 [<ffffffff818a7773>] __msan_set_alloca_origin4+0xc3/0x130
mm/kmsan/kmsan_instr.c:380
 [<ffffffff84242b69>] SYSC_bind+0x129/0x5f0 net/socket.c:1356
 [<ffffffff84242a22>] SyS_bind+0x82/0xa0 net/socket.c:1356
 [<ffffffff8515991b>] entry_SYSCALL_64_fastpath+0x13/0x8f
arch/x86/entry/entry_64.o:?
origin description: ----address@SYSC_bind (origin=00000000eb400911)
==================================================================
(the line numbers are relative to 4.8-rc6, but the bug persists
upstream)

, when I run the following program as root:

=====================================
 #include <string.h>
 #include <sys/socket.h>
 #include <netpacket/packet.h>
 #include <net/ethernet.h>

 int main() {
   struct sockaddr addr;
   memset(&addr, 0xff, sizeof(addr));
   addr.sa_family = AF_PACKET;
   int fd = socket(PF_PACKET, SOCK_PACKET, htons(ETH_P_ALL));
   bind(fd, &addr, sizeof(addr));
   return 0;
 }
=====================================

This happens because addr.sa_data copied from the userspace is not
zero-terminated, and copying it with strlcpy() in packet_bind_spkt()
results in calling strlen() on the kernel copy of that non-terminated
buffer.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 20:55:57 -08:00
Mike Manning 8953de2f02 net: bridge: allow IPv6 when multicast flood is disabled
Even with multicast flooding turned off, IPv6 ND should still work so
that IPv6 connectivity is provided. Allow this by continuing to flood
multicast traffic originated by us.

Fixes: b6cb5ac833 ("net: bridge: add per-port multicast flood flag")
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Mike Manning <mmanning@brocade.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 20:55:57 -08:00
Linus Torvalds 8f03cf50bc NFS client updates for Linux 4.11
Stable bugfixes:
 - NFSv4: Fix memory and state leak in _nfs4_open_and_get_state
 - xprtrdma: Fix Read chunk padding
 - xprtrdma: Per-connection pad optimization
 - xprtrdma: Disable pad optimization by default
 - xprtrdma: Reduce required number of send SGEs
 - nlm: Ensure callback code also checks that the files match
 - pNFS/flexfiles: If the layout is invalid, it must be updated before retrying
 - NFSv4: Fix reboot recovery in copy offload
 - Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE"
 - NFSv4: fix getacl head length estimation
 - NFSv4: fix getacl ERANGE for sum ACL buffer sizes
 
 Features:
 - Add and use dprintk_cont macros
 - Various cleanups to NFS v4.x to reduce code duplication and complexity
 - Remove unused cr_magic related code
 - Improvements to sunrpc "read from buffer" code
 - Clean up sunrpc timeout code and allow changing TCP timeout parameters
 - Remove duplicate mw_list management code in xprtrdma
 - Add generic functions for encoding and decoding xdr streams
 
 Bugfixes:
 - Clean up nfs_show_mountd_netid
 - Make layoutreturn_ops static and use NULL instead of 0 to fix sparse warnings
 - Properly handle -ERESTARTSYS in nfs_rename()
 - Check if register_shrinker() failed during rpcauth_init()
 - Properly clean up procfs/pipefs entries
 - Various NFS over RDMA related fixes
 - Silence unititialized variable warning in sunrpc
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAli3F7YACgkQ18tUv7Cl
 QOvzrQ//dL+nnBaqsm9bA2wwuVJSQ2R1zdkwHOCWghEWROZrQHzpi0VHu0ZKBLzr
 YsYFhHvIPax9Q8USY4B/QFQ3eUuZILEVn+xDruRxZaJPnsA4Zmr16VJwGF2F68Lh
 CGekA5qybqy8lAG6v96Gyjbi+JqjHNCmelYWRv7SX9IZcDjNJpsEbrSI4LkabTWh
 70WtCl3LBzVMRYRxe8+f0mcx4g4XCQ8pDaQRgRnfKtNeQk/+PgWz66xSNinDakVb
 A8AkaiUadPRgUTpap6HfBSicpRvtLQeLhARC0E4YE5pXp2H/kUt2MFe5szblfSCv
 zf2nrPUbNEHjBypFhERzCZZk6EonY6FeOojyW0g2C+rmPdK7WLlKbwTQFxdRGvsx
 78fIiPRdlDHDp9CXzD8V4xxRBJX/KkicA1Vp8CoyQtmpzpu2fjwT0kr9HeD+aEe6
 293+72QUfk05re2HYWF9MCGGVVLdnLLjrKCgwwRQ0HX5WF6GNQxX/yVgBVlqFeV3
 xc8m7ltKco5N9JxIqwlIpySq2e114EQOqsmHYz3gxd7ID9J1NJz+9H2z2EvgAKZ7
 wIPSLoZrdBdnoXG8ZDDTAvPKeB8l6egi6wjrvGKxewVlMbjzogdARsMKWoifnCfG
 HMkH+IEvLGvFc1pPeLbscJGEdVWXVn0thO+8fkS9F9sE/zMX9PA=
 =01DU
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Highlights include:

  Stable bugfixes:
   - NFSv4: Fix memory and state leak in _nfs4_open_and_get_state
   - xprtrdma: Fix Read chunk padding
   - xprtrdma: Per-connection pad optimization
   - xprtrdma: Disable pad optimization by default
   - xprtrdma: Reduce required number of send SGEs
   - nlm: Ensure callback code also checks that the files match
   - pNFS/flexfiles: If the layout is invalid, it must be updated before
     retrying
   - NFSv4: Fix reboot recovery in copy offload
   - Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION
     replies to OP_SEQUENCE"
   - NFSv4: fix getacl head length estimation
   - NFSv4: fix getacl ERANGE for sum ACL buffer sizes

  Features:
   - Add and use dprintk_cont macros
   - Various cleanups to NFS v4.x to reduce code duplication and
     complexity
   - Remove unused cr_magic related code
   - Improvements to sunrpc "read from buffer" code
   - Clean up sunrpc timeout code and allow changing TCP timeout
     parameters
   - Remove duplicate mw_list management code in xprtrdma
   - Add generic functions for encoding and decoding xdr streams

  Bugfixes:
   - Clean up nfs_show_mountd_netid
   - Make layoutreturn_ops static and use NULL instead of 0 to fix
     sparse warnings
   - Properly handle -ERESTARTSYS in nfs_rename()
   - Check if register_shrinker() failed during rpcauth_init()
   - Properly clean up procfs/pipefs entries
   - Various NFS over RDMA related fixes
   - Silence unititialized variable warning in sunrpc"

* tag 'nfs-for-4.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (64 commits)
  NFSv4: fix getacl ERANGE for some ACL buffer sizes
  NFSv4: fix getacl head length estimation
  Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE"
  NFSv4: Fix reboot recovery in copy offload
  pNFS/flexfiles: If the layout is invalid, it must be updated before retrying
  NFSv4: Clean up owner/group attribute decode
  SUNRPC: Add a helper function xdr_stream_decode_string_dup()
  NFSv4: Remove bogus "struct nfs_client" argument from decode_ace()
  NFSv4: Fix the underestimation of delegation XDR space reservation
  NFSv4: Replace callback string decode function with a generic
  NFSv4: Replace the open coded decode_opaque_inline() with the new generic
  NFSv4: Replace ad-hoc xdr encode/decode helpers with xdr_stream_* generics
  SUNRPC: Add generic helpers for xdr_stream encode/decode
  sunrpc: silence uninitialized variable warning
  nlm: Ensure callback code also checks that the files match
  sunrpc: Allow xprt->ops->timer method to sleep
  xprtrdma: Refactor management of mw_list field
  xprtrdma: Handle stale connection rejection
  xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
  xprtrdma: Shrink send SGEs array
  ...
2017-03-01 16:10:30 -08:00
David Howells 0837e49ab3 KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload()
rcu_dereference_key() and user_key_payload() are currently being used in
two different, incompatible ways:

 (1) As a wrapper to rcu_dereference() - when only the RCU read lock used
     to protect the key.

 (2) As a wrapper to rcu_dereference_protected() - when the key semaphor is
     used to protect the key and the may be being modified.

Fix this by splitting both of the key wrappers to produce:

 (1) RCU accessors for keys when caller has the key semaphore locked:

	dereference_key_locked()
	user_key_payload_locked()

 (2) RCU accessors for keys when caller holds the RCU read lock:

	dereference_key_rcu()
	user_key_payload_rcu()

This should fix following warning in the NFS idmapper

  ===============================
  [ INFO: suspicious RCU usage. ]
  4.10.0 #1 Tainted: G        W
  -------------------------------
  ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
  other info that might help us debug this:
  rcu_scheduler_active = 2, debug_locks = 0
  1 lock held by mount.nfs/5987:
    #0:  (rcu_read_lock){......}, at: [<d000000002527abc>] nfs_idmap_get_key+0x15c/0x420 [nfsv4]
  stack backtrace:
  CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G        W       4.10.0 #1
  Call Trace:
    dump_stack+0xe8/0x154 (unreliable)
    lockdep_rcu_suspicious+0x140/0x190
    nfs_idmap_get_key+0x380/0x420 [nfsv4]
    nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4]
    decode_getfattr_attrs+0xfac/0x16b0 [nfsv4]
    decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4]
    nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4]
    rpcauth_unwrap_resp+0xe8/0x140 [sunrpc]
    call_decode+0x29c/0x910 [sunrpc]
    __rpc_execute+0x140/0x8f0 [sunrpc]
    rpc_run_task+0x170/0x200 [sunrpc]
    nfs4_call_sync_sequence+0x68/0xa0 [nfsv4]
    _nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4]
    nfs4_lookup_root+0xe0/0x350 [nfsv4]
    nfs4_lookup_root_sec+0x70/0xa0 [nfsv4]
    nfs4_find_root_sec+0xc4/0x100 [nfsv4]
    nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4]
    nfs4_get_rootfh+0x6c/0x190 [nfsv4]
    nfs4_server_common_setup+0xc4/0x260 [nfsv4]
    nfs4_create_server+0x278/0x3c0 [nfsv4]
    nfs4_remote_mount+0x50/0xb0 [nfsv4]
    mount_fs+0x74/0x210
    vfs_kern_mount+0x78/0x220
    nfs_do_root_mount+0xb0/0x140 [nfsv4]
    nfs4_try_mount+0x60/0x100 [nfsv4]
    nfs_fs_mount+0x5ec/0xda0 [nfs]
    mount_fs+0x74/0x210
    vfs_kern_mount+0x78/0x220
    do_mount+0x254/0xf70
    SyS_mount+0x94/0x100
    system_call+0x38/0xe0

Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-03-02 10:09:00 +11:00
David S. Miller 16c54ac903 First round of fixes - details in the commits:
* use a valid hrtimer clock ID in mac80211_hwsim
  * don't reorder frames prior to BA session
  * flush a delayed work at suspend so the state is all valid before
    suspend/resume
  * fix packet statistics in fast-RX, the RX packets
    counter increment was simply missing
  * don't try to re-transmit filtered frames in an aggregation session
  * shorten (for tracing) a debug message
  * typo fix in another debug message
  * fix nul-termination with HWSIM_ATTR_RADIO_NAME in hwsim
  * fix mgmt RX processing when station is looked up by driver/device
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAli1HCYACgkQa3t4Rpy0
 AB1Sfw/+MeYZtAqeem6JzNW7OPPHgJIXHAnDyby79jzDiYZ1GF2c2asWMwBMshuH
 dTYFs15gdkfItrUF9SsvGFLRaRlPJO899lgL54EwcALxqP82Byylm36X29HTUyzB
 MS3lzrr84Ad5F/BkHa949ogcoJwmkmhSYpG5L5EclJs98Vu8pyd/AlBJ+fbNFmHz
 Pe2lDiIjt5H5MvGCZfl5PK1l9GwDLIMhLekn03HEfCN5lNz2c8aru1dhcCXxWSPG
 Oxj2J4n0WknwntKLx5F/Wceee8YDRyUBw0peKYxp8VQPPLVzpoc9GKlaCDBemwVQ
 sfVfgkkZchKtAt1SPJS5yKMfUlEQdqE3w3XIzzRS9bN3AsVK5EKL63zDO7HJ+5nZ
 2hB8kY4fjGUx0RrZKKL0jruSl+AUP5bWJUIt2VM55FM195rXUkDSjsn3M1ZIwiHp
 8fvwLWibKoQg7Y0UIdVQSl03Gfd2dEs0XrvD+fNMI9aNy/EriJZ15pxbNAPSi10Z
 pN02zFCmqxju6w1tppxW9eBoxdbt45lr1cyXY+GN6RTgm2ILM9i+tQHoNf4KuWYl
 5dydXhDkRGpUghevR4K29vBYtoaNK0CteY2xECdRGUVFnMl4QUgHInOeKcSneHCi
 Cvp31WYTAP+U/8ni0VYENqEC0iRbUNrMu2mkECxriHt5go+n5jA=
 =wths
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2017-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
First round of fixes - details in the commits:
 * use a valid hrtimer clock ID in mac80211_hwsim
 * don't reorder frames prior to BA session
 * flush a delayed work at suspend so the state is all valid before
   suspend/resume
 * fix packet statistics in fast-RX, the RX packets
   counter increment was simply missing
 * don't try to re-transmit filtered frames in an aggregation session
 * shorten (for tracing) a debug message
 * typo fix in another debug message
 * fix nul-termination with HWSIM_ATTR_RADIO_NAME in hwsim
 * fix mgmt RX processing when station is looked up by driver/device
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 15:08:34 -08:00
Eric Dumazet 449809a66c tcp/dccp: block BH for SYN processing
SYN processing really was meant to be handled from BH.

When I got rid of BH blocking while processing socket backlog
in commit 5413d1babe ("net: do not block BH while processing socket
backlog"), I forgot that a malicious user could transition to TCP_LISTEN
from a state that allowed (SYN) packets to be parked in the socket
backlog while socket is owned by the thread doing the listen() call.

Sure enough syzkaller found this and reported the bug ;)

=================================
[ INFO: inconsistent lock state ]
4.10.0+ #60 Not tainted
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
syz-executor0/5090 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&(&hashinfo->ehash_locks[i])->rlock){+.?...}, at:
[<ffffffff83a6a370>] spin_lock include/linux/spinlock.h:299 [inline]
 (&(&hashinfo->ehash_locks[i])->rlock){+.?...}, at:
[<ffffffff83a6a370>] inet_ehash_insert+0x240/0xad0
net/ipv4/inet_hashtables.c:407
{IN-SOFTIRQ-W} state was registered at:
  mark_irqflags kernel/locking/lockdep.c:2923 [inline]
  __lock_acquire+0xbcf/0x3270 kernel/locking/lockdep.c:3295
  lock_acquire+0x241/0x580 kernel/locking/lockdep.c:3753
  __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
  _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
  spin_lock include/linux/spinlock.h:299 [inline]
  inet_ehash_insert+0x240/0xad0 net/ipv4/inet_hashtables.c:407
  reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:753 [inline]
  inet_csk_reqsk_queue_hash_add+0x1b7/0x2a0 net/ipv4/inet_connection_sock.c:764
  tcp_conn_request+0x25cc/0x3310 net/ipv4/tcp_input.c:6399
  tcp_v4_conn_request+0x157/0x220 net/ipv4/tcp_ipv4.c:1262
  tcp_rcv_state_process+0x802/0x4130 net/ipv4/tcp_input.c:5889
  tcp_v4_do_rcv+0x56b/0x940 net/ipv4/tcp_ipv4.c:1433
  tcp_v4_rcv+0x2e12/0x3210 net/ipv4/tcp_ipv4.c:1711
  ip_local_deliver_finish+0x4ce/0xc40 net/ipv4/ip_input.c:216
  NF_HOOK include/linux/netfilter.h:257 [inline]
  ip_local_deliver+0x1ce/0x710 net/ipv4/ip_input.c:257
  dst_input include/net/dst.h:492 [inline]
  ip_rcv_finish+0xb1d/0x2110 net/ipv4/ip_input.c:396
  NF_HOOK include/linux/netfilter.h:257 [inline]
  ip_rcv+0xd90/0x19c0 net/ipv4/ip_input.c:487
  __netif_receive_skb_core+0x1ad1/0x3400 net/core/dev.c:4179
  __netif_receive_skb+0x2a/0x170 net/core/dev.c:4217
  netif_receive_skb_internal+0x1d6/0x430 net/core/dev.c:4245
  napi_skb_finish net/core/dev.c:4602 [inline]
  napi_gro_receive+0x4e6/0x680 net/core/dev.c:4636
  e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4033 [inline]
  e1000_clean_rx_irq+0x5e0/0x1490
drivers/net/ethernet/intel/e1000/e1000_main.c:4489
  e1000_clean+0xb9a/0x2910 drivers/net/ethernet/intel/e1000/e1000_main.c:3834
  napi_poll net/core/dev.c:5171 [inline]
  net_rx_action+0xe70/0x1900 net/core/dev.c:5236
  __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
  invoke_softirq kernel/softirq.c:364 [inline]
  irq_exit+0x19e/0x1d0 kernel/softirq.c:405
  exiting_irq arch/x86/include/asm/apic.h:658 [inline]
  do_IRQ+0x81/0x1a0 arch/x86/kernel/irq.c:250
  ret_from_intr+0x0/0x20
  native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:53
  arch_safe_halt arch/x86/include/asm/paravirt.h:98 [inline]
  default_idle+0x8f/0x410 arch/x86/kernel/process.c:271
  arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:262
  default_idle_call+0x36/0x60 kernel/sched/idle.c:96
  cpuidle_idle_call kernel/sched/idle.c:154 [inline]
  do_idle+0x348/0x440 kernel/sched/idle.c:243
  cpu_startup_entry+0x18/0x20 kernel/sched/idle.c:345
  start_secondary+0x344/0x440 arch/x86/kernel/smpboot.c:272
  verify_cpu+0x0/0xfc
irq event stamp: 1741
hardirqs last  enabled at (1741): [<ffffffff84d49d77>]
__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160
[inline]
hardirqs last  enabled at (1741): [<ffffffff84d49d77>]
_raw_spin_unlock_irqrestore+0xf7/0x1a0 kernel/locking/spinlock.c:191
hardirqs last disabled at (1740): [<ffffffff84d4a732>]
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
hardirqs last disabled at (1740): [<ffffffff84d4a732>]
_raw_spin_lock_irqsave+0xa2/0x110 kernel/locking/spinlock.c:159
softirqs last  enabled at (1738): [<ffffffff84d4deff>]
__do_softirq+0x7cf/0xb7d kernel/softirq.c:310
softirqs last disabled at (1571): [<ffffffff84d4b92c>]
do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&hashinfo->ehash_locks[i])->rlock);
  <Interrupt>
    lock(&(&hashinfo->ehash_locks[i])->rlock);

 *** DEADLOCK ***

1 lock held by syz-executor0/5090:
 #0:  (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83406b43>] lock_sock
include/net/sock.h:1460 [inline]
 #0:  (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83406b43>]
sock_setsockopt+0x233/0x1e40 net/core/sock.c:683

stack backtrace:
CPU: 1 PID: 5090 Comm: syz-executor0 Not tainted 4.10.0+ #60
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x292/0x398 lib/dump_stack.c:51
 print_usage_bug+0x3ef/0x450 kernel/locking/lockdep.c:2387
 valid_state kernel/locking/lockdep.c:2400 [inline]
 mark_lock_irq kernel/locking/lockdep.c:2602 [inline]
 mark_lock+0xf30/0x1410 kernel/locking/lockdep.c:3065
 mark_irqflags kernel/locking/lockdep.c:2941 [inline]
 __lock_acquire+0x6dc/0x3270 kernel/locking/lockdep.c:3295
 lock_acquire+0x241/0x580 kernel/locking/lockdep.c:3753
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:299 [inline]
 inet_ehash_insert+0x240/0xad0 net/ipv4/inet_hashtables.c:407
 reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:753 [inline]
 inet_csk_reqsk_queue_hash_add+0x1b7/0x2a0 net/ipv4/inet_connection_sock.c:764
 dccp_v6_conn_request+0xada/0x11b0 net/dccp/ipv6.c:380
 dccp_rcv_state_process+0x51e/0x1660 net/dccp/input.c:606
 dccp_v6_do_rcv+0x213/0x350 net/dccp/ipv6.c:632
 sk_backlog_rcv include/net/sock.h:896 [inline]
 __release_sock+0x127/0x3a0 net/core/sock.c:2052
 release_sock+0xa5/0x2b0 net/core/sock.c:2539
 sock_setsockopt+0x60f/0x1e40 net/core/sock.c:1016
 SYSC_setsockopt net/socket.c:1782 [inline]
 SyS_setsockopt+0x2fb/0x3a0 net/socket.c:1765
 entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x4458b9
RSP: 002b:00007fe8b26c2b58 EFLAGS: 00000292 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00000000004458b9
RDX: 000000000000001a RSI: 0000000000000001 RDI: 0000000000000006
RBP: 00000000006e2110 R08: 0000000000000010 R09: 0000000000000000
R10: 00000000208c3000 R11: 0000000000000292 R12: 0000000000708000
R13: 0000000020000000 R14: 0000000000001000 R15: 0000000000000000

Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 15:03:31 -08:00
Yotam Gigi df2c43343b bridge: Fix error path in nbp_vlan_init
Fix error path order in nbp_vlan_init, so if switchdev_port_attr_set
call failes, the vlan_hash wouldn't be destroyed before inited.

Fixes: efa5356b0d ("bridge: per vlan dst_metadata netlink support")
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 14:55:28 -08:00
Liping Zhang 3b45a4106f net: route: add missing nla_policy entry for RTA_MARK attribute
This will add stricter validating for RTA_MARK attribute.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 10:25:56 -08:00
Felix Jia 8c171d6ca5 net/ipv6: avoid possible dead locking on addr_gen_mode sysctl
The addr_gen_mode variable can be accessed by both sysctl and netlink.
Repleacd rtnl_lock() with rtnl_trylock() protect the sysctl operation to
avoid the possbile dead lock.`

Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 10:22:48 -08:00
Eric Dumazet 39e6c8208d net: solve a NAPI race
While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is
available.
                                          device hard irq

                                          // does nothing because
NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();

Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)

This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.

In v3, I added more details in the changelog and clears
NAPI_STATE_MISSED in busy_poll_stop()

In v4, I added the ideas given by Alexander Duyck in v3 review

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 09:50:58 -08:00