mirror-linux/net
Jakub Sitnicki 91d0b78c51 inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both the destination IP
   and the destination port.
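
As a rough illustration of the setup above, the sketch below splits an
ephemeral port range into equal disjoint slices, one per host, and maps the
destination port of return traffic back to the owning host. The function
names, the 32768-60999 range, and the equal-slice policy are assumptions
made for illustration only; they are not part of this change.

```python
def partition_port_range(lo, hi, n_hosts):
    """Split the inclusive range [lo, hi] into n_hosts disjoint slices."""
    total = hi - lo + 1
    if n_hosts <= 0 or n_hosts > total:
        raise ValueError("need between 1 and %d hosts" % total)
    size = total // n_hosts  # ports per host; leftover ports stay unassigned
    return [(lo + i * size, lo + (i + 1) * size - 1) for i in range(n_hosts)]


def host_for_port(ranges, port):
    """Return the index of the host owning port, or None if unassigned."""
    for idx, (lo, hi) in enumerate(ranges):
        if lo <= port <= hi:
            return idx
    return None


# Hypothetical deployment: four hosts sharing one egress IP address.
ranges = partition_port_range(32768, 60999, 4)
```

A router in such a setup would apply the equivalent of host_for_port() to
the destination port of each returning packet.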

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
   the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in user space. If
      the chosen 4-tuple happens to be busy, the application needs to retry
      from a different local port number.

      Detecting whether the 4-tuple is busy can be either easy (TCP) or
      hard (UDP). In the TCP case, the application simply has to check if
      connect() returned an error (EADDRNOTAVAIL). That is assuming that
      local port sharing was enabled (SO_REUSEADDR) by all the sockets.

        # Assume desired local port range is 60_000-60_511
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("192.0.2.1", 60_000))
        s.connect(("1.1.1.1", 53))
        # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
        # Application must retry with another local port

      In the case of UDP, the network stack allows binding more than one
      socket to the same 4-tuple when local port sharing is enabled
      (SO_REUSEADDR). Hence detecting the conflict is much harder and
      involves querying sock_diag and toggling the SO_REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means
      that no connecting sockets, that is those which leave it to the
      network stack to find a free local port at connect() time, can use
      this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
      will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the per-netns
   ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used
   only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses
     used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and
   4-tuple conflict detection are done by the network stack:

     system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

     s = socket(AF_INET, SOCK_STREAM)
     s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
     s.bind(("192.0.2.1", 0))
     s.connect(("1.1.1.1", 53))
     # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

   For UDP, this approach has limited applicability. Setting the
   IP_BIND_ADDRESS_NO_PORT socket option does not result in the local
   source port being shared with other connected UDP sockets.

   Hence relying on the network stack to find a free source port limits the
   number of outgoing UDP flows from a single IP address to the number of
   available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.
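
To make the cost of approach (1) concrete for TCP, here is a minimal sketch
of the retry loop an application has to carry today. The helper name
connect_from_range() and the treatment of EADDRINUSE as a retryable error
are assumptions of this sketch, not something this change provides.

```python
import errno
import socket


def connect_from_range(local_ip, port_lo, port_hi, remote):
    """Try each local port in [port_lo, port_hi] until connect() succeeds."""
    for port in range(port_lo, port_hi + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # Every socket sharing the port must enable local port sharing.
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind((local_ip, port))
            s.connect(remote)
            return s
        except OSError as e:
            s.close()
            # EADDRNOTAVAIL: the 4-tuple is busy.
            # EADDRINUSE: the port is held by a socket without SO_REUSEADDR.
            if e.errno not in (errno.EADDRNOTAVAIL, errno.EADDRINUSE):
                raise
    raise OSError(errno.EADDRNOTAVAIL, "no free local port in range")
```

The linear scan always starts from port_lo, so busy low ports are retried on
every call. A real application would likely randomize the starting offset.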

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.
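
The narrowing rule can be modelled in a few lines. This is a rough model of
the semantics described above, not the kernel code: the function name is
hypothetical, and applying each bound independently when the two ranges only
partially overlap is an assumption of this sketch.

```python
def effective_port_range(netns_range, sock_range):
    """Model how a per-socket range narrows the per-netns range."""
    lo, hi = netns_range
    sk_lo, sk_hi = sock_range
    # A per-socket bound is honored only if it falls inside the netns range.
    if lo <= sk_lo <= hi:
        lo = sk_lo
    if lo <= sk_hi <= hi:
        hi = sk_hi
    return lo, hi
```

For example, with a per-netns range of 32768-60999, a per-socket range of
40000-40511 is used as is, while 1024-2047 falls outside and leaves the
per-netns range in effect.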

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

  PORT_LO = 40_000
  PORT_HI = 40_511

  s = socket(AF_INET, SOCK_STREAM)
  v = struct.pack("I", PORT_HI << 16 | PORT_LO)
  s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
  s.bind(("127.0.0.1", 0))
  s.getsockname()
  # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
  # if there is a free port. EADDRINUSE otherwise.
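
The packing used above can be wrapped in a pair of small helpers. The names
pack_port_range() and unpack_port_range() are hypothetical; the layout (high
bound in the upper 16 bits, low bound in the lower 16 bits, host byte order)
follows the description in this message.

```python
import struct


def pack_port_range(lo, hi):
    """Encode (lo, hi) as a u32 in host byte order, hi in the upper 16 bits."""
    if not (0 <= lo <= 0xFFFF and 0 <= hi <= 0xFFFF):
        raise ValueError("port bounds must fit in 16 bits")
    return struct.pack("I", (hi << 16) | lo)


def unpack_port_range(value):
    """Decode a packed u32 back into a (lo, hi) tuple."""
    (raw,) = struct.unpack("I", value)
    return raw & 0xFFFF, raw >> 16
```

The result of pack_port_range(PORT_LO, PORT_HI) can be passed to
setsockopt() in place of the hand-packed value v.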

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25 22:45:00 -08:00
6lowpan
9p xen: branch for v6.2-rc4 2023-01-12 17:02:20 -06:00
802 treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
8021q
appletalk
atm driver core: make struct class.dev_uevent() take a const * 2022-11-24 17:12:15 +01:00
ax25 ax25: af_ax25: Remove unnecessary (void*) conversions 2022-11-16 13:31:03 +00:00
batman-adv Networking changes for 6.2. 2022-12-13 15:47:48 -08:00
bluetooth net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
bpf New Feature: 2022-12-17 14:06:53 -06:00
bpfilter
bridge treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
caif caif: don't assume iov_iter type 2023-01-13 20:44:20 -08:00
can Networking changes for 6.2. 2022-12-13 15:47:48 -08:00
ceph net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
core net: avoid irqsave in skb_defer_free_flush 2023-01-23 22:08:06 -08:00
dcb net: dcb: add helper functions to retrieve PCP and DSCP rewrite maps 2023-01-20 09:33:22 +00:00
dccp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-11-29 13:04:52 -08:00
devlink devlink: remove a dubious assumption in fmsg dumping 2023-01-24 20:31:35 -08:00
dns_resolver
dsa net: dsa: microchip: enable port queues for tc mqprio 2023-01-23 22:12:35 -08:00
ethernet net: ethernet: use sysfs_emit() to instead of scnprintf() 2022-12-07 20:02:44 -08:00
ethtool net: ethtool: fix NULL pointer dereference in pause_prepare_data() 2023-01-25 09:57:41 +00:00
hsr hsr: Use a single struct for self_node. 2022-12-01 20:26:22 -08:00
ieee802154 Merge tag 'ieee802154-for-net-next-2022-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next 2022-12-07 17:33:26 -08:00
ife
ipv4 inet: Add IP_LOCAL_PORT_RANGE socket option 2023-01-25 22:45:00 -08:00
ipv6 ipv6: Make ip6_route_output_flags_noref() static. 2023-01-24 18:12:52 -08:00
iucv
kcm net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
key Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2022-11-29 20:50:51 -08:00
l2tp l2tp: prevent lockdep issue in l2tp_tunnel_register() 2023-01-18 14:44:54 +00:00
l3mdev
lapb
llc
mac80211 wifi: mac80211: drop extra 'e' from ieeee80211... name 2023-01-19 14:57:51 +01:00
mac802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-12-08 18:19:59 -08:00
mctp mctp: Remove device type check at unregister 2022-12-19 17:20:22 -08:00
mpls
mptcp net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
ncsi net/ncsi: Silence runtime memcpy() false positive warning 2022-12-06 17:29:14 -08:00
netfilter net: Kconfig: fix spellos 2023-01-25 22:39:56 -08:00
netlabel
netlink Networking changes for 6.2. 2022-12-13 15:47:48 -08:00
netrom
nfc net: nfc: Fix use-after-free in local_cleanup() 2023-01-13 20:53:44 -08:00
nsh
openvswitch net: openvswitch: release vport resources on failure 2022-12-21 17:48:12 -08:00
packet Networking changes for 6.2. 2022-12-13 15:47:48 -08:00
phonet net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
psample
qrtr net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
rds net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
rfkill Merge wireless into wireless-next 2023-01-17 13:36:25 +02:00
rose
rxrpc rxrpc: Fix wrong error return in rxrpc_connect_call() 2023-01-12 21:51:55 -08:00
sched net: Kconfig: fix spellos 2023-01-25 22:39:56 -08:00
sctp inet: Add IP_LOCAL_PORT_RANGE socket option 2023-01-25 22:45:00 -08:00
smc net/smc: De-tangle ism and smc device initialization 2023-01-25 09:46:49 +00:00
strparser
sunrpc net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
switchdev
tipc net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
tls net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
unix unix: Improve locking scheme in unix_show_fdinfo() 2023-01-16 11:21:11 +00:00
vmw_vsock virtio/vsock: replace virtio_vsock_pkt with sk_buff 2023-01-16 13:26:33 +00:00
wireless wifi: wireless: deny wireless extensions on MLO-capable devices 2023-01-19 20:01:41 +02:00
x25 net/x25: Fix skb leak in x25_lapb_receive_frame() 2022-11-15 20:22:19 -08:00
xdp bpf: Expand map key argument of bpf_redirect_map to u64 2022-11-15 09:00:27 -08:00
xfrm net/sock: Introduce trace_sk_data_ready() 2023-01-23 11:26:50 +00:00
Kconfig
Kconfig.debug
Makefile devlink: move code to a dedicated directory 2023-01-05 22:12:00 -08:00
compat.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
devres.c
socket.c sock: add tracepoint for send recv length 2023-01-13 10:25:10 +00:00
sysctl_net.c