mirror-linux/net/ipv4
mfreemon@cloudflare.com 0796c53424 tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
[ Upstream commit b650d953cd ]

Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session.  A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time.  In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-19 23:08:54 +02:00
..
bpfilter
netfilter netfilter: tproxy: fix deadlock due to missing BH disable 2023-03-17 08:50:25 +01:00
Kconfig tcp: configurable source port perturb table size 2022-11-16 13:02:04 +00:00
Makefile
af_inet.c ipv4: fix data-races around inet->inet_id 2023-08-30 16:11:02 +02:00
ah4.c xfrm: ah: add extack to ah_init_state, ah6_init_state 2022-09-29 07:17:59 +02:00
arp.c neighbour: annotate lockless accesses to n->nud_state 2023-10-10 22:00:42 +02:00
bpf_tcp_ca.c bpf: Use 0 instead of NOT_INIT for btf_struct_access() writes 2022-09-10 17:27:32 -07:00
cipso_ipv4.c cipso: Fix data-races around sysctl. 2022-07-08 12:10:33 +01:00
datagram.c ipv4: fix data-races around inet->inet_id 2023-08-30 16:11:02 +02:00
devinet.c net: ipv4: fix one memleak in __inet_del_ifa() 2023-09-19 12:28:08 +02:00
esp4.c net: ipv4: Use kfree_sensitive instead of kfree 2023-07-27 08:50:45 +02:00
esp4_offload.c xfrm: Linearize the skb after offloading if needed. 2023-06-28 11:12:29 +02:00
fib_frontend.c ipv4: Fix incorrect table ID in IOCTL path 2023-03-22 13:33:50 +01:00
fib_lookup.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-17 11:44:20 -08:00
fib_notifier.c
fib_rules.c ipv4: remove unnecessary type castings 2022-04-30 15:12:58 +01:00
fib_semantics.c neighbour: switch to standard rcu, instead of rcu_bh 2023-10-10 22:00:42 +02:00
fib_trie.c ipv4: annotate data-races around fi->fib_dead 2023-09-19 12:28:00 +02:00
fou.c genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
gre_demux.c
gre_offload.c net: gro: skb_gro_header helper function 2022-08-25 10:33:21 +02:00
icmp.c icmp: guard against too small mtu 2023-04-13 16:55:21 +02:00
igmp.c igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU 2023-09-13 09:42:59 +02:00
inet_connection_sock.c tcp: annotate data-races around icsk->icsk_syn_retries 2023-07-27 08:50:48 +02:00
inet_diag.c net: annotate data-races around sk->sk_mark 2023-08-11 12:08:14 +02:00
inet_fragment.c ipv4: remove unnecessary type castings 2022-04-30 15:12:58 +01:00
inet_hashtables.c tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address. 2023-09-19 12:28:10 +02:00
inet_timewait_sock.c Revert "tcp: avoid the lookup process failing to get sk in ehash table" 2023-07-27 08:50:45 +02:00
inetpeer.c inetpeer: Fix data-races around sysctl. 2022-07-08 12:10:33 +01:00
ip_forward.c ip: Fix data-races around sysctl_ip_fwd_update_priority. 2022-07-15 11:49:55 +01:00
ip_fragment.c net: ip: Handle delivery_time in ip defrag 2022-03-03 14:38:48 +00:00
ip_gre.c erspan: do not use skb_mac_header() in ndo_start_xmit() 2023-03-30 12:49:09 +02:00
ip_input.c ipv4: ignore dst hint for multipath routes 2023-09-19 12:28:01 +02:00
ip_options.c ipv4: drop fragmentation code from ip_options_build() 2022-01-29 17:53:07 +00:00
ip_output.c neighbour: switch to standard rcu, instead of rcu_bh 2023-10-10 22:00:42 +02:00
ip_sockglue.c net: annotate data-races around sk->sk_priority 2023-08-11 12:08:15 +02:00
ip_tunnel.c net: tunnels: annotate lockless accesses to dev->needed_headroom 2023-03-22 13:33:46 +01:00
ip_tunnel_core.c tunnels: fix kasan splat when generating ipv4 pmtu error 2023-08-16 18:27:27 +02:00
ip_vti.c ip_vti: fix potential slab-use-after-free in decode_session6 2023-08-23 17:52:32 +02:00
ipcomp.c xfrm: ipcomp: add extack to ipcomp{4,6}_init_state 2022-09-29 07:18:00 +02:00
ipconfig.c Driver core / kernfs changes for 6.0-rc1 2022-08-04 11:31:20 -07:00
ipip.c net: Add helper function to parse netlink msg of ip_tunnel_parm 2022-10-03 07:59:06 +01:00
ipmr.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-09-22 13:02:10 -07:00
ipmr_base.c ipmr: adopt rcu_read_lock() in mr_dump() 2022-06-24 11:34:38 +01:00
metrics.c ipv4: prevent potential spectre v1 gadget in ip_metrics_convert() 2023-02-01 08:34:45 +01:00
netfilter.c netfilter: Use l3mdev flow key when re-routing mangled packets 2022-05-16 13:03:29 +02:00
netlink.c
nexthop.c neighbour: switch to standard rcu, instead of rcu_bh 2023-10-10 22:00:42 +02:00
ping.c ping: Fix potentail NULL deref for /proc/net/icmp. 2023-04-13 16:55:24 +02:00
proc.c tcp: Don't allocate tcp_death_row outside of struct netns_ipv4. 2022-09-20 10:21:49 -07:00
protocol.c
raw.c net: annotate data-races around sk->sk_priority 2023-08-11 12:08:15 +02:00
raw_diag.c raw: Fix NULL deref in raw_get_next(). 2023-04-13 16:55:23 +02:00
route.c ipv4: Set offload_failed flag in fibmatch results 2023-10-10 22:00:43 +02:00
syncookies.c mptcp: remove MPTCP 'ifdef' in TCP SYN cookies 2023-01-07 11:11:44 +01:00
sysctl_net_ipv4.c tcp: enforce receive buffer memory limits by allowing the tcp window to shrink 2023-10-19 23:08:54 +02:00
tcp.c bpf: tcp_read_skb needs to pop skb regardless of seq 2023-10-10 22:00:41 +02:00
tcp_bbr.c bpf: Switch to new kfunc flags infrastructure 2022-07-21 20:59:42 -07:00
tcp_bic.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_bpf.c bpf, sockmap: Do not inc copied_seq when PEEK flag set 2023-10-10 22:00:41 +02:00
tcp_cdg.c Random number generator fixes for Linux 6.1-rc1. 2022-10-16 15:27:07 -07:00
tcp_cong.c tcp: Add tracepoint for tcp_set_ca_state 2022-04-07 20:33:15 -07:00
tcp_cubic.c bpf: Switch to new kfunc flags infrastructure 2022-07-21 20:59:42 -07:00
tcp_dctcp.c bpf: Switch to new kfunc flags infrastructure 2022-07-21 20:59:42 -07:00
tcp_dctcp.h
tcp_diag.c tcp: Access &tcp_hashinfo via net. 2022-09-20 10:21:49 -07:00
tcp_fastopen.c tcp: annotate data-races around fastopenq.max_qlen 2023-07-27 08:50:49 +02:00
tcp_highspeed.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_htcp.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_hybla.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_illinois.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_input.c tcp: fix delayed ACKs for MSS boundary condition 2023-10-10 22:00:43 +02:00
tcp_ipv4.c tcp: enforce receive buffer memory limits by allowing the tcp window to shrink 2023-10-19 23:08:54 +02:00
tcp_lp.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_metrics.c tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen 2023-08-11 12:08:18 +02:00
tcp_minisocks.c tcp: annotate data-races around tcp_rsk(req)->ts_recent 2023-07-27 08:50:45 +02:00
tcp_nv.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_offload.c tcp: gso: really support BIG TCP 2023-06-14 11:15:20 +02:00
tcp_output.c tcp: enforce receive buffer memory limits by allowing the tcp window to shrink 2023-10-19 23:08:54 +02:00
tcp_rate.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-04-28 13:02:01 -07:00
tcp_recovery.c tcp: Fix data-races around sysctl_tcp_recovery. 2022-07-20 10:14:50 +01:00
tcp_scalable.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_timer.c net: tcp: fix unexcepted socket die when snd_wnd is 0 2023-09-13 09:42:32 +02:00
tcp_ulp.c net/ulp: use consistent error code when blocking ULP 2023-01-24 07:24:43 +01:00
tcp_vegas.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_vegas.h
tcp_veno.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_westwood.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_yeah.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tunnel4.c
udp.c net: annotate data-races around sk->sk_forward_alloc 2023-09-19 12:28:01 +02:00
udp_bpf.c bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser() 2023-03-17 08:50:24 +01:00
udp_diag.c
udp_impl.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
udp_offload.c gro: remove rcu_read_lock/rcu_read_unlock from gro_complete handlers 2021-11-24 17:21:42 -08:00
udp_tunnel_core.c net/tunnel: wait until all sk_user_data reader finish before releasing the sock 2022-12-31 13:32:27 +01:00
udp_tunnel_nic.c udp_tunnel: Fix end of loop test in udp_tunnel_nic_unregister() 2022-02-23 12:35:00 +00:00
udp_tunnel_stub.c
udplite.c udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated(). 2023-05-30 14:03:20 +01:00
xfrm4_input.c xfrm: fix inbound ipv4/udp/esp packets to UDPv6 dualstack sockets 2023-06-28 11:12:28 +02:00
xfrm4_output.c
xfrm4_policy.c net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
xfrm4_protocol.c net: xfrm: unexport __init-annotated xfrm4_protocol_init() 2022-06-08 10:10:13 -07:00
xfrm4_state.c
xfrm4_tunnel.c xfrm: tunnel: add extack to ipip_init_state, xfrm6_tunnel_init_state 2022-09-29 07:18:00 +02:00