<feed xmlns='http://www.w3.org/2005/Atom'>
<title>passt/tcp_conn.h, branch 2026_05_07.1afd4ed</title>
<subtitle>Plug A Simple Socket Transport</subtitle>
<link rel='alternate' type='text/html' href='https://passt.top/passt/'/>
<entry>
<title>tcp: Use SO_MEMINFO for accurate send buffer overhead accounting</title>
<updated>2026-05-07T06:03:14+00:00</updated>
<author>
<name>Jon Maloy</name>
<email>jmaloy@redhat.com</email>
</author>
<published>2026-04-25T19:58:18+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=a458719f01838f0d5867e817d84a497637999ee1'/>
<id>a458719f01838f0d5867e817d84a497637999ee1</id>
<content type='text'>
The TCP window advertised to the guest/container must balance two
competing needs: large enough to trigger kernel socket buffer
auto-tuning, but not so large that sendmsg() partially fails, causing
retransmissions.

The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
SNDBUF_GET() returns a scaled value that only roughly accounts for
per-skb overhead. The clamped_scale approximation doesn't accurately
track the actual per-segment overhead, which can lead to both excessive
retransmissions and reduced throughput.

We now use SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and
SK_MEMINFO_WMEM_QUEUED from the kernel. The latter is expressed in the
kernel's own accounting units, i.e. including the sk_buff overhead,
and matches exactly what the kernel's own sk_stream_memory_free()
function uses.

When data is queued and the overhead ratio is observable, we calculate
the per-segment overhead as (wmem_queued - sendq) / num_segments, then
determine how many additional segments should fit in the remaining
buffer space, considering the calculated per-segment overhead. This approach
treats segments as discrete quantities, and produces a more accurate
estimate of available buffer space than a linear scaling factor does.

When the ratio cannot be observed, e.g. because the queue is empty or
we are in a transient state, we fall back to the existing clamped_scale
calculation (scaling between 100% and 75% of buffer capacity).

When SO_MEMINFO succeeds, we also use SK_MEMINFO_SNDBUF directly to
set SNDBUF, avoiding a separate SO_SNDBUF getsockopt() call.

If SO_MEMINFO is unavailable, we fall back to the pre-existing
SNDBUF_GET() - SIOCOUTQ calculation.

Link: https://bugs.passt.top/show_bug.cgi?id=138
Link: https://github.com/containers/podman/issues/28219
Analysed-by: Yumei Huang &lt;yuhuang@redhat.com&gt;
Signed-off-by: Jon Maloy &lt;jmaloy@redhat.com&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Avoid comparison of expressions with different signedness in RTT_SET()</title>
<updated>2026-03-10T14:25:05+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2026-03-06T07:02:35+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=994bb76578645d181d315e4001d5fc0dfc92fcf6'/>
<id>994bb76578645d181d315e4001d5fc0dfc92fcf6</id>
<content type='text'>
With gcc 14.2, building against musl 1.2.5 (slightly outdated Alpine
on x86_64):

tcp.c: In function 'tcp_update_seqack_wnd':
util.h:40:39: warning: comparison of integer expressions of different signedness: 'unsigned int' and 'int' [-Wsign-compare]
   40 | #define MIN(x, y)               (((x) &lt; (y)) ? (x) : (y))
      |                                       ^
tcp_conn.h:63:26: note: in expansion of macro 'MIN'
   63 |         (conn-&gt;rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
      |                          ^~~
tcp.c:1234:17: note: in expansion of macro 'RTT_SET'
 1234 |                 RTT_SET(conn, tinfo-&gt;tcpi_rtt);
      |                 ^~~~~~~
util.h:40:54: warning: operand of '?:' changes signedness from 'int' to 'unsigned int' due to unsignedness of other operand [-Wsign-compare]
   40 | #define MIN(x, y)               (((x) &lt; (y)) ? (x) : (y))
      |                                                      ^~~
tcp_conn.h:63:26: note: in expansion of macro 'MIN'
   63 |         (conn-&gt;rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
      |                          ^~~
tcp.c:1234:17: note: in expansion of macro 'RTT_SET'
 1234 |                 RTT_SET(conn, tinfo-&gt;tcpi_rtt);
      |                 ^~~~~~~

For some reason, this is not reported by gcc when building against glibc.

Cast the result of ilog2() to unsigned before using it, and introduce
0 as a lower bound, to make it obvious that we expect the argument to
always be valid, the way we're using it.

Suggested-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Fixes: 000601ba86da ("tcp: Adaptive interval based on RTT for socket-side acknowledgement checks")
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
</entry>
<entry>
<title>Add missing includes to headers</title>
<updated>2026-03-04T16:39:57+00:00</updated>
<author>
<name>Peter Foley</name>
<email>pefoley@google.com</email>
</author>
<published>2026-02-23T18:11:19+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=adbf5c135f19db5b6751393b7f5cbf516031bde8'/>
<id>adbf5c135f19db5b6751393b7f5cbf516031bde8</id>
<content type='text'>
Support build systems like bazel that check that headers are
self-contained.

Also update includes so that clang-include-cleaner succeeds.

Tested with:
clang-include-cleaner-19 --extra-arg=-D_GNU_SOURCE --extra-arg=-DPAGE_SIZE=4096 --extra-arg=-DVERSION=\"git\" --extra-arg=-DHAS_GETRANDOM *.h *.c

Signed-off-by: Peter Foley &lt;pefoley@google.com&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Send TCP keepalive segments after a period of tap-side inactivity</title>
<updated>2026-02-24T23:17:45+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-02-04T11:41:37+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=d2f7c21cfb949f2b1587b9475917efdd6ac549fd'/>
<id>d2f7c21cfb949f2b1587b9475917efdd6ac549fd</id>
<content type='text'>
There are several circumstances in which a live, but idle TCP connection
can be forgotten by a guest, with no "on the wire" indication that this has
happened.  The most obvious is if the guest abruptly reboots.  A more
subtle case can happen with a half-closed connection, specifically one
in FIN_WAIT_2 state on the guest.  A connection can, legitimately, remain
in this state indefinitely.  If, however, a socket in this state is closed
by userspace, Linux at least will remove the kernel socket after 60s
(or as configured in the net.ipv4.tcp_fin_timeout sysctl).

Because there's no "on the wire" indication in these cases, passt will
pointlessly retain the connection in its flow table, at least until it is
removed by the inactivity timeout after several hours.

To avoid keeping connections around for so long in this state, add
functionality to periodically send TCP keepalive segments to the guest if
we've seen no activity on the tap interface.  If the guest is no longer
aware of the connection, it should respond with an RST which will let
passt remove the stale entry.

To do this we use a method similar to the inactivity timeout - a 1-bit
page replacement / clock algorithm, but with a shorter interval, and only
checking for tap side activity.  Currently we use a 300s interval, meaning
we'll send a keepalive after 5-10 minutes of (tap side) inactivity.

Link: https://bugs.passt.top/show_bug.cgi?id=179
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Re-introduce inactivity timeouts based on a clock algorithm</title>
<updated>2026-02-24T23:17:38+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-02-04T11:41:35+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=1820103fbbf13df98257a3f5c3ba625de624b0b3'/>
<id>1820103fbbf13df98257a3f5c3ba625de624b0b3</id>
<content type='text'>
We previously had a mechanism to remove TCP connections which were
inactive for 2 hours.  That was broken for a long time, due to poor
interactions with the timerfd handling, so we removed it.

Adding this long-scale timer onto the timerfd handling, which mostly
handles much shorter timeouts, is tricky to reason about. However, for the
inactivity timeouts, we don't require precision.  Instead, we can use
a 1-bit page replacement / "clock" algorithm.  Every INACTIVITY_INTERVAL
(2 hours), a global timer marks every TCP connection as tentatively
inactive.  That flag is cleared if we get any events, either tap side or
socket side.

If the inactive flag is still set when the next INACTIVITY_INTERVAL expires
then the connection has been inactive for an extended period and we reset
and close it.  In practice this means that connections will be removed
after 2-4 hours of inactivity.

This is not a true fix for bug 179, but it does mitigate the damage by
limiting the time that inactive connections will remain around.

Link: https://bugs.passt.top/show_bug.cgi?id=179
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>migrate: Use forward table information to close() listening sockets</title>
<updated>2026-01-31T03:25:08+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-01-30T05:58:11+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=af7b81b5408da8c56bb22dd11679f2b4024a45c8'/>
<id>af7b81b5408da8c56bb22dd11679f2b4024a45c8</id>
<content type='text'>
On incoming migrations we need to bind() reconstructed sockets to their
correct local address.  We can't do this if the origin passt instance is
in the same namespace and still has those addresses bound.  Arguably that's
a bug in bind()'s operation during repair mode, but for now we have to work
around it.

So, to allow local-to-local migrations we close() sockets on the outgoing
side as we process them.  In addition to closing the connected socket we
also have to close the associated listen()ing socket, because that can also
cause an address conflict.

To do that, we introduced the listening_sock field in the connection
state, because we had no other way to find the right listening sockets.
Now that we have the forwarding table, we have a complete list of
listening sockets elsewhere.  We can use that instead, to close all
listening sockets on outbound migration, rather than just the ones that
might conflict.

This is cleaner and, importantly, saves a valuable 32 bits in the flow
state structure.  It does mean that there is a longer window where a peer
attempting to connect during migration might get a Connection Refused.
I think this is an acceptable trade-off for now: arguably we should not
allow local-to-local migrations in any case, since the socket closes make
it impossible to safely roll back migration as per the qemu model.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
[sbrivio: Adjust comment to tcp_flow_migrate_source()]
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Adaptive interval based on RTT for socket-side acknowledgement checks</title>
<updated>2025-12-08T08:15:36+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-12-03T19:04:21+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=000601ba86da0d876fc91e0813a1e752540666f1'/>
<id>000601ba86da0d876fc91e0813a1e752540666f1</id>
<content type='text'>
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (&gt; 100 Mbps)
bandwidth and non-local but low (&lt; 5 ms) RTT.

Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time while
we could acknowledge more data, instead.

Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).

Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.

This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint on
the same machine (while possibly in a different namespace) if the
RTT is reported as 10 us or less.

This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.

Then, if the RTT is "low", we don't really need a timer to
acknowledge data as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store addresses of local destinations, so the low RTT property of a
connection might toggle frequently. Because of this, it's actually
helpful to always have the RTT approximation stored.

This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Clamp the retry timeout</title>
<updated>2025-12-02T22:05:08+00:00</updated>
<author>
<name>Yumei Huang</name>
<email>yuhuang@redhat.com</email>
</author>
<published>2025-12-02T03:00:07+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=1a834879a2f7ab138c12cd65c610f71eece8a939'/>
<id>1a834879a2f7ab138c12cd65c610f71eece8a939</id>
<content type='text'>
Clamp the TCP retry timeout as Linux kernel does. If a retry occurs
during the handshake and the RTO is below 3 seconds, re-initialise
it to 3 seconds for data retransmissions according to RFC 6298.

Suggested-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Signed-off-by: Yumei Huang &lt;yuhuang@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Rename "retrans" to "retries"</title>
<updated>2025-12-02T22:05:08+00:00</updated>
<author>
<name>Yumei Huang</name>
<email>yuhuang@redhat.com</email>
</author>
<published>2025-12-02T03:00:03+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=785214c6a781a3a8814c7e066899d855004d0d77'/>
<id>785214c6a781a3a8814c7e066899d855004d0d77</id>
<content type='text'>
Rename "retrans" to "retries" so it can be used for SYN retries.

Signed-off-by: Yumei Huang &lt;yuhuang@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp, flow: Replace per-connection in_epoll flag with an epollid in flow_common</title>
<updated>2025-10-30T14:32:50+00:00</updated>
<author>
<name>Laurent Vivier</name>
<email>lvivier@redhat.com</email>
</author>
<published>2025-10-21T21:01:13+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=dd5302dd7bf518aa2c50a9819ee06ea2d6fd0061'/>
<id>dd5302dd7bf518aa2c50a9819ee06ea2d6fd0061</id>
<content type='text'>
The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked
whether a connection was registered with epoll, not which epoll instance.
This limited flexibility for future multi-epoll support.

Replace the boolean with an epollid field in flow_common that identifies
which epoll instance the flow is registered with.
Use FLOW_EPOLLID_INVALID to indicate when a flow is not registered with
any epoll instance. An epoll_id_to_fd[] mapping table translates
epoll ids to their corresponding epoll file descriptors.

Add helper functions:
- flow_in_epoll() to check if a flow is registered with epoll
- flow_epollfd() to retrieve the epoll fd for a flow's thread
- flow_epollid_register() to register an epoll fd with an epollid
- flow_epollid_set() to set the epollid of a flow
- flow_epollid_clear() to reset the epoll id of a flow

This change also simplifies tcp_timer_ctl() and conn_flag_do() by removing
the need to pass the context 'c', since the epoll fd is now directly
accessible from the flow structure via flow_epollfd().

Add a defensive check at the beginning of tcp_flow_repair_queue() to
avoid a false positive with "make clang-tidy":
  error: The 1st argument to 'send' is &lt; 0 but should be &gt;= 0
   3230 |                 ssize_t rc = send(conn-&gt;sock, p, MIN(len, chunk), 0);

Signed-off-by: Laurent Vivier &lt;lvivier@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
</feed>
