passt/tcp_conn.h, branch 2026_01_17.81c97f6

tcp: Adaptive interval based on RTT for socket-side acknowledgement checks

2025-12-08T08:15:36+00:00

A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (> 100 Mbps)
bandwidth and non-local but low (< 5 ms) RTT.

Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time while
we could acknowledge more data, instead.

Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).

Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.

This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint to
the same machine (while possibly in a different namespace) if the
RTT is reported as 10 us or less.

This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.

Then, if the RTT is "low", we don't really need a timer to
acknowledge data as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store addresses of local destination, so the low RTT property of a
connection might toggle frequently. Because of this, it's actually
helpful to always have the RTT approximation stored.

This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.

Signed-off-by: Stefano Brivio

tcp: Clamp the retry timeout

2025-12-02T22:05:08+00:00

Clamp the TCP retry timeout as Linux kernel does. If a retry occurs
during the handshake and the RTO is below 3 seconds, re-initialise
it to 3 seconds for data retransmissions according to RFC 6298.

Suggested-by: Stefano Brivio 
Signed-off-by: Yumei Huang 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Rename "retrans" to "retries"

2025-12-02T22:05:08+00:00

Rename "retrans" to "retries" so it can be used for SYN retries.

Signed-off-by: Yumei Huang 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp, flow: Replace per-connection in_epoll flag with an epollid in flow_common

2025-10-30T14:32:50+00:00

The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked
whether a connection was registered with epoll, not which epoll instance.
This limited flexibility for future multi-epoll support.

Replace the boolean with an epollid field in flow_common that identifies
which epoll instance the flow is registered with.
Use FLOW_EPOLLID_INVALID to indicate when a flow is not registered with
any epoll instance. An epoll_id_to_fd[] mapping table translates
epoll ids to their corresponding epoll file descriptors.

Add helper functions:
- flow_in_epoll() to check if a flow is registered with epoll
- flow_epollfd() to retrieve the epoll fd for a flow's thread
- flow_epollid_register() to register an epoll fd with an epollid
- flow_epollid_set() to set the epollid of a flow
- flow_epollid_clear() to reset the epoll id of a flow

This change also simplifies tcp_timer_ctl() and conn_flag_do() by removing
the need to pass the context 'c', since the epoll fd is now directly
accessible from the flow structure via flow_epollfd().

Add a defensive check at the beginning of tcp_flow_repair_queue() to
avoid a false positive with "make clang-tidy":
  error: The 1st argument to 'send' is < 0 but should be >= 0
   3230 |                 ssize_t rc = send(conn->sock, p, MIN(len, chunk), 0);

Signed-off-by: Laurent Vivier 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

treewide: By default, don't quit source after migration, keep sockets open

2025-07-29T15:57:01+00:00

We are hitting an issue in the KubeVirt integration where some data is
still sent to the source instance even after migration is complete. As
we exit, the kernel closes our sockets and resets connections. The
resulting RST segments are sent to peers, effectively terminating
connections that were meanwhile migrated.

At the moment, this is not done intentionally, but in the future
KubeVirt might enable OVN-Kubernetes features where source and
destination nodes are explicitly getting mirrored traffic for a while,
in order to decrease migration downtime.

By default, don't quit after migration is completed on the source: the
previous behaviour can be enabled with the new, but deprecated,
--migrate-exit option. After migration (as source), the -1 / --one-off
option has no effect.

Also, by default, keep migrated TCP sockets open (in repair mode) as
long as we're running, and ignore events on any epoll descriptor
representing data channels. The previous behaviour can be enabled with
the new, equally deprecated, --migrate-no-linger option.

By keeping sockets open, and not exiting, we prevent the kernel
running on the source node to send out RST segments if further data
reaches us.

Reported-by: Nir Dothan 
Signed-off-by: Stefano Brivio

migrate, tcp: Migrate RFC 7323 timestamp

2025-03-19T14:27:27+00:00

Currently our migration of the state of TCP sockets omits the RFC 7323
timestamp.  In some circumstances that can result in data sent from the
target machine not being received, because it is discarded on the peer due
to PAWS checking.

Add code to dump and restore the timestamp across migration.

Link: https://bugs.passt.top/show_bug.cgi?id=115
Signed-off-by: David Gibson 
[sbrivio: Minor style fixes]
Signed-off-by: Stefano Brivio

tcp: Don't pass both flow pointer and flow index

2025-02-18T12:33:10+00:00

tcp_flow_migrate_source_ext() is passed both the index of the flow it
operates on and the pointer to the connection structure.  However, the
former is trivially derived from the latter.  Simplify the interface.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Remove spurious prototype for tcp_flow_migrate_shrink_window

2025-02-18T12:33:08+00:00

This function existed in drafts of the migration code, but not the final
version.  Get rid of the prototype.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: More type safety for tcp_flow_migrate_target_ext()

2025-02-18T12:32:52+00:00

tcp_flow_migrate_target_ext() takes a raw union flow *, although it is TCP
specific, and requires a FLOW_TYPE_TCP entry.  Our usual convention is that
such functions should take a struct tcp_tap_conn * instead.  Convert it to
do so.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

migrate: Migrate TCP flows

2025-02-17T07:29:03+00:00

This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.

Co-authored-by: David Gibson 
Reviewed-by: David Gibson 
Tested-by: David Gibson 
Signed-off-by: Stefano Brivio