passt/tcp_conn.h, branch 2025_06_11.0293c6f

migrate, tcp: Migrate RFC 7323 timestamp

2025-03-19T14:27:27+00:00

Currently our migration of the state of TCP sockets omits the RFC 7323
timestamp.  In some circumstances that can result in data sent from the
target machine not being received, because it is discarded on the peer due
to PAWS checking.
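
As a rough illustration of why a lost timestamp matters, the PAWS test from
RFC 7323 can be sketched as below. This is a simplified sketch, not the
kernel's implementation: the function name is hypothetical, and the RST and
idle-connection exceptions from the RFC are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified PAWS test: a segment is rejected if its timestamp value
 * (TSval) is older than the most recent one accepted from that sender
 * (TS.Recent), with the comparison done modulo 2^32.
 */
static bool paws_rejects(uint32_t ts_recent, uint32_t seg_tsval)
{
	return (int32_t)(seg_tsval - ts_recent) < 0;
}
```

If the target restarts its timestamp from a value below what the peer last
recorded, every segment fails this test until the clock catches up.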

Add code to dump and restore the timestamp across migration.

Link: https://bugs.passt.top/show_bug.cgi?id=115
Signed-off-by: David Gibson 
[sbrivio: Minor style fixes]
Signed-off-by: Stefano Brivio

tcp: Don't pass both flow pointer and flow index

2025-02-18T12:33:10+00:00

tcp_flow_migrate_source_ext() is passed both the index of the flow it
operates on and the pointer to the connection structure.  However, the
former is trivially derived from the latter.  Simplify the interface.
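
A minimal sketch of that derivation, assuming entries live in a single
array. The structure, table and helper names here are hypothetical
stand-ins, not the actual passt definitions.

```c
#include <stddef.h>

#define TABLE_SIZE 8	/* hypothetical table size */

struct conn {		/* hypothetical stand-in for a flow table entry */
	int sock;
};

static struct conn table[TABLE_SIZE];

/* Index of an entry, derived from its address within the table */
static size_t conn_idx(const struct conn *c)
{
	return (size_t)(c - table);
}
```

With such a helper, a function only needs the pointer as an argument.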

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Remove spurious prototype for tcp_flow_migrate_shrink_window

2025-02-18T12:33:08+00:00

This function existed in drafts of the migration code, but not the final
version.  Get rid of the prototype.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: More type safety for tcp_flow_migrate_target_ext()

2025-02-18T12:32:52+00:00

tcp_flow_migrate_target_ext() takes a raw union flow *, although it is TCP
specific, and requires a FLOW_TYPE_TCP entry.  Our usual convention is that
such functions should take a struct tcp_tap_conn * instead.  Convert it to
do so.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

migrate: Migrate TCP flows

2025-02-17T07:29:03+00:00

This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.
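
For reference, switching a socket to repair mode on Linux is done with the
TCP_REPAIR socket option, which needs CAP_NET_ADMIN. The helper below is
only a sketch with a hypothetical name, and says nothing about how or where
passt actually issues this call.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Switch a TCP socket into (on != 0) or out of repair mode.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int tcp_repair_set(int s, int on)
{
	int v = on ? 1 : 0;	/* 1: repair mode on, 0: off */

	return setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &v, sizeof(v));
}
```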

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open. As we can only dump
pending queues on the source while that socket is still open, this
implies an arbitrary limit on the size of the queues we can transfer.

Co-authored-by: David Gibson 
Reviewed-by: David Gibson 
Tested-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values

2025-02-14T11:02:55+00:00

I added this a long long time ago because it dramatically improved
throughput back then: with rmem_max and wmem_max >= 4 MiB, we would
force send and receive buffer sizes for TCP sockets to the maximum
allowed value.

This effectively disables TCP auto-tuning, which would otherwise allow
us to exceed those limits, as crazy as it might sound. But in any
case, it made sense.

Now that we have zero (internal) copies on every path, plus vhost-user
support, it turns out that these settings are entirely obsolete. I get
substantially the same throughput in every test we perform, even with
very short durations (one second).

The settings are not just useless: they actually cause us quite some
trouble on guest state migration, because they lead to huge queues
that need to be moved as well.

Drop those settings.

Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest

2025-01-16T20:15:33+00:00

There are essentially two cases covered by the (misnamed) STALLED flag:
in one case, we could send more data to the guest if it becomes
available, and in the other, we can't, because we filled the window.

If, in this second case, we keep EPOLLIN enabled, but never read from
the socket, we get short but CPU-annoying storms of EPOLLIN events,
upon which we reschedule the ACK timeout handler, never read from the
socket, go back to epoll_wait(), and so on:

  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1

also known as:

  29.1517: Flow 2 (TCP connection): timer expires in 2.000s
  29.1517: Flow 2 (TCP connection): timer expires in 2.000s
  29.1517: Flow 2 (TCP connection): timer expires in 2.000s

which, for some reason, becomes very visible with muvm and aria2c
downloading from a server nearby in parallel chunks.

That's because EPOLLIN isn't cleared if we don't read from the socket,
and even with EPOLLET, epoll_wait() will repeatedly wake us up until
we actually read something.

In this case, we don't want to subscribe to EPOLLIN at all: all we're
waiting for is an ACK segment from the guest. Differentiate this case
with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just
indicate that we're waiting for an ACK from the guest
(ACK_FROM_TAP_DUE), but also that we're blocked waiting for it.

If this flag is set before we set STALLED, EPOLLIN will be masked
while we set EPOLLET because of STALLED. Whenever we clear STALLED,
we also clear this flag.
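
The resulting event selection can be sketched roughly as follows. Flag
values and function names are hypothetical, and other events the real code
may subscribe to are omitted for brevity.

```c
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical flag values, for illustration only */
#define STALLED			(1U << 0)
#define ACK_FROM_TAP_DUE	(1U << 1)
#define ACK_FROM_TAP_BLOCKS	(1U << 2)

/* Socket-side epoll events for an established connection (simplified) */
static uint32_t sock_epoll_events(uint32_t flags)
{
	if (flags & STALLED) {
		if (flags & ACK_FROM_TAP_BLOCKS)
			return EPOLLET;		/* window full: mask EPOLLIN */

		return EPOLLIN | EPOLLET;	/* could send more data */
	}

	return EPOLLIN;
}

/* Clearing STALLED also clears ACK_FROM_TAP_BLOCKS */
static uint32_t clear_stalled(uint32_t flags)
{
	return flags & ~(STALLED | ACK_FROM_TAP_BLOCKS);
}
```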

This is definitely not elegant, but it's a minimal fix.

We can probably simplify this at a later point by having a category
of connection flags directly corresponding to epoll flags, and
dropping STALLED altogether, or, perhaps, always using EPOLLET (but
we need a mechanism to re-check sockets for pending data if we can't
temporarily write to the guest).

I suspect that this might also be implicated in
https://github.com/containers/podman/issues/23686, hence the Link:
tag. That doesn't necessarily mean this change fixes it (I can't
reproduce that issue).

Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio

tcp_splice: Eliminate SPLICE_V6 flag

2024-07-19T16:32:53+00:00

Since we're now constructing socket addresses based on information in the
flowside, we no longer need an explicit flag to tell if we're dealing with
an IPv4 or IPv6 connection.  Hence, drop the now unused SPLICE_V6 flag.

As well as just simplifying the code, this allows for possible future
extensions where we could splice an IPv4 connection to an IPv6 connection
or vice versa.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp, flow: Remove redundant information, repack connection structures

2024-07-19T16:32:41+00:00

Some information we explicitly store in the TCP connection is now
duplicated in the common flow structure.  Access it from there instead, and
remove it from the TCP specific structure.  With that done we can reorder
both the "tap" and "splice" TCP structures a bit to get better packing for
the new combined flow table entries.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_splice: Use parameterised macros for per-side event/flag bits

2024-07-17T13:30:11+00:00

Both the events and flags fields in tcp_splice_conn have several bits
which are per-side, e.g. OUT_WAIT_0 for side 0 and OUT_WAIT_1 for side 1.
This necessitates some rather awkward ternary expressions when we need
to get the relevant bit for a particular side.

Simplify this by using a parameterised macro for the bit values.  This
needs a ternary expression inside the macros, but makes the places we use
it substantially clearer.

That simplification in turn allows us to use a loop across each side to
implement several things which are currently open coded to do equivalent
things for each side in turn.
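
A minimal sketch of the pattern, assuming hypothetical bit values and a
hypothetical helper function. Only the OUT_WAIT_0 and OUT_WAIT_1 names come
from the description above.

```c
#include <stdint.h>

/* Hypothetical bit values for the per-side flags */
#define OUT_WAIT_0	(1U << 0)
#define OUT_WAIT_1	(1U << 1)

/* Bit for a given side: the ternary lives here, not at each call site */
#define OUT_WAIT(sidei_)	((sidei_) ? OUT_WAIT_1 : OUT_WAIT_0)

/* Set the flag for both sides with a loop instead of open coding each one */
static uint32_t set_out_wait_all(uint32_t events)
{
	unsigned sidei;

	for (sidei = 0; sidei < 2; sidei++)
		events |= OUT_WAIT(sidei);

	return events;
}
```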

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio