passt, branch 2025_02_17.a1e48a0

test: Add migration tests

2025-02-17T07:29:36+00:00

PCAP=1 ./run migrate/bidirectional gives an overview of how the
whole thing is working.

Add 12 tests in total, checking basic functionality with and without
flows in both directions, with and without sockets in half-closed
states (both inbound and outbound), migration behaviour under traffic
flood, under traffic flood with > 253 flows, and strict checking of
sequences under flood with ramp patterns in both directions.

These tests need preparation and teardown for each case, as we need
to restore the source guest in its own context and pane before we can
test again. Eventually, we could consider alternating source and
target so that we don't need to restart from scratch every time, but
that's beyond the scope of this initial test implementation.

Trick: './run migrate/*' runs all the tests with preparation and
teardown steps.

Co-authored-by: David Gibson 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

migrate: Migrate TCP flows

2025-02-17T07:29:03+00:00

This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.

Co-authored-by: David Gibson 
Reviewed-by: David Gibson 
Tested-by: David Gibson 
Signed-off-by: Stefano Brivio

repair, passt-repair: Build and warning fixes for musl

2025-02-17T07:28:48+00:00

Checked against musl 1.2.5.

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

tcp_splice: A typo three years ago and SO_RCVLOWAT is gone

2025-02-17T07:28:45+00:00

In commit e5eefe77435a ("tcp: Refactor to use events instead of
states, split out spliced implementation"), this:

			if (!bitmap_isset(rcvlowat_set, conn - ts) &&
			    readlen > (long)c->tcp.pipe_size / 10) {

(note the !) became:

			if (conn->flags & lowat_set_flag &&
			    readlen > (long)c->tcp.pipe_size / 10) {

in the new tcp_splice_sock_handler().

We want to check, there, if we should set SO_RCVLOWAT, only if we
haven't set it already.

But, instead, we're checking if it's already set before we set it, so
we'll never set it, of course.

Fix the check and re-enable the functionality, which should give us
improved CPU utilisation in non-interactive cases where we are not
transferring at full pipe capacity.
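
As an illustration only, the corrected logic can be sketched as below.
This is not the actual tcp_splice.c code: the RCVLOWAT_SET flag, the
maybe_set_rcvlowat() helper and the pipe_size / 4 threshold are all
assumptions made up for this example. The point is the negated flag
check, so that the option is set once, and only once.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical flag: SO_RCVLOWAT was already set on this socket */
#define RCVLOWAT_SET	(1U << 0)

/* Set SO_RCVLOWAT at most once, and only if reads are large enough.
 * Returns 1 if the option was set by this call, 0 if skipped, -1 on error.
 */
static int maybe_set_rcvlowat(int s, unsigned *flags, long readlen,
			      long pipe_size)
{
	int lowat = (int)(pipe_size / 4);	/* illustrative value only */

	assert(flags);

	/* Skip if the flag is already set: this is the check that was
	 * inverted, so the option could never be set in the first place.
	 */
	if ((*flags & RCVLOWAT_SET) || readlen <= pipe_size / 10)
		return 0;

	if (setsockopt(s, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)))
		return -1;

	*flags |= RCVLOWAT_SET;
	return 1;
}
```

With the inverted check, the first branch would be taken on every call
with a clear flag, and setsockopt() would never be reached.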

Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_splice: Don't wake up on input data if we can't write it anywhere

2025-02-17T07:27:30+00:00

If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
given flow, it means that we're blocked, waiting for the receiver to
actually receive data, with a full pipe.

In that case, if we keep EPOLLIN set for the socket on the other side
(our receiving side), we'll get into a loop such as:

  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577

leading to 100% CPU usage, of course.

Drop EPOLLIN on our receiving side as long as we're waiting for
output readiness on the other side.
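
A minimal sketch of the resulting event mask selection follows. The
flag names OUT_WAIT_0 and OUT_WAIT_1 and the side_epoll_events() helper
are assumptions for the sake of the example, and don't necessarily
match the names or structure used in tcp_splice.c.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical flags: we're waiting for EPOLLOUT on side 0 or side 1 */
#define OUT_WAIT_0	(1U << 0)
#define OUT_WAIT_1	(1U << 1)

/* Compute the epoll event mask for the socket on the given side */
static uint32_t side_epoll_events(unsigned flags, int side)
{
	unsigned out_wait_self, out_wait_other;
	uint32_t events = 0;

	assert(side == 0 || side == 1);

	out_wait_self  = side ? OUT_WAIT_1 : OUT_WAIT_0;
	out_wait_other = side ? OUT_WAIT_0 : OUT_WAIT_1;

	/* We have data we couldn't write to this socket: wait for room */
	if (flags & out_wait_self)
		events |= EPOLLOUT;

	/* Only ask for input if we can currently write it to the other
	 * side, otherwise we would wake up without making any progress.
	 */
	if (!(flags & out_wait_other))
		events |= EPOLLIN;

	return events;
}
```

For example, with OUT_WAIT_1 set, side 0 gets an empty mask (no wakeups
on input), while side 1 waits for EPOLLOUT and still accepts input.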

Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

vhost_user: Clear ring address on GET_VRING_BASE

2025-02-15T04:34:21+00:00

GET_VRING_BASE stops the queue, clearing the call and kick fds.  However,
we don't clear vring.avail.  That means that if vu_queue_notify() is called
it won't realise the queue isn't ready and will die with an EBADFD.

We get this during migration, because for some reason, qemu reconfigures
the vhost-user device when a migration is triggered.  There's a window
between the GET_VRING_BASE and re-establishing the call fd where the
notify function can be called, causing a crash.
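
The idea can be sketched as follows, using a simplified, hypothetical
queue structure (struct vq_sketch and both helpers are invented for
this example and are not the vhost-user code itself): once the stop
path also clears the ring address, the notify path can tell that the
queue isn't ready and return early.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a virtqueue */
struct vq_sketch {
	void *avail;	/* available ring address, NULL if not ready */
	int call_fd;	/* -1 if not set */
	int kick_fd;	/* -1 if not set */
};

/* Stop the queue, as done on GET_VRING_BASE (closing fds is omitted) */
static void vq_stop(struct vq_sketch *vq)
{
	assert(vq);

	vq->call_fd = -1;
	vq->kick_fd = -1;
	vq->avail = NULL;	/* the additional step: clear ring address */
}

/* Returns 1 if we would signal the guest, 0 if the queue isn't ready,
 * -1 if we would try to signal on an invalid descriptor.
 */
static int vq_notify(const struct vq_sketch *vq)
{
	assert(vq);

	if (!vq->avail)
		return 0;

	if (vq->call_fd < 0)
		return -1;

	/* A real implementation would write to call_fd here */
	return 1;
}
```

Without the assignment to avail in vq_stop(), a notification arriving
between the stop and the new call fd would hit the -1 case.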

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values

2025-02-14T11:02:55+00:00

I added this a long long time ago because it dramatically improved
throughput back then: with rmem_max and wmem_max >= 4 MiB, we would
force send and receive buffer sizes for TCP sockets to the maximum
allowed value.

This effectively disables TCP auto-tuning, which would otherwise allow
us to exceed those limits, as crazy as it might sound. But in any
case, it made sense.

Now that we have zero (internal) copies on every path, plus vhost-user
support, it turns out that these settings are entirely obsolete. I get
substantially the same throughput in every test we perform, even with
very short durations (one second).

The settings are not just useless: they actually cause us quite some
trouble on guest state migration, because they lead to huge queues
that need to be moved as well.

Drop those settings.

Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Keep updating window and checking for socket data after FIN from guest

2025-02-14T09:04:39+00:00

Once we get a FIN segment from the container/guest, we enter something
resembling CLOSE_WAIT (from the perspective of the peer), but that
doesn't mean that we should stop processing window updates from the
guest and checking for socket data if the guest acknowledges
something.

If we don't do that, we can very easily run into a situation where we
send a burst of data to the tap, get a zero window update, along with
a FIN segment, because the flow is meant to be unidirectional, and now
the connection will be stuck forever, because we'll ignore updates.

Reproducer, server:

  $ pasta --config-net -t 9999 -- sh -c 'echo DONE | socat TCP-LISTEN:9997,shut-down STDIO'

and client:

  $ ./test/rampstream send 50000 | socat -u STDIN TCP:$LOCAL_ADDR:9997
  2025/02/13 09:14:45 socat[2997126] E write(5, 0x55f5dbf47000, 8192): Broken pipe

While at it, update the message string for the third passive close
state (which we see in this case): it's CLOSE_WAIT, not LAST_ACK.

Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

contrib/selinux: Enable mapping guest memory for libvirt guests

2025-02-14T09:04:39+00:00

This doesn't actually belong to passt's own policy: we should export
an interface and libvirt's policy should use it, because passt's
policy shouldn't be aware of svirt_image_t at all.

However, libvirt doesn't maintain its own policy, which makes policy
updates rather involved. Add this workaround to ensure --vhost-user
is working in combination with libvirt, as it might take ages before
we can get the proper rule in libvirt's policy.

Reported-by: Laine Stump 
Signed-off-by: Stefano Brivio

selinux: Add rules needed to run tests

2025-02-12T23:42:52+00:00

...other than being convenient, they might be reasonably
representative of typical stand-alone usage.

Signed-off-by: Stefano Brivio