passt/passt.c, branch 2025_06_11.0293c6f

udp: Use connect()ed sockets for initiating side

2025-04-07T19:24:36+00:00

Currently we have an asymmetry in how we handle UDP sockets.  For flows
where the target side is a socket, we create a new connect()ed socket
- the "reply socket" - specifically for that flow, used for sending and
receiving datagrams on that flow and only that flow.  For flows where the
initiating side is a socket, we continue to use the "listening" socket (or
rather, a dup() of it).  This has some disadvantages:

 * We need a hash lookup for every datagram on the listening socket in
   order to work out what flow it belongs to
 * The dup() keeps the socket alive even if automatic forwarding removes
   the listening socket.  However, the epoll data remains unchanged, and
   still contains the now stale original fd.  This causes bug 103.
 * We can't (easily) set flow-specific options on an initiating side
   socket, because that could affect other flows as well

Alter the code to use a connect()ed socket on the initiating side as well
as the target side.  There's no way to "clone and connect" the listening
socket (a loose equivalent of accept() for UDP), so we have to create a
new socket.  We have to bind() this socket before we connect() it, which
is allowed thanks to SO_REUSEADDR, but does leave a small window where it
could receive datagrams not intended for this flow.  For now we handle this
by simply discarding any datagrams received between bind() and connect(),
but I intend to improve this in a later patch.
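
As a rough illustration of the bind()-then-connect() sequence described
above, here is a minimal sketch.  It is not the code from this patch:
udp_flow_sock() is a hypothetical helper, it handles IPv4 only, and it
assumes the listening socket was itself bound with SO_REUSEADDR.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not passt's actual code): create a UDP socket bound to
 * the same local address as the listening socket, then connect() it to the
 * peer that initiated the flow.  Returns the new fd, or -1 on error.
 */
int udp_flow_sock(const struct sockaddr_in *local,
		  const struct sockaddr_in *peer)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	char discard[65536];
	int one = 1;

	if (s < 0)
		return -1;

	/* SO_REUSEADDR lets us bind() to an address the listening socket
	 * already uses, provided that socket set the option as well.
	 */
	if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) ||
	    bind(s, (const struct sockaddr *)local, sizeof(*local)) ||
	    connect(s, (const struct sockaddr *)peer, sizeof(*peer))) {
		close(s);
		return -1;
	}

	/* Datagrams queued between bind() and connect() might belong to
	 * other flows: drain the queue without blocking and drop them.
	 */
	while (recv(s, discard, sizeof(discard), MSG_DONTWAIT) >= 0)
		;

	return s;
}
```

Note that draining after connect() can also drop a datagram from the
right peer that happened to arrive in the window, which is the
limitation the message above says a later patch should improve.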

Link: https://bugs.passt.top/show_bug.cgi?id=103
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

packet: Remove redundant TAP_BUF_BYTES define

2025-03-12T22:08:33+00:00

Currently we define both TAP_BUF_BYTES and PKT_BUF_BYTES as essentially
the same thing.  They'll be different only if TAP_BUF_BYTES is negative,
which makes no sense.  So, remove TAP_BUF_BYTES and just use PKT_BUF_BYTES.

In addition, in most places we use this just to mean the size of the main
packet buffer (pkt_buf), for which we can use sizeof directly.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

conf: Move mode detection into helper function

2025-03-12T22:08:33+00:00

One of the first things we need to do is determine if we're in passt mode
or pasta mode.  Currently this is open-coded in main(), by examining
argv[0].  We want to complexify this a bit in future to cover vhost-user
mode as well.  Prepare for this, by moving the mode detection into a new
conf_mode() function.
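
For illustration only, detection based on argv[0] could look roughly
like the sketch below.  The enum, the function name mode_from_argv0()
and the substring match are assumptions made for this example; the
actual conf_mode() signature and logic are not shown in this message.

```c
#include <string.h>

/* Hypothetical mode enum, for this example only */
enum example_mode {
	EXAMPLE_MODE_PASST,
	EXAMPLE_MODE_PASTA,
};

/* Pick the mode from the last path component of argv[0]: anything containing
 * "pasta" (e.g. "pasta.avx2") selects pasta mode, everything else passt mode.
 */
enum example_mode mode_from_argv0(const char *argv0)
{
	const char *name = strrchr(argv0, '/');

	name = name ? name + 1 : argv0;

	return strstr(name, "pasta") ? EXAMPLE_MODE_PASTA : EXAMPLE_MODE_PASST;
}
```

Keeping this in one helper means a further mode only needs another
branch there, rather than more open-coded checks in main().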

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

treewide: Mark assorted functions static

2025-03-07T01:21:24+00:00

This marks static a number of functions which are only used in their .c
file, have no prototypes in a .h and were never intended to be globally
exposed.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

migrate: Migrate TCP flows

2025-02-17T07:29:03+00:00

This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.

Co-authored-by: David Gibson 
Reviewed-by: David Gibson 
Tested-by: David Gibson 
Signed-off-by: Stefano Brivio

Add interfaces and configuration bits for passt-repair

2025-02-12T18:47:28+00:00

In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, alongside the usual listener
socket.

When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, which sets the socket option values we
ask for.

To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.
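
To make the mechanism concrete, here is a minimal sketch of passing a
batch of file descriptors over a UNIX domain socket with SCM_RIGHTS.
It is not the interface added by this patch: send_fds(), recv_fds(),
BATCH_MAX and the one-byte command are all hypothetical, and the
receiver here only collects the descriptors instead of setting
TCP_REPAIR on them.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH_MAX	16	/* Hypothetical limit on fds per message */

/* Ancillary data buffer, aligned as struct cmsghdr requires */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int) * BATCH_MAX)];
	struct cmsghdr align;
};

/* Send n descriptors and a one-byte command in a single message on UNIX
 * domain socket s.  Returns 0 on success, -1 on failure.
 */
int send_fds(int s, const int *fds, int n, char cmd)
{
	struct iovec iov = { .iov_base = &cmd, .iov_len = sizeof(cmd) };
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	union fd_cmsg u;

	if (n < 1 || n > BATCH_MAX)
		return -1;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = CMSG_SPACE(sizeof(int) * n);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * n);
	memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * n);

	return sendmsg(s, &msg, 0) == sizeof(cmd) ? 0 : -1;
}

/* Receive one such message.  Returns the number of descriptors stored in
 * fds (at most max), or -1 on failure.
 */
int recv_fds(int s, int *fds, int max, char *cmd)
{
	struct iovec iov = { .iov_base = cmd, .iov_len = 1 };
	struct msghdr msg = { 0 };
	int tmp[BATCH_MAX], n, i;
	struct cmsghdr *cmsg;
	union fd_cmsg u;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(s, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	n = (int)((cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int));
	memcpy(tmp, CMSG_DATA(cmsg), sizeof(int) * n);

	if (n > max) {
		/* Don't leak descriptors the caller has no room for */
		for (i = 0; i < n; i++)
			close(tmp[i]);
		return -1;
	}

	memcpy(fds, tmp, sizeof(int) * n);
	return n;
}
```

Sending several descriptors per message is what keeps the number of
round trips low when many sockets need the same TCP_REPAIR setting.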

Co-authored-by: David Gibson 
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

migrate: Skeleton of live migration logic

2025-02-12T18:47:07+00:00

Introduce facilities for guest migration on top of the current
vhost-user infrastructure, moving vu_migrate() and related functions
to migrate.c.

Versioned migration stages define function pointers to be called on
source or target, or data sections that need to be transferred.

The migration header consists of a magic number, a version number for the
encoding, and a "compat_version" which represents the oldest version which
is compatible with the current one.  We don't use it yet, but that allows
for the future possibility of backwards compatible protocol extensions.
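
As a sketch of how such a header might be laid out and checked: the
struct below, the magic value and the acceptance rule are assumptions
for illustration, not the definitions from this patch, and byte order
conversion is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical magic value, for this example only */
#define EXAMPLE_MAGIC	0x0123456789abcdefULL

/* Hypothetical header layout: magic, encoding version, and the oldest
 * version the encoding is still compatible with.
 */
struct example_migrate_header {
	uint64_t magic;
	uint32_t version;
	uint32_t compat_version;
};

/* One plausible acceptance rule: a receiver implementing version 'ours' can
 * read the stream if 'ours' lies between compat_version and version.
 */
bool example_header_ok(const struct example_migrate_header *h, uint32_t ours)
{
	if (h->magic != EXAMPLE_MAGIC)
		return false;

	if (h->compat_version > h->version)
		return false;	/* Malformed: oldest can't exceed current */

	return ours >= h->compat_version && ours <= h->version;
}
```

With this rule, a stream written as version 3 with compat_version 2
would be accepted by receivers implementing version 2 or 3, and
rejected by a receiver that only implements version 1.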

Co-authored-by: David Gibson 
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

treewide: use _exit() over exit()

2025-02-05T14:19:02+00:00

In the podman CI I noticed many seccomp denials in our logs even though
tests passed:
comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e
syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000

That is futex (syscall 202 on x86_64) being called and blocked by the
pasta profile. After a few tries I managed to reproduce it locally, in
~20 min, with this loop:
while :;
  do podman run -d --network bridge quay.io/libpod/testimage:20241011 \
	sleep 100 && \
  sleep 10 && \
  podman rm -fa -t0
done

And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the
following stack trace:
Stack trace of thread 1:
  #0  0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b)
  #1  0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de)
  #2  0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
  #3  0x000055f31b78c50b n/a (n/a + 0x0)
  #4  0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
  #5  0x000055f31b78d5a2 n/a (n/a + 0x0)

Pasta got killed in exit(): it seems glibc is trying to take a lock when
running exit handlers, even though no exit handlers are defined.

Given no exit handlers are needed we can call _exit() instead. Unlike
exit(), this skips exit handlers and does not flush stdio streams,
which should be fine for the use here.
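
The behavioural difference between the two calls can be seen with a
small standalone sketch, unrelated to the pasta code itself.  The
helper exit_handler_runs() is hypothetical: it forks a child that
registers an atexit() handler, then terminates with either exit() or
_exit(), and reports whether the handler ran.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int pipe_wr = -1;

/* Exit handler: tell the parent we ran by writing one byte to the pipe */
static void handler(void)
{
	if (write(pipe_wr, "h", 1) < 0)
		_exit(1);
}

/* Fork a child that registers handler() and terminates with _exit() if
 * use_underscore_exit is set, with exit() otherwise.  Returns 1 if the
 * handler ran in the child, 0 if it didn't, -1 on error.
 */
int exit_handler_runs(int use_underscore_exit)
{
	ssize_t n;
	pid_t pid;
	int p[2];
	char c;

	if (pipe(p))
		return -1;

	pid = fork();
	if (pid < 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}

	if (!pid) {
		close(p[0]);
		pipe_wr = p[1];
		atexit(handler);

		if (use_underscore_exit)
			_exit(0);	/* Terminates at once, no handlers */

		exit(0);		/* Runs handlers, flushes stdio */
	}

	close(p[1]);
	n = read(p[0], &c, 1);	/* Returns 0 (EOF) if the handler didn't run */
	close(p[0]);
	waitpid(pid, NULL, 0);

	return n == 1;
}
```

Here exit_handler_runs(0) reports that the handler ran, while
exit_handler_runs(1) reports that it did not, since _exit() terminates
the process without going through the exit handler machinery.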

Based on the input from Stefano I did not change the test/doc programs
or qrap as they do not use seccomp filters.

Signed-off-by: Paul Holzinger 
Signed-off-by: Stefano Brivio

vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD command

2025-01-20T18:51:24+00:00

Set the file descriptor to use to transfer the
backend device state during migration.

Signed-off-by: Laurent Vivier 
[sbrivio: Fixed nits and coding style here and there]
Signed-off-by: Stefano Brivio

seccomp: Unconditionally allow accept(2) even if accept4(2) is present

2025-01-05T22:49:11+00:00

On Alpine Linux 3.21, passt aborts right away as soon as QEMU connects
to it.

Most likely, this has always been the case with musl, because since
musl commit dc01e2cbfb29 ("add fallback emulation for accept4 on old
kernels"), accept4() without flags is implemented using accept().

However, I guess that nobody realised earlier because it's typically
pasta(1) being used on musl-based distributions, and the only place
where we call accept4() without flags is tap_listen_handler().

Add accept() to the list of allowed system calls regardless of the
presence of accept4().

Reported-by: NN708 
Link: https://bugs.passt.top/show_bug.cgi?id=106
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson