passt/passt.h, branch 2025_06_11.0293c6f

tap: Make size of pool_tap[46] purely a tuning parameter

2025-03-20T19:33:09+00:00

Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf (TAP_MSGS).  However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size.  But ETH_ZLEN is based on physical
constraints of Ethernet, which don't apply to our virtual devices.  It is
possible to generate a legitimate frame smaller than this, for example an
empty payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long.

Further more, the same limit applies for vhost-user, which is not limited
by the size of pkt_buf like the other backends.  In that case we don't even
have full control of the maximum buffer size, so we can't really calculate
how many packets could fit in there.

If we exceed do TAP_MSGS we'll drop packets, not just use more batches,
which is moderately bad.  The fact that this needs to be sized just so for
correctness not merely for tuning is a fairly non-obvious coupling between
different parts of the code.

To make this more robust, alter the tap code so it doesn't rely on
everything fitting in a single batch of TAP_MSGS packets, instead breaking
into multiple batches as necessary.  This leaves TAP_MSGS as purely a
tuning parameter, which we can freely adjust based on performance measures.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

Simplify sizing of pkt_buf

2025-03-12T22:08:33+00:00

We define the size of pkt_buf as large enough to hold 128 maximum size
packets.  Well, approximately, since we round down to the page size.  We
don't have any specific reliance on how many packets can fit in the buffer,
we just want it to be big enough to allow reasonable batching.  The
current definition relies on the confusingly named ETH_MAX_MTU and adds
in sizeof(uint32_t) rather non-obviously for the pseudo-physical header
used by the qemu socket (passt mode) protocol.

Instead, just define it to be 8MiB, which is what that complex calculation
works out to.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

packet: Remove redundant TAP_BUF_BYTES define

2025-03-12T22:08:33+00:00

Currently we define both TAP_BUF_BYTES and PKT_BUF_BYTES as essentially
the same thing.  They'll be different only if TAP_BUF_BYTES is negative,
which makes no sense.  So, remove TAP_BUF_BYTES and just use PKT_BUF_BYTES.

In addition, most places we use this to just mean the size of the main
packet buffer (pkt_buf) for which we can just directly use sizeof.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

conf: Use 0 instead of -1 as "unassigned" mtu value

2025-02-19T05:35:41+00:00

On the command line -m 0 means "don't assign an MTU" (letting the guest use
its default.  However, internally we use (c->mtu == -1) to represent that
state.  We use (c->mtu == 0) to represent "the user didn't specify on the
command line, so use the default" - but this is only used during conf(),
never afterwards.

This is unnecessarily confusing.  We can instead just initialise c->mtu to
its default (65520) before parsing options and use 0 on both the command
line and internally to represent the "don't assign" special case.  This
ensures that c->mtu is always 0..65535, so we can store it in a uint16_t
which is more natural.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

Add interfaces and configuration bits for passt-repair

2025-02-12T18:47:28+00:00

In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.

When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, who sets the socket option values we
ask for.

To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.

Co-authored-by: David Gibson 
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

migrate: Skeleton of live migration logic

2025-02-12T18:47:07+00:00

Introduce facilities for guest migration on top of vhost-user
infrastructure.  Add migration facilities based on top of the current
vhost-user infrastructure, moving vu_migrate() and related functions
to migrate.c.

Versioned migration stages define function pointers to be called on
source or target, or data sections that need to be transferred.

The migration header consists of a magic number, a version number for the
encoding, and a "compat_version" which represents the oldest version which
is compatible with the current one.  We don't use it yet, but that allows
for the future possibility of backwards compatible protocol extensions.

Co-authored-by: David Gibson 
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

dhcp, dhcpv6: Add hostname and client fqdn ops

2025-02-10T17:30:24+00:00

Both DHCPv4 and DHCPv6 has the capability to pass the hostname to
clients, the DHCPv4 uses option 12 (hostname) while the DHCPv6 uses option 39
(client fqdn), for some virt deployments like kubevirt is expected to
have the VirtualMachine name as the guest hostname.

This change add the following arguments:
 - -H --hostname NAME to configure the hostname DHCPv4 option(12)
 - --fqdn NAME to configure client fqdn option for both DHCPv4(81) and
   DHCPv6(39)

Signed-off-by: Enrique Llorente 
Signed-off-by: Stefano Brivio

pasta: make it possible to disable socket splicing

2024-12-11T00:47:37+00:00

During testing it is sometimes useful to force traffic which would
normally be forwared by socket splicing through the tap interface.

In this commit, we add a command switch enabling such funtionality
for inbound local traffic.

For outbound local traffic this is much trickier, if even possible,
so leave that for a later commit.

Suggested-by: David Gibson 
Signed-off-by: Jon Maloy 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

vhost-user: add vhost-user

2024-11-27T15:47:32+00:00

add virtio and vhost-user functions to connect with QEMU.

  $ ./passt --vhost-user

and

  # qemu-system-x86_64 ... -m 4G \
        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
        -numa node,memdev=memfd0 \
        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
        -netdev vhost-user,id=netdev0,chardev=chr0 \
        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
        ...

Signed-off-by: Laurent Vivier 
Reviewed-by: David Gibson 
[sbrivio: as suggested by lvivier, include 
 before including  as C libraries such as musl
 __UAPI_DEF_ETHHDR in  if they already have
 a definition of struct ethhdr]
Signed-off-by: Stefano Brivio

treewide: Introduce 'local mode' for disconnected setups

2024-11-27T04:16:38+00:00

There are setups where no host interface is available or configured
at all, intentionally or not, temporarily or not, but users expect
(Podman) containers to run in any case as they did with slirp4netns,
and we're now getting reports that we broke such setups at a rather
alarming rate.

To this end, if we don't find any usable host interface, instead of
exiting:

- for IPv4, use 169.254.2.1 as guest/container address and 169.254.2.2
  as default gateway

- for IPv6, don't assign any address (forcibly disable DHCPv6), and
  use the *first* link-local address we observe to represent the
  guest/container. Advertise fe80::1 as default gateway

- use 'tap0' as default interface name for pasta

Change ifi4 and ifi6 in struct ctx to int and accept a special -1
value meaning that no host interface was selected, but the IP family
is enabled. The fact that the kernel uses unsigned int values for
those is not an issue as 1. one can't create so many interfaces
anyway and 2. we otherwise handle those values transparently.

Fix a botched conditional in conf_print() to actually skip printing
DHCPv6 information if DHCPv6 is disabled (and skip printing NDP
information if NDP is disabled).

Link: https://github.com/containers/podman/issues/24614
Signed-off-by: Stefano Brivio