passt/passt.h, branch bug165c

util: Extend sock_probe_mem() to sock_probe_features()

2025-12-02T22:06:25+00:00

sock_probe_mem() currently checks whether we're able to allocate large
socket buffers.  Extend it to also check whether the SO_BINDTODEVICE
socket option is available.  Rename to sock_probe_features() to reflect the
new functionality.

Signed-off-by: David Gibson 
[sbrivio: Add whitespace around "-" in sock_probe_features()]
Signed-off-by: Stefano Brivio

epoll_ctl: Extract epoll operations

2025-10-30T14:32:12+00:00

Centralize epoll_add() and epoll_del() helper functions into new
epoll_ctl.c/h files.

This also moves the union epoll_ref definition from passt.h to
epoll_ctl.h where it's more logically placed.

The new epoll_add() helper simplifies adding file descriptors to epoll
by taking an epoll_ref and events, handling error reporting
consistently across all call sites.

Signed-off-by: Laurent Vivier 
[sbrivio: Include epoll_ctl.h from netlink.c as it's now needed there]
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: forward external source MAC address through tap interface

2025-10-30T11:01:01+00:00

We forward the incoming mac address through the tap interface when
receiving incoming packets from network local hosts.

This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120

Signed-off-by: Jon Maloy 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

Add --stats option to display event statistics

2025-09-19T17:30:27+00:00

Introduce a new --stats DELAY option that displays event statistics
tables showing counts by epoll event type. Statistics are printed to
stderr with a minimum delay between updates, and only when events occur.

Signed-off-by: Laurent Vivier 
Signed-off-by: Stefano Brivio

Fix typo in doc comment

2025-09-06T11:15:19+00:00

Signed-off-by: Volker Diels-Grabsch 
Signed-off-by: Stefano Brivio

treewide: By default, don't quit source after migration, keep sockets open

2025-07-29T15:57:01+00:00

We are hitting an issue in the KubeVirt integration where some data is
still sent to the source instance even after migration is complete. As
we exit, the kernel closes our sockets and resets connections. The
resulting RST segments are sent to peers, effectively terminating
connections that were meanwhile migrated.

At the moment, this is not done intentionally, but in the future
KubeVirt might enable OVN-Kubernetes features where source and
destination nodes are explicitly getting mirrored traffic for a while,
in order to decrease migration downtime.

By default, don't quit after migration is completed on the source: the
previous behaviour can be enabled with the new, but deprecated,
--migrate-exit option. After migration (as source), the -1 / --one-off
option has no effect.

Also, by default, keep migrated TCP sockets open (in repair mode) as
long as we're running, and ignore events on any epoll descriptor
representing data channels. The previous behaviour can be enabled with
the new, equally deprecated, --migrate-no-linger option.

By keeping sockets open, and not exiting, we prevent the kernel
running on the source node to send out RST segments if further data
reaches us.

Reported-by: Nir Dothan 
Signed-off-by: Stefano Brivio

tap: Make size of pool_tap[46] purely a tuning parameter

2025-03-20T19:33:09+00:00

Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf (TAP_MSGS).  However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size.  But ETH_ZLEN is based on physical
constraints of Ethernet, which don't apply to our virtual devices.  It is
possible to generate a legitimate frame smaller than this, for example an
empty payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long.

Further more, the same limit applies for vhost-user, which is not limited
by the size of pkt_buf like the other backends.  In that case we don't even
have full control of the maximum buffer size, so we can't really calculate
how many packets could fit in there.

If we exceed do TAP_MSGS we'll drop packets, not just use more batches,
which is moderately bad.  The fact that this needs to be sized just so for
correctness not merely for tuning is a fairly non-obvious coupling between
different parts of the code.

To make this more robust, alter the tap code so it doesn't rely on
everything fitting in a single batch of TAP_MSGS packets, instead breaking
into multiple batches as necessary.  This leaves TAP_MSGS as purely a
tuning parameter, which we can freely adjust based on performance measures.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

Simplify sizing of pkt_buf

2025-03-12T22:08:33+00:00

We define the size of pkt_buf as large enough to hold 128 maximum size
packets.  Well, approximately, since we round down to the page size.  We
don't have any specific reliance on how many packets can fit in the buffer,
we just want it to be big enough to allow reasonable batching.  The
current definition relies on the confusingly named ETH_MAX_MTU and adds
in sizeof(uint32_t) rather non-obviously for the pseudo-physical header
used by the qemu socket (passt mode) protocol.

Instead, just define it to be 8MiB, which is what that complex calculation
works out to.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

packet: Remove redundant TAP_BUF_BYTES define

2025-03-12T22:08:33+00:00

Currently we define both TAP_BUF_BYTES and PKT_BUF_BYTES as essentially
the same thing.  They'll be different only if TAP_BUF_BYTES is negative,
which makes no sense.  So, remove TAP_BUF_BYTES and just use PKT_BUF_BYTES.

In addition, most places we use this to just mean the size of the main
packet buffer (pkt_buf) for which we can just directly use sizeof.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

conf: Use 0 instead of -1 as "unassigned" mtu value

2025-02-19T05:35:41+00:00

On the command line -m 0 means "don't assign an MTU" (letting the guest use
its default.  However, internally we use (c->mtu == -1) to represent that
state.  We use (c->mtu == 0) to represent "the user didn't specify on the
command line, so use the default" - but this is only used during conf(),
never afterwards.

This is unnecessarily confusing.  We can instead just initialise c->mtu to
its default (65520) before parsing options and use 0 on both the command
line and internally to represent the "don't assign" special case.  This
ensures that c->mtu is always 0..65535, so we can store it in a uint16_t
which is more natural.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio