passt, branch 2026_05_26.038c51e

vhost_user: Offer VIRTIO_NET_F_GUEST_CSUM

2026-05-26T16:24:31+00:00

According to the virtio-net specification, when VIRTIO_NET_F_GUEST_CSUM
is negotiated, the device can set VIRTIO_NET_HDR_F_DATA_VALID in the
virtio-net header to indicate that packet checksums have been validated,
allowing the guest to skip verification. Without this feature, the device
must provide fully checksummed packets.

The vhost-user TCP and UDP paths were unconditionally skipping checksum
computation, regardless of whether GUEST_CSUM was negotiated. This
went undetected with Linux guests because Linux's virtio-net driver
honours VIRTIO_NET_HDR_F_DATA_VALID regardless of whether
VIRTIO_NET_F_GUEST_CSUM was negotiated, marking such packets as
CHECKSUM_UNNECESSARY and skipping verification.

iPXE, however, does not negotiate GUEST_CSUM, ignores the DATA_VALID
flag entirely, and always verifies checksums. This caused TCP
connections to fail: the SYN-ACK had a zero TCP checksum, iPXE rejected
it, and the connection timed out in SYN_RCVD.

Adding --pcap happened to mask the bug, because the pcap code path
forces checksum computation to ensure correct captures.

Offer VIRTIO_NET_F_GUEST_CSUM in the device features, and only skip
checksum computation when the guest has actually negotiated it. When
GUEST_CSUM is not negotiated, always compute valid checksums as required
by the specification.

We keep setting VIRTIO_NET_HDR_F_DATA_VALID unconditionally in
VU_HEADER: when GUEST_CSUM is negotiated, the flag lets the guest skip
checksum verification; when it is not, the spec says the guest should
ignore the flags field, so setting it is harmless.
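
As a minimal sketch of the decision described above (not the actual
passt code): the two constants carry the values given in the virtio
specification, while has_feature() and must_compute_csum() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the virtio specification */
#define VIRTIO_NET_F_GUEST_CSUM		1	/* feature bit number */
#define VIRTIO_NET_HDR_F_DATA_VALID	2	/* flag in virtio-net header */

/* Hypothetical helper: is feature bit 'bit' set in the negotiated set? */
static bool has_feature(uint64_t negotiated, unsigned bit)
{
	assert(bit < 64);
	return negotiated & (UINT64_C(1) << bit);
}

/* Hypothetical helper: checksums may be skipped only if the guest
 * negotiated GUEST_CSUM and no capture needs valid checksums.
 */
static bool must_compute_csum(uint64_t negotiated, bool pcap)
{
	return pcap || !has_feature(negotiated, VIRTIO_NET_F_GUEST_CSUM);
}
```

With this shape, a guest such as iPXE that negotiates no offload
features always gets fully checksummed frames, regardless of the
DATA_VALID flag in the header.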

Signed-off-by: Laurent Vivier 
Reviewed-by: David Gibson 
[sbrivio: Resolved conflicts, in particular with commit dec66c02b5e4
 ("udp: Pass iov_tail to udp_update_hdr4()/udp_update_hdr6()")]
Signed-off-by: Stefano Brivio

ip: Wrap CASE macro body in braces for pre-C23 compatibility

2026-05-26T10:25:14+00:00

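Compiling on RHEL8 (gcc-8.5) gives an error in ip.c:

  ip.c:88:3: error: a label can only be part of a statement and
  a declaration is not a statement

This is due to the use of static_assert directly after a case label:
before C23, static_assert is a declaration, not a statement, so it
cannot immediately follow a label.

The fix is to surround the macro body with {}.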
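
For illustration, a minimal sketch of the pattern, not the actual
macro from ip.c: the enum, NAME_MAX_SKETCH and proto_name_sketch() are
hypothetical. Without the braces, the static_assert would directly
follow the case label, which pre-C23 compilers reject.

```c
#include <assert.h>	/* static_assert before C23 */

#define NAME_MAX_SKETCH	8	/* hypothetical bound on name length */

enum proto_sketch { PROTO_TCP = 6, PROTO_UDP = 17 };

/* The braces turn the macro body into a compound statement, so the
 * label is followed by a statement rather than by a declaration.
 */
#define CASE(x)								\
	case PROTO_##x: {						\
		static_assert(sizeof(#x) <= NAME_MAX_SKETCH,		\
			      "protocol name too long");		\
		return #x;						\
	}

static const char *proto_name_sketch(int proto)
{
	switch (proto) {
	CASE(TCP)
	CASE(UDP)
	default:
		return "<unknown>";
	}
}
```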

Link: https://bugs.passt.top/show_bug.cgi?id=201
Fixes: 93c3e351f235 ("ip: Define a bound for the string returned by ipproto_name()")
Signed-off-by: Anshu Kumari 
Reviewed-by: David Gibson 
Reviewed-by: Laurent Vivier 
[sbrivio: Fixed coding style]
Signed-off-by: Stefano Brivio

tcp_splice: Simplify tracking of read/written bytes

2026-05-26T10:21:49+00:00

For each direction of each spliced connection, we keep track of how
many bytes we've read from one socket and written to the other.  However,
we never actually care about the absolute values of these, only the
difference between them, which represents how much data is currently "in
flight" in the splicing pipe.

Simplify the handling by having a single variable tracking the number of
bytes in the pipe.

As a bonus, the new scheme makes it clearer that we don't need to worry
about overflows: pending can never become larger than the maximum pipe
buffer size, well within 32 bits.

I _think_ the old scheme was safe in the case of overflow - again under
the assumption that read/written can never be further apart than the pipe
buffer size.  However, it's much harder to reason about this case.  It's
certainly plausible that an overflow could occur - sending 4GiB through
a local socket is entirely achievable.
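
A rough sketch of the two schemes, assuming unsigned 32-bit counters
(the message above doesn't state the actual types) and using
hypothetical names throughout. With unsigned arithmetic, the
difference is still correct modulo 2^32 after a wrap, but the single
counter makes the bound obvious.

```c
#include <assert.h>
#include <stdint.h>

/* New scheme: one counter per direction, bounded by the pipe size */
struct pipe_sketch {
	uint32_t pending;
};

static void pipe_filled(struct pipe_sketch *p, uint32_t n)
{
	p->pending += n;
}

static void pipe_drained(struct pipe_sketch *p, uint32_t n)
{
	assert(n <= p->pending);	/* Can't write more than we read */
	p->pending -= n;
}

/* Old scheme: two absolute counters; unsigned subtraction wraps
 * modulo 2^32, so the result holds as long as the real difference
 * fits in 32 bits.
 */
static uint32_t in_flight_old(uint32_t nread, uint32_t nwritten)
{
	return nread - nwritten;
}
```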

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_splice: Clean up flow control path for splice forwarding

2026-05-26T10:21:45+00:00

Splice forwarding can be blocked either waiting for data from one side
or waiting for space on the other.  For that reason,
tcp_splice_sock_handler() on either socket can forward data in either or
both directions, depending on whether we have EPOLLIN, EPOLLOUT or both
events.

The flow control for this is quite hard to follow though, since we forward
in one direction, then sometimes loop back with a goto to do it in the
other direction.  Simplify this by adding a tcp_splice_forward() function
with the logic to forward in one direction and calling it either once or
twice from tcp_splice_sock_handler().
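
The resulting shape might look roughly like this. It's a sketch only:
forward_sketch() and sock_handler_sketch() are hypothetical stand-ins,
and the mapping of events to directions is an assumption based on the
description above (EPOLLIN: data available from this side, EPOLLOUT:
space available towards this side).

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>

static unsigned forwarded[2];	/* Calls per source side, for the demo */

/* Stand-in for the logic forwarding data in one direction */
static void forward_sketch(unsigned fromside)
{
	forwarded[fromside]++;
}

/* Stand-in for the per-socket event handler: no goto, one or two
 * calls depending on which events we got.
 */
static void sock_handler_sketch(unsigned side, uint32_t events)
{
	assert(side <= 1);

	if (events & EPOLLIN)
		forward_sketch(side);	/* This side has data for us */
	if (events & EPOLLOUT)
		forward_sketch(!side);	/* This side has room again */
}
```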

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_splice: Avoid missing EOF recognition while forwarding

2026-05-26T10:21:41+00:00

tcp_splice_sock_handler() has an optimised path for the common case where
the amount we splice(2) into the pipe is exactly the same as the amount we
splice(2) out again.  If the pipe is empty at that point, we stop
forwarding until we get another epoll event.

However, via a subtle chain of events, this can cause a bug for a
half-closed connection.  Suppose the connection is already half-closed in
the other direction - that is, we've already called shutdown(SHUT_WR) on
the socket for which we're getting the event.  In this event we're getting
the last batch of data in the other direction, and also a FIN.  This can
result in EPOLLIN, EPOLLRDHUP and EPOLLHUP events simultaneously.

We read the last data from the socket and successfully splice it to the
other side.  Since there is no data in the pipe, we exit the forwarding
loop.  However, because we did read data, we don't set the eof flag.

Because we don't set eof, we don't (yet) propagate the FIN to the other
side, or set FIN_SENT_(!fromsidei).  Therefore we don't (yet) recognize
this as a clean termination and set the CLOSING flag.  We would correct
this when we get our next event, however before we can do so we process
the EPOLLHUP event.  Because we haven't recognized this as a clean close
we assume it is an abrupt close and send an RST to the other side.

To avoid this, don't stop attempting to forward data on this path.
Continue for at least one more loop.  If we're at EOF, we'll recognize it
on the next splice(2).  If not, it gives us an opportunity to forward more
data without returning to the main epoll loop.
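
A simplified model of the effect, using recv() on a Unix socket pair
instead of splice(2) through a pipe, with hypothetical function names:
the peer sends its last data and then shuts down its write side. If we
stop after the first successful read, we don't see the EOF; if we try
once more, the next read returns 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Read from s until EOF or until nothing is left; returns true if
 * EOF was seen.  stop_early mimics the old shortcut.
 */
static bool drain_sketch(int s, bool stop_early)
{
	char buf[64];

	for (;;) {
		ssize_t n = recv(s, buf, sizeof(buf), MSG_DONTWAIT);

		if (n == 0)
			return true;	/* EOF: peer sent FIN */
		if (n < 0)
			return false;	/* EAGAIN or error */
		if (stop_early)
			return false;	/* Old path: wait for next event */
	}
}

/* Send 4 bytes followed by FIN, then drain the other end */
static bool eof_seen_sketch(bool stop_early)
{
	int sv[2], rc;
	bool eof;

	rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
	assert(rc == 0);
	rc = write(sv[0], "last", 4) == 4 && !shutdown(sv[0], SHUT_WR);
	assert(rc);

	eof = drain_sketch(sv[1], stop_early);
	close(sv[0]);
	close(sv[1]);
	return eof;
}
```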

Reported-by: Paul Holzinger 
Link: https://bugs.passt.top/show_bug.cgi?id=202
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_splice: Improve error reporting

2026-05-26T10:21:11+00:00

A number of things can, at least theoretically, go wrong when forwarding
data across a spliced connection.  We generally handle this by resetting
the connection on both sides.  However, in many cases we don't log any
message about why the connection was reset, which can make it hard to
debug why this is happening.

Add a bunch of debug and error logging to make this easier to figure out.
We ratelimit for safety, which requires some tweaks to make the ratelimit
logic work with the flow specific log functions.
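
For reference, a generic fixed-window rate limiter could be sketched
as below. This is not the ratelimit logic from the tree: the struct,
the function and the burst/interval parameters are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

struct ratelimit_sketch {
	time_t window_start;	/* Start of the current window */
	unsigned count;		/* Messages allowed in this window */
};

/* Returns true if a message may be logged at time 'now': at most
 * 'burst' messages are allowed per 'interval' seconds.
 */
static bool ratelimit_allow(struct ratelimit_sketch *rl, time_t now,
			    unsigned burst, time_t interval)
{
	assert(burst > 0 && interval > 0);

	if (now - rl->window_start >= interval) {
		rl->window_start = now;		/* Open a new window */
		rl->count = 0;
	}

	if (rl->count >= burst)
		return false;			/* Suppress this message */

	rl->count++;
	return true;
}
```

A caller would wrap each log call in a check such as
ratelimit_allow(&rl, now, 5, 10), keeping one state structure per
message type.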

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_vu: Support multibuffer frames in tcp_vu_send_flag()

2026-05-26T10:18:24+00:00

Build the Ethernet, IP, and TCP headers on the stack instead of
directly in the buffer via pointer casts, then write them into the
iovec with IOV_PUSH_HEADER().  This mirrors the approach already used
in tcp_vu_prepare() and udp_vu_prepare().

Remove the vu_eth(), vu_ip(), vu_payloadv4() and vu_payloadv6() helpers
from vu_common.h, as they are no longer used anywhere.

Introduce tcp_vu_send_dup() to handle DUP_ACK duplication using
vu_collect() and iov_memcpy() instead of a plain memcpy(), so that
the duplicated frame is also properly scattered across multiple iovecs.
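
To illustrate why a plain memcpy() is not enough once a frame spans
several buffers, here is a minimal sketch of a copy between two iovec
arrays with different layouts. It is not the iov_memcpy() from the
tree: iov_copy_sketch() and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Copy bytes from src[] to dst[], crossing buffer boundaries on both
 * sides as needed.  Returns the number of bytes copied.
 */
static size_t iov_copy_sketch(const struct iovec *dst, size_t dst_cnt,
			      const struct iovec *src, size_t src_cnt)
{
	size_t di = 0, si = 0, doff = 0, soff = 0, copied = 0;

	assert(dst && src);

	while (di < dst_cnt && si < src_cnt) {
		size_t dleft = dst[di].iov_len - doff;
		size_t sleft = src[si].iov_len - soff;
		size_t n = dleft < sleft ? dleft : sleft;

		if (n)
			memcpy((char *)dst[di].iov_base + doff,
			       (const char *)src[si].iov_base + soff, n);

		copied += n;
		doff += n;
		soff += n;

		if (doff == dst[di].iov_len) {	/* Destination buffer full */
			di++;
			doff = 0;
		}
		if (soff == src[si].iov_len) {	/* Source buffer consumed */
			si++;
			soff = 0;
		}
	}

	return copied;
}
```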

Signed-off-by: Laurent Vivier 
Reviewed-by: Jon Maloy 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_vu: Support multibuffer frames in tcp_vu_sock_recv()

2026-05-26T10:18:02+00:00

Previously, tcp_vu_sock_recv() assumed a 1:1 mapping between virtqueue
elements and iovecs (one iovec per element), enforced by an ASSERT.
This prevented the use of virtqueue elements with multiple buffers
(e.g. when mergeable rx buffers are not negotiated and headers are
provided in a separate buffer).

Introduce a struct vu_frame to track per-frame metadata: the range of
elements and iovecs that make up each frame, and the frame's total size.
This replaces the head[] array which only tracked element indices.

A separate iov_msg[] array is built for recvmsg() by cloning the data
portions (after stripping headers) using iov_tail helpers.

Frame truncation after recvmsg() then walks the frame and element
arrays to adjust iovec counts and element counts.
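
The iovec side of such a truncation can be sketched as follows, under
the assumption that we simply want to shrink an array of buffers to
the number of bytes actually received. iov_trim_sketch() is a
hypothetical name, and the real code additionally has to adjust the
per-frame element counts.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Shrink iov[] so that it describes exactly 'len' bytes (or less, if
 * the array is shorter than that).  Returns the number of iovecs
 * still in use.
 */
static size_t iov_trim_sketch(struct iovec *iov, size_t cnt, size_t len)
{
	size_t i;

	assert(iov || !cnt);

	for (i = 0; i < cnt && len > 0; i++) {
		if (iov[i].iov_len > len)
			iov[i].iov_len = len;	/* Last, partial buffer */
		len -= iov[i].iov_len;
	}

	return i;
}
```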

Signed-off-by: Laurent Vivier 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_vu: Build headers on the stack and write them into the iovec

2026-05-26T10:17:16+00:00

tcp_vu_prepare() currently assumes the first iovec element provided by
the guest is large enough to hold all L2-L4 headers, and builds them
in place via pointer casts into iov[0].iov_base.  This assumption is
enforced by an assert().

Since the headers in the buffer are uninitialized anyway, we can just
as well build the Ethernet, IP, and TCP headers on the stack instead,
and write them into the iovec with IOV_PUSH_HEADER().  This mirrors the
approach already used in udp_vu_prepare(), and prepares for support of
elements with multiple iovecs.
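
A minimal sketch of the idea, with hypothetical names: the header is
filled in as an ordinary local variable, then copied into the iovec
array at a given offset, so it no longer matters whether the first
buffer is large enough to hold it. iov_write_sketch() stands in for
what IOV_PUSH_HEADER() does and is not its actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical Ethernet-like header, 14 bytes without padding */
struct eth_sketch {
	uint8_t dst[6];
	uint8_t src[6];
	uint16_t proto;
};

/* Write len bytes from buf into iov[] starting at byte offset off,
 * splitting across buffers as needed.  Returns bytes written.
 */
static size_t iov_write_sketch(const struct iovec *iov, size_t cnt,
			       size_t off, const void *buf, size_t len)
{
	const char *src = buf;
	size_t done = 0, i;

	assert(buf || !len);

	for (i = 0; i < cnt && done < len; i++) {
		size_t room, n;

		if (off >= iov[i].iov_len) {	/* Entirely before offset */
			off -= iov[i].iov_len;
			continue;
		}

		room = iov[i].iov_len - off;
		n = len - done < room ? len - done : room;
		memcpy((char *)iov[i].iov_base + off, src + done, n);
		done += n;
		off = 0;
	}

	return done;
}
```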

Signed-off-by: Laurent Vivier 
Reviewed-by: Jon Maloy 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Encode checksum computation flags in a single parameter

2026-05-26T10:17:10+00:00

tcp_fill_headers() takes a pointer to a previously computed IPv4 header
checksum to avoid recalculating it when the payload length doesn't
change, and a separate bool to skip TCP checksum computation.

Replace both parameters with a single uint32_t csum_flags that encodes:
- IP4_CSUM (bit 31): compute IPv4 header checksum from scratch
- TCP_CSUM (bit 30): compute TCP checksum
- IP4_CMASK (low 16 bits): cached IPv4 header checksum value

When IP4_CSUM is not set, the cached checksum is extracted from the low
16 bits.  This is cleaner than the pointer-based approach, and also
avoids a potential dangling pointer issue: a subsequent patch makes
tcp_fill_headers() access ip4h via with_header(), which scopes it to a
temporary variable, so a pointer to ip4h->check would become invalid
after the with_header() block.
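
The encoding can be sketched as below. The three macro names and bit
positions are the ones listed above, while the two helper functions
are hypothetical and only show how a caller might pack and unpack the
value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IP4_CSUM	(UINT32_C(1) << 31)	/* Compute IPv4 checksum */
#define TCP_CSUM	(UINT32_C(1) << 30)	/* Compute TCP checksum */
#define IP4_CMASK	UINT32_C(0x0000ffff)	/* Cached IPv4 checksum */

/* Hypothetical helper: reuse a cached IPv4 header checksum, passed by
 * value instead of by pointer.
 */
static uint32_t csum_flags_cached(uint16_t ip4_check, bool tcp_csum)
{
	return (tcp_csum ? TCP_CSUM : 0) | ip4_check;
}

/* Hypothetical helper: extract the cached checksum.  Only meaningful
 * if IP4_CSUM is not set.
 */
static uint16_t csum_flags_ip4_check(uint32_t csum_flags)
{
	assert(!(csum_flags & IP4_CSUM));
	return csum_flags & IP4_CMASK;
}
```

Since the cached checksum travels inside the flags word itself, there
is no pointer left that could outlive the header it points into.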

Suggested-by: David Gibson 
Signed-off-by: Laurent Vivier 
Reviewed-by: Jon Maloy 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio