passt/tcp_buf.c, branch podman23739

flow, treewide: Promote priority of selected flow-linked messages

2026-06-09T02:28:20+00:00

Most of our flow-specific log messages are debug level for fear of flooding
the logs, even when they report real error conditions that might be of
significance.

Now that we have the mechanisms for log message rate limiting, we can do
better.  Promote many flow-related messages to warning or error level, with
rate limiting.  While we're there, add rate limiting to a handful of
existing warning or error level messages.

The general heuristic is to promote messages that report a failure which
is not something that should be triggered by the guest doing something
weird.  This mostly means failures from socket operations we expect to be
legitimate.

Adding the rate limiting means plumbing the 'now' timestamp through much
more of the code, hence the large churn.
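
As a rough illustration of why a limiter needs 'now' from its caller, here
is a minimal fixed-window sketch.  The names ratelimit_state, ratelimit_ok(),
RL_INTERVAL and RL_BURST are hypothetical and the window and burst values
are assumed; this is not the logging API used in the tree.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-message limiter state, not the structure used in passt */
struct ratelimit_state {
	time_t window_start;	/* Start of the current window, in seconds */
	unsigned count;		/* Messages emitted in the current window */
};

#define RL_INTERVAL	1	/* Window length in seconds (assumed value) */
#define RL_BURST	5	/* Messages allowed per window (assumed value) */

/* Return true if a message may be logged at time @now */
static bool ratelimit_ok(struct ratelimit_state *rl,
			 const struct timespec *now)
{
	/* Start a new window once the current one has expired */
	if (now->tv_sec - rl->window_start >= RL_INTERVAL) {
		rl->window_start = now->tv_sec;
		rl->count = 0;
	}

	if (rl->count >= RL_BURST)
		return false;

	rl->count++;
	return true;
}
```

Taking the timestamp as a parameter, rather than reading the clock inside
the limiter, is what forces every call site on the path to have 'now'
available.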

Signed-off-by: David Gibson

tcp: Encode checksum computation flags in a single parameter

2026-05-26T10:17:10+00:00

tcp_fill_headers() takes a pointer to a previously computed IPv4 header
checksum to avoid recalculating it when the payload length doesn't
change, and a separate bool to skip TCP checksum computation.

Replace both parameters with a single uint32_t csum_flags that encodes:
- IP4_CSUM (bit 31): compute IPv4 header checksum from scratch
- TCP_CSUM (bit 30): compute TCP checksum
- IP4_CMASK (low 16 bits): cached IPv4 header checksum value

When IP4_CSUM is not set, the cached checksum is extracted from the low
16 bits.  This is cleaner than the pointer-based approach, and also
avoids a potential dangling pointer issue: a subsequent patch makes
tcp_fill_headers() access ip4h via with_header(), which scopes it to a
temporary variable, so a pointer to ip4h->check would become invalid
after the with_header() block.
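
The encoding described above can be sketched as follows.  The flag names
and bit positions come from this message; the helpers csum_flags_cached()
and ip4_check_select() are hypothetical, and @computed merely stands in for
the result of a full IPv4 header checksum calculation.

```c
#include <stdbool.h>
#include <stdint.h>

#define IP4_CSUM	(UINT32_C(1) << 31)	/* Compute IPv4 checksum */
#define TCP_CSUM	(UINT32_C(1) << 30)	/* Compute TCP checksum */
#define IP4_CMASK	UINT32_C(0x0000ffff)	/* Cached IPv4 checksum */

/* Hypothetical helper: build flags that reuse a cached IPv4 checksum */
static uint32_t csum_flags_cached(uint16_t ip4_check, bool tcp_csum)
{
	return (tcp_csum ? TCP_CSUM : 0) | ip4_check;
}

/* Hypothetical helper: choose the value to store in the IPv4 header */
static uint16_t ip4_check_select(uint32_t csum_flags, uint16_t computed)
{
	if (csum_flags & IP4_CSUM)
		return computed;		/* Freshly computed checksum */

	return csum_flags & IP4_CMASK;		/* Cached value, by copy */
}
```

Because the cached checksum travels by value inside the flags word, nothing
refers back to the header it was originally read from.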

Suggested-by: David Gibson 
Signed-off-by: Laurent Vivier 
Reviewed-by: Jon Maloy 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Pass explicit data length to tcp_fill_headers()

2026-05-19T23:21:51+00:00

tcp_fill_headers() computed the TCP payload length from iov_tail_size(),
but with vhost-user multibuffer frames, the iov_tail will be larger than
the actual data.  Pass the data length explicitly so that IP total
length, pseudo-header, and checksum computations use the correct value.
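
A minimal sketch of the length computation, assuming IPv4 and TCP headers
without options.  The helper ip4_total_len() and the header length macros
are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define IP4_HLEN	20	/* IPv4 header without options (assumed) */
#define TCP_HLEN	20	/* TCP header without options (assumed) */

/* Hypothetical helper: IPv4 total length for a segment with @dlen payload
 * bytes.  @dlen is the actual data length, not the capacity of the buffers
 * backing the frame.
 */
static uint16_t ip4_total_len(size_t dlen)
{
	return (uint16_t)(IP4_HLEN + TCP_HLEN + dlen);
}
```

With 100 bytes of data in a larger buffer, the total length is 140 bytes
regardless of how much room the buffer has left.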

Signed-off-by: Laurent Vivier 
Reviewed-by: David Gibson 
Reviewed-by: Jon Maloy 
Signed-off-by: Stefano Brivio

treewide: Make some additional variables static

2026-05-11T22:04:08+00:00

Mark a number of extra variables local to a single module as static.

Signed-off-by: David Gibson 
Reviewed-by: Laurent Vivier 
Signed-off-by: Stefano Brivio

tcp: Extend tcp_send_flag() to send TCP keepalive segments

2026-02-24T23:17:41+00:00

TCP keepalives aren't technically a flag, but they are a zero-data segment
so they can be generated with only a small modification to
tcp_{buf,vu}_send_flag().  Implement this, using a new "pseudo-flag"
value (similar to DUP_ACK), KEEPALIVE.

Signed-off-by: David Gibson 
[sbrivio: Fix trivial merge conflict with 812cdb802c6e]
Signed-off-by: Stefano Brivio

tcp: Move tap header update out of tcp_fill_headers()

2026-02-15T01:52:42+00:00

tcp_fill_headers() currently calls tap_hdr_update() to set the frame
length in the tap-specific header.  This is backend-specific: the tap
backend needs this for its frame length header, but the vhost-user
backend passes NULL for the tap header and doesn't use it at all.

Remove the tap_hdr parameter from tcp_fill_headers() and instead return
the computed L2 frame length.  The tap backend caller,
tcp_l2_buf_fill_headers(), now calls tap_hdr_update() itself with the
returned length.  The vhost-user callers, tcp_vu_send_flag() and
tcp_vu_prepare(), no longer need to pass a NULL tap header.
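
The resulting split of responsibilities might look like the sketch below.
All names here are hypothetical stand-ins, the tap header layout (a 32-bit
length in network byte order) is an assumption, and headers are taken to be
IPv4 and TCP without options.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

#define L2_HLEN		14	/* Ethernet header */
#define IP4_HLEN	20	/* IPv4 header without options (assumed) */
#define TCP_HLEN	20	/* TCP header without options (assumed) */

/* Hypothetical stand-in for a tap-specific header with the frame length */
struct tap_len_hdr {
	uint32_t len;	/* Frame length, network byte order (assumed) */
};

/* Stand-in for the common header fill: returns the L2 frame length and
 * knows nothing about any backend.
 */
static size_t fill_headers_sketch(size_t dlen)
{
	return L2_HLEN + IP4_HLEN + TCP_HLEN + dlen;
}

/* Tap-side caller: the only place that touches the tap header */
static void tap_fill_sketch(struct tap_len_hdr *thdr, size_t dlen)
{
	size_t l2len = fill_headers_sketch(dlen);

	thdr->len = htonl((uint32_t)l2len);
}
```

A vhost-user style caller would use the return value of the common function
directly and never see a tap header.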

Signed-off-by: Laurent Vivier 
Signed-off-by: Stefano Brivio

tcp: Retransmit FINs like data segments

2026-01-31T02:56:50+00:00

RFC 9293 doesn't distinguish between regular data segments and FIN segments
for the purposes of retransmissions.  Our existing retransmission logic
will also work for FIN segments, except for one detail: we don't currently
set the ACK_FROM_TAP_DUE flag when we send a FIN.  Add the flag, so that
we'll properly retransmit FIN segments like data segments.

Remove the section from the theory of operation comment that describes a
different way of handling FIN timeouts which (a) isn't correct behaviour
and (b) doesn't appear to be implemented.

I've tested this by adding logic to suppress sending the FIN if retries <
some non-zero value.  We correctly resend the FIN and close normally after
the expected timeouts.

Link: https://bugs.passt.top/show_bug.cgi?id=195
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp, udp: Pad batched frames to 60 bytes (802.3 minimum) in non-vhost-user modes

2025-12-08T03:47:22+00:00

Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and
UDP_IOV_ETH_PAD for UDP, after the payload, make that point to a
zero-filled buffer, and send out a part of it if needed to reach
the minimum frame length given by 802.3, that is, 60 bytes altogether.

The frames we might need to pad are IPv4 only (the IPv6 header is
larger), and are typically TCP ACK segments but can also be small
data segments or datagrams.
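
The amount of padding can be sketched as below.  The helper eth_pad_len()
and the ETH_MIN_FRAME macro are hypothetical names; the 60-byte minimum is
the one given above.

```c
#include <stddef.h>

#define ETH_MIN_FRAME	60	/* 802.3 minimum frame length, as above */

/* Hypothetical helper: bytes of zero padding to send after the payload */
static size_t eth_pad_len(size_t l2len)
{
	return l2len < ETH_MIN_FRAME ? ETH_MIN_FRAME - l2len : 0;
}
```

For example, an IPv4 TCP ACK without options is 14 + 20 + 20 = 54 bytes and
needs 6 bytes of padding, while the IPv6 equivalent is 14 + 40 + 20 = 74
bytes and needs none.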

Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

tcp: forward external source MAC address through tap interface

2025-10-30T11:01:01+00:00

We forward the source MAC address through the tap interface when
receiving packets from hosts on the local network.

This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120

Signed-off-by: Jon Maloy 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Fix ACK sequence on FIN to tap

2025-10-07T20:22:27+00:00

If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
send a FIN segment to the guest / container acknowledging a sequence
number that's behind what we received so far, we won't have any
further trigger to send an updated ACK segment, as we are now
switching the epoll socket monitoring to edge-triggered mode.

To avoid this situation, in tcp_update_seqack_wnd(), we set the next
acknowledgement sequence to the current observed sequence, regardless
of what was acknowledged socket-side.

However, we don't necessarily call tcp_update_seqack_wnd() before
sending the FIN segment, which might lead to a situation, not observed
in practice, where we unnecessarily cause a retransmission at some
point after our FIN segment.

Avoid that by setting the ACK sequence to whatever we received from
the container / guest, before sending a FIN segment and switching to
EPOLLET.

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson