passt/tcp_buf.c, branch bug205

treewide: Make some additional variables static

2026-05-11T22:04:08+00:00

Mark a number of extra variables local to a single module as static.

Signed-off-by: David Gibson 
Reviewed-by: Laurent Vivier 
Signed-off-by: Stefano Brivio

tcp: Extend tcp_send_flag() to send TCP keepalive segments

2026-02-24T23:17:41+00:00

TCP keepalives aren't technically a flag, but they are a zero-data segment
so they can be generated with only a small modification to
tcp_{buf,vu}_send_flag().  Implement this, using a new "pseudo-flag"
value (similar to DUP_ACK), KEEPALIVE.
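
A minimal sketch of the pseudo-flag idea, not the actual passt code: the
names and bit values below are hypothetical. It assumes pseudo-flags live
above the eight on-wire TCP flag bits, and that a keepalive is sent as a
zero-data ACK with sequence SND.NXT - 1, as RFC 9293 describes.

```c
#include <assert.h>
#include <stdint.h>

#define TH_ACK_SKETCH		0x10		/* real TCP ACK bit */
#define DUP_ACK_SKETCH		(1U << 8)	/* pseudo-flag, never on the wire */
#define KEEPALIVE_SKETCH	(1U << 9)	/* pseudo-flag, never on the wire */

/* Hypothetical result type: what ends up in the TCP header */
struct flag_seg_sketch {
	uint32_t seq;		/* sequence number for the segment */
	uint8_t th_flags;	/* flag bits that actually go on the wire */
};

/* Hypothetical helper: derive header fields for a zero-data segment */
static struct flag_seg_sketch flag_seg_prepare(uint32_t snd_nxt,
					       unsigned flags)
{
	struct flag_seg_sketch seg;

	/* Assumption: keepalives are never combined with FIN, SYN or RST */
	assert(!(flags & KEEPALIVE_SKETCH) || !(flags & 0x07));

	/* A keepalive reuses the last sequence the peer already acked */
	seg.seq = (flags & KEEPALIVE_SKETCH) ? snd_nxt - 1 : snd_nxt;

	/* Strip pseudo-flags: only the low eight bits are real TCP flags */
	seg.th_flags = (uint8_t)(flags & 0xff);
	if (flags & KEEPALIVE_SKETCH)
		seg.th_flags |= TH_ACK_SKETCH;

	return seg;
}
```

Calling flag_seg_prepare(1000, KEEPALIVE_SKETCH) in this sketch yields
sequence 999 with only the ACK bit set.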

Signed-off-by: David Gibson 
[sbrivio: Fix trivial merge conflict with 812cdb802c6e]
Signed-off-by: Stefano Brivio

tcp: Move tap header update out of tcp_fill_headers()

2026-02-15T01:52:42+00:00

tcp_fill_headers() currently calls tap_hdr_update() to set the frame
length in the tap-specific header.  This is backend-specific: the tap
backend needs this for its frame length header, but the vhost-user
backend passes NULL for the tap header and doesn't use it at all.

Remove the tap_hdr parameter from tcp_fill_headers() and instead return
the computed L2 frame length.  The tap backend caller,
tcp_l2_buf_fill_headers(), now calls tap_hdr_update() itself with the
returned length.  The vhost-user callers, tcp_vu_send_flag() and
tcp_vu_prepare(), no longer need to pass a NULL tap header.
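
The split described above could look roughly like this. It is a sketch
with hypothetical names and fixed IPv4 header sizes, and it assumes the
tap-specific header is a 32-bit frame length in network byte order: the
header-filling function only returns the L2 length, and the tap caller
alone writes the length header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>

#define ETH_HLEN_SKETCH	14	/* Ethernet header */
#define IP4_HLEN_SKETCH	20	/* IPv4 header, no options */
#define TCP_HLEN_SKETCH	20	/* TCP header, no options */

/* Backend-agnostic part: would fill the headers, returns L2 frame length */
static size_t fill_headers_sketch(size_t dlen)
{
	return ETH_HLEN_SKETCH + IP4_HLEN_SKETCH + TCP_HLEN_SKETCH + dlen;
}

/* Tap-only caller: the only place that touches the length header */
static void tap_fill_sketch(uint32_t *tap_len_hdr, size_t dlen)
{
	size_t l2len = fill_headers_sketch(dlen);

	assert(tap_len_hdr);
	*tap_len_hdr = htonl((uint32_t)l2len);
}
```

A vhost-user caller in this sketch would call fill_headers_sketch()
directly and ignore the tap length header altogether.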

Signed-off-by: Laurent Vivier 
Signed-off-by: Stefano Brivio

tcp: Retransmit FINs like data segments

2026-01-31T02:56:50+00:00

RFC 9293 doesn't distinguish between regular data segments and FIN segments
for the purposes of retransmissions.  Our existing retransmission logic
will also work for FIN segments, except for one detail: we don't currently
set the ACK_FROM_TAP_DUE flag when we send a FIN.  Add the flag, so that
we'll properly retransmit FIN segments like data segments.

Remove the section from the theory of operation comment that describes a
different way of handling FIN timeouts which (a) isn't correct behaviour
and (b) doesn't appear to be implemented.

I've tested this by adding logic to suppress sending the FIN if retries <
some non-zero value.  We correctly resend the FIN and close normally after
the expected timeouts.

Link: https://bugs.passt.top/show_bug.cgi?id=195
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp, udp: Pad batched frames to 60 bytes (802.3 minimum) in non-vhost-user modes

2025-12-08T03:47:22+00:00

Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and
UDP_IOV_ETH_PAD for UDP, after the payload, make that point to a
zero-filled buffer, and send out a part of it if needed to reach
the minimum frame length given by 802.3, that is, 60 bytes altogether.

The frames we might need to pad are IPv4 only (the IPv6 header is
larger), and are typically TCP ACK segments but can also be small
data segments or datagrams.
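
As an illustration of the padding part, here is a minimal sketch with
hypothetical names (only the 60-byte minimum and the zero-filled buffer
come from the description above). The extra iovec entry always points to
the same zeroed buffer, and its length is whatever is missing to reach 60
bytes, or zero.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define ETH_ZLEN_SKETCH	60	/* 802.3 minimum frame length, without FCS */

/* Shared zero-filled buffer the padding part points to, never written */
static const char eth_pad_sketch[ETH_ZLEN_SKETCH] = { 0 };

/* Hypothetical helper: set up the padding part for a frame whose headers
 * and payload take l2len bytes. Returns the number of padding bytes.
 */
static size_t eth_pad_set(struct iovec *pad, size_t l2len)
{
	size_t padlen = l2len < ETH_ZLEN_SKETCH ? ETH_ZLEN_SKETCH - l2len : 0;

	assert(pad);
	pad->iov_base = (void *)eth_pad_sketch;
	pad->iov_len = padlen;

	return padlen;
}
```

A bare IPv4 TCP ACK takes 14 + 20 + 20 = 54 bytes, so it gets 6 bytes of
padding here. The same segment over IPv6 takes 74 bytes and gets none.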

Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

tcp: forward external source MAC address through tap interface

2025-10-30T11:01:01+00:00

Forward the source MAC address of incoming packets through the tap
interface when they come from hosts on the local network.

This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120

Signed-off-by: Jon Maloy 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Fix ACK sequence on FIN to tap

2025-10-07T20:22:27+00:00

If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
send a FIN segment to the guest / container acknowledging a sequence
number that's behind what we received so far, we won't have any
further trigger to send an updated ACK segment, as we are now
switching the epoll socket monitoring to edge-triggered mode.

To avoid this situation, in tcp_update_seqack_wnd(), we set the next
acknowledgement sequence to the current observed sequence, regardless
of what was acknowledged socket-side.

However, we don't necessarily call tcp_update_seqack_wnd() before
sending the FIN segment, which might lead to a situation, not observed
in practice, where we unnecessarily cause a retransmission at some
point after our FIN segment.

Avoid that by setting the ACK sequence to whatever we received from
the container / guest, before sending a FIN segment and switching to
EPOLLET.

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

tcp: Store the owner connections for flags frames

2025-09-11T15:11:48+00:00

There is an issue reported by Volker Diels-Grabsch and Boleyn Su.
A segmentation fault occurs when executing the following command:

	(sleep 0.1; ssh -p 22000 127.0.0.1) & passt -f -t 22000:22

It's caused by commit 78da088f7bab ("tcp: unify payload and flags
l2 frames array"). Fix it by storing the owner connections of flags
frames into tcp_frame_conns[] array.

Reported-by: Volker Diels-Grabsch 
Reported-by: Boleyn Su 
Suggested-by: David Gibson 
Fixes: 78da088f7bab ("tcp: unify payload and flags l2 frames array")
Signed-off-by: Yumei Huang 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

Reduce tcp_buf_discard size

2025-09-11T15:09:03+00:00

On kernels without SO_PEEK_OFF, a 16MB static buffer is used to
discard sent data. This patch reduces the buffer to 1MB.

Larger discards are now handled by using multiple iovec entries
pointing to the same 1MB buffer.
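
The approach can be sketched as follows. Names and the entry limit are
hypothetical, only the 1MB buffer size comes from the description above.
Every iovec entry points to the same buffer, so the data to discard
overwrites itself and the memory used stays at 1MB no matter how much is
discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define DISCARD_BUF_SIZE	((size_t)1 << 20)	/* 1 MiB */
#define DISCARD_IOV_MAX		16	/* assumption: up to 16 MiB at once */

static char discard_buf[DISCARD_BUF_SIZE];

/* Hypothetical helper: fill iov[] (DISCARD_IOV_MAX entries) so that a
 * receive call using it throws away the first len bytes.
 * Returns the number of entries used, or -1 if len doesn't fit.
 */
static int discard_iov_fill(struct iovec *iov, size_t len)
{
	int n = 0;

	assert(iov);
	if (len > DISCARD_BUF_SIZE * DISCARD_IOV_MAX)
		return -1;

	while (len) {
		size_t part = len < DISCARD_BUF_SIZE ? len : DISCARD_BUF_SIZE;

		iov[n].iov_base = discard_buf;	/* same buffer every time */
		iov[n].iov_len = part;
		len -= part;
		n++;
	}

	return n;
}
```

In this sketch, discarding 2MB plus 5 bytes uses three entries: two of
1MB each and one of 5 bytes, all with the same base address.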

Signed-off-by: Xun Gu 
[sbrivio: Drop stray whitespace after BUF_DISCARD_SIZE define]
Signed-off-by: Stefano Brivio

tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data

2025-09-11T15:03:47+00:00

For some reason, tcp_vu_data_from_sock() already takes care of this,
but the non-vhost-user version ignores this possibility and just sends
out a FIN segment whenever we infer we received one host-side,
regardless of the fact that we might have unacknowledged data still to
send.

Somewhat surprisingly, no issue was reported until Linux 6.17-rc1 and
kernel commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") came
around, leading to the following report from Paul, who hit this while
running Podman tests:

  439   0.033032  169.254.1.2 → 192.168.122.100 65540 TCP 56602 → 5789 [PSH, ACK] Seq=10336361 Ack=1 Win=65536 Len=65480
  440   0.033055  169.254.1.2 → 192.168.122.100 30324 TCP [TCP Window Full] 56602 → 5789 [PSH, ACK] Seq=10401841 Ack=1 Win=65536 Len=30264

we're sending data to the container, up to the edge of the window

  441   0.033059 192.168.122.100 → 169.254.1.2  60 TCP 5789 → 56602 [ACK] Seq=1 Ack=10401841 Win=83968 Len=0

and the container acknowledges it

  442   0.033091  169.254.1.2 → 192.168.122.100 53716 TCP 56602 → 5789 [PSH, ACK] Seq=10432105 Ack=1 Win=65536 Len=53656

we send more data, all we possibly can, in window

  443   0.033126 192.168.122.100 → 169.254.1.2  60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0

and the container shrinks the window due to the issue introduced
by kernel commit e2142825c120 ("net: tcp: send zero-window ACK when no
memory"). With a previous patch from this series, we rewind the
sequence, meaning that we assign conn->seq_to_tap from
conn->seq_ack_from_tap, so that we'll retransmit this segment, by
reading again from the socket, and increasing conn->seq_to_tap
once more.

However:

  444   0.033144  169.254.1.2 → 192.168.122.100 60 TCP 56602 → 5789 [FIN, PSH, ACK] Seq=10485761 Ack=1 Win=65536 Len=0

we eventually get a zero-length read from the socket and we miss the
fact that conn->seq_to_tap != conn->seq_ack_from_tap, so we send a
FIN flag with the most recent sequence. The kernel insists:

  445   0.033147 192.168.122.100 → 169.254.1.2  60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0

with its buggy zero-window update, but:

  446   0.033152 192.168.122.100 → 169.254.1.2  60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=69120 Len=0
  447   0.033202 192.168.122.100 → 169.254.1.2  60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=142848 Len=0

we don't reset the TAP_FIN_SENT flag anymore, and don't resend
the FIN segment (nor data), as we already rewound the sequence
earlier.

To solve this, hold off the FIN segment until we get a zero-length
read from the socket *and* we know that there's no unacknowledged
pending data, also without vhost-user, in tcp_buf_data_from_sock().
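
The condition above can be sketched as a predicate. This is an
illustration only: the struct is a reduced, hypothetical stand-in that
keeps just the two sequence fields named in this message, and the
function name is made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced connection state */
struct conn_sketch {
	uint32_t seq_to_tap;		/* next sequence to send to guest */
	uint32_t seq_ack_from_tap;	/* last sequence acked by guest */
};

/* Hypothetical predicate: is it safe to send the FIN segment now?
 * nread is the return value of the last read from the socket.
 */
static bool fin_can_be_sent(const struct conn_sketch *conn, long nread)
{
	assert(conn);

	if (nread != 0)		/* no end-of-file on the socket yet */
		return false;

	/* Hold off while anything sent to the guest is still unacknowledged */
	return conn->seq_to_tap == conn->seq_ack_from_tap;
}
```

With the sequence numbers from the capture above (10485761 sent,
10432105 acknowledged), the predicate returns false on the zero-length
read, so the FIN would be held back until the guest catches up.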

Reported-by: Paul Holzinger 
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger 
Reviewed-by: Jon Maloy