passt/tcp_buf.c, branch 2026_01_17.81c97f6

tcp, udp: Pad batched frames to 60 bytes (802.3 minimum) in non-vhost-user modes

2025-12-08T03:47:22+00:00

Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and
UDP_IOV_ETH_PAD for UDP, after the payload, make that point to a
zero-filled buffer, and send out a part of it if needed to reach
the minimum frame length given by 802.3, that is, 60 bytes altogether.

The frames we might need to pad are IPv4 only (the IPv6 header is
larger), and are typically TCP ACK segments but can also be small
data segments or datagrams.

Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

tcp: forward external source MAC address through tap interface

2025-10-30T11:01:01+00:00

We forward the incoming mac address through the tap interface when
receiving incoming packets from network local hosts.

This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120

Signed-off-by: Jon Maloy 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Fix ACK sequence on FIN to tap

2025-10-07T20:22:27+00:00

If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
send a FIN segment to the guest / container acknowledging a sequence
number that's behind what we received so far, we won't have any
further trigger to send an updated ACK segment, as we are now
switching the epoll socket monitoring to edge-triggered mode.

To avoid this situation, in tcp_update_seqack_wnd(), we set the next
acknowledgement sequence to the current observed sequence, regardless
of what was acknowledged socket-side.

However, we don't necessarily call tcp_update_seqack_wnd() before
sending the FIN segment, which might potentially lead to a situation,
not observed in practice, where we unnecessarily cause a
retransmission at some point after our FIN segment.

Avoid that by setting the ACK sequence to whatever we received from
the container / guest, before sending a FIN segment and switching to
EPOLLET.

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

tcp: Store the owner connections for flags frames

2025-09-11T15:11:48+00:00

There is an issue reported by Volker Diels-Grabsch and Boleyn Su.
A segmentation fault occurs when executing the following command:

	(sleep 0.1; ssh -p 22000 127.0.0.1) & passt -f -t 22000:22

It's caused by commit 78da088f7bab ("tcp: unify payload and flags
l2 frames array"). Fix it by storing the owner connections of flags
frames into tcp_frame_conns[] array.

Reported-by: Volker Diels-Grabsch 
Reported-by: Boleyn Su 
Suggested-by: David Gibson 
Fixes: 78da088f7bab ("tcp: unify payload and flags l2 frames array")
Signed-off-by: Yumei Huang 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

Reduce tcp_buf_discard size

2025-09-11T15:09:03+00:00

On kernels without SO_PEEK_OFF, a 16MB static buffer is used to
discard sent data. This patch reduces the buffer to 1MB.

Larger discards are now handled by using multiple iovec entries
pointing to the same 1MB buffer.

Signed-off-by: Xun Gu 
[sbrivio: Drop stray whitespace after BUF_DISCARD_SIZE define]
Signed-off-by: Stefano Brivio

tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data

2025-09-11T15:03:47+00:00

For some reason, tcp_vu_data_from_sock() already takes care of this,
but the non-vhost-user version ignores this possibility and just sends
out a FIN segment whenever we infer we received one host-side,
regardless of the fact that we might have unacknowledged data still to
send.

Somewhat surprisingly, this didn't cause any issue to be reported yet,
until 6.17-rc1 and 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
came around, leading to the following report from Paul, who hit this
running Podman tests:

  439   0.033032  169.254.1.2 → 192.168.122.100 65540 TCP 56602 → 5789 [PSH, ACK] Seq=10336361 Ack=1 Win=65536 Len=65480
  440   0.033055  169.254.1.2 → 192.168.122.100 30324 TCP [TCP Window Full] 56602 → 5789 [PSH, ACK] Seq=10401841 Ack=1 Win=65536 Len=30264

we're sending data to the container, up to the edge of the window

  441   0.033059 192.168.122.100 → 169.254.1.2  60 TCP 5789 → 56602 [ACK] Seq=1 Ack=10401841 Win=83968 Len=0

and the container acknowledges it

  442   0.033091  169.254.1.2 → 192.168.122.100 53716 TCP 56602 → 5789 [PSH, ACK] Seq=10432105 Ack=1 Win=65536 Len=53656

we send more data, all we possibly can, in window

  443   0.033126 192.168.122.100 → 169.254.1.2  60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0

and the container shrinks the window due to the issue introduced
by kernel commit e2142825c120 ("net: tcp: send zero-window ACK when no
memory"). With a previous patch from this series, we rewind the
sequence, meaning that we assign conn->seq_to_tap from
conn->seq_ack_from_tap, so that we'll retransmit this segment, by
reading again from the socket, and increasing conn->seq_to_tap
once more.

However:

  444   0.033144  169.254.1.2 → 192.168.122.100 60 TCP 56602 → 5789 [FIN, PSH, ACK] Seq=10485761 Ack=1 Win=65536 Len=0

we eventually get a zero-length read from the socket and we miss the
fact that conn->seq_to_tap != conn->seq_ack_from_tap, so we send a
FIN flag with the most recent sequence. The kernel insists:

  445   0.033147 192.168.122.100 → 169.254.1.2  60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0

with its buggy zero-window update, but:

  446   0.033152 192.168.122.100 → 169.254.1.2  60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=69120 Len=0
  447   0.033202 192.168.122.100 → 169.254.1.2  60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=142848 Len=0

we don't reset the TAP_FIN_SENT flag anymore, and don't resend
the FIN segment (nor data), as we already rewound the sequence
earlier.

To solve this, hold off the FIN segment until we get a zero-length
read from the socket *and* we know that there's no unacknowledged
pending data, also without vhost-user, in tcp_buf_data_from_sock().

Reported-by: Paul Holzinger 
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger 
Reviewed-by: Jon Maloy

iov: Update IOV_REMOVE_HEADER() and IOV_PEEK_HEADER()

2025-09-03T18:42:17+00:00

Provide a temporary variable of the wanted type to store
the header if the memory in the iovec array is not contiguous.

Signed-off-by: Laurent Vivier 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

Correct various function comment headers

2025-06-04T10:32:04+00:00

This commit refines function comment headers for improved accuracy
and consistency. Key changes include:

- Corrected parameter/return descriptions (e.g., `logtime`, `__daemon`).
- Added missing and removed incorrect parameter documentation (e.g.,
  `tcp_vu_sock_recv`, `ndp`).
- Standardized comments to the `/** ... */` style for functions
  like `udp_flow_close` and `ns_enter`.
- Ensured function names in comments consistently use `()`.
- Addressed minor typos and updated comments for renamed functions.

Signed-off-by: Laurent Vivier 
Signed-off-by: Stefano Brivio

tcp, flow: Better use flow specific logging heleprs

2025-03-14T22:40:40+00:00

A number of places in the TCP code use general logging functions, instead
of the flow specific ones.  That includes a few older ones as well as many
places in the new migration code.  Thus they either don't identify which
flow the problem happened on, or identify it in a non-standard way.

Convert many of these to use the existing flow specific helpers.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Set PSH flag for last incoming packets in a batch

2025-01-21T13:28:44+00:00

So far we omitted setting PSH flags for inbound traffic altogether: as
we ignore the nature of the data we're sending, we can't conclude that
some data is more or less urgent. This works fine with Linux guests,
as the Linux kernel doesn't do much with it, on input: it will
generally deliver data to the application layer without delay.

However, with Windows, things change: if we don't set the PSH flag on
interactive inbound traffic, we can expect long delays before the data
is delivered to the application.

This is very visible with RDP, where packets we send on behalf of the
RDP client are delivered with delays exceeding one second:

  $ tshark -r rdp.pcap -td -Y 'frame.number in { 33170 .. 33173 }' --disable-protocol tls
  33170   0.030296 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285229 Win=387968 Len=0
  33171   0.985412 88.198.0.164 → 93.235.154.248 105 TCP 3389 → 49012 [PSH, ACK] Seq=285229 Ack=13820 Win=63198 Len=51
  33172   0.030373 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285280 Win=387968 Len=0
  33173   1.383776 88.198.0.164 → 93.235.154.248 424 TCP 3389 → 49012 [PSH, ACK] Seq=285280 Ack=13820 Win=63198 Len=370

in this example (packet capture taken by passt), frame #33172 is a
mouse event sent by the RDP client, and frame #33173 is the first
event (display reacting to click) sent back by the server. This
appears as a 1.4 s delay before we get frame #33173.

If we set PSH, instead:

  $ tshark -r rdp_psh.pcap -td -Y 'frame.number in { 314 .. 317 }' --disable-protocol tls
  314   0.002503 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7779 Ack=74047 Win=31872 Len=116
  315   0.000557 88.198.0.164 → 93.235.154.248 54 TCP 3389 → 51066 [ACK] Seq=79162 Ack=7895 Win=62872 Len=0
  316   0.012752 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7895 Ack=79162 Win=31872 Len=116
  317   0.011927 88.198.0.164 → 93.235.154.248 107 TCP 3389 → 51066 [PSH, ACK] Seq=79162 Ack=8011 Win=62756 Len=53

here, in frame #316, our mouse event is delivered without a delay and
receives a response in approximately 12 ms.

Set PSH on the last segment for any batch we dequeue from the socket,
that is, set it whenever we know that we might not be sending data to
the same port for a while.

Reported-by: NN708
Link: https://bugs.passt.top/show_bug.cgi?id=107
Signed-off-by: Stefano Brivio