passt, branch 2025_09_11.6cbcccc

tcp: Store the owner connections for flags frames

2025-09-11T15:11:48+00:00

Volker Diels-Grabsch and Boleyn Su reported a segmentation fault
that occurs when executing the following command:

	(sleep 0.1; ssh -p 22000 127.0.0.1) & passt -f -t 22000:22

It's caused by commit 78da088f7bab ("tcp: unify payload and flags
l2 frames array"). Fix it by storing the owner connections of flags
frames in the tcp_frame_conns[] array.

Reported-by: Volker Diels-Grabsch 
Reported-by: Boleyn Su 
Suggested-by: David Gibson 
Fixes: 78da088f7bab ("tcp: unify payload and flags l2 frames array")
Signed-off-by: Yumei Huang 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

Reduce tcp_buf_discard size

2025-09-11T15:09:03+00:00

On kernels without SO_PEEK_OFF, a 16MB static buffer is used to
discard sent data. This patch reduces the buffer to 1MB.

Larger discards are now handled by using multiple iovec entries
pointing to the same 1MB buffer.
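
As a rough illustration of the iovec approach described above (a
sketch, not the actual patch): BUF_DISCARD_SIZE is the define named in
this commit, but its value here is only assumed from the 1MB figure,
and discard_buf and discard_iov_fill() are hypothetical names.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Assumed value, taken from the 1MB figure in the commit message */
#define BUF_DISCARD_SIZE	(1 << 20)

/* Single scratch buffer shared by all iovec entries (hypothetical name) */
static char discard_buf[BUF_DISCARD_SIZE];

/* Fill iov[] with entries that all point at discard_buf and together
 * cover up to len bytes. Returns the number of entries used; if iov_max
 * entries are not enough, the remainder is left for a later call.
 */
static size_t discard_iov_fill(struct iovec *iov, size_t iov_max, size_t len)
{
	size_t n = 0;

	while (len && n < iov_max) {
		size_t chunk = len < BUF_DISCARD_SIZE ? len : BUF_DISCARD_SIZE;

		iov[n].iov_base = discard_buf;
		iov[n].iov_len = chunk;
		len -= chunk;
		n++;
	}

	return n;
}
```

The resulting array could then be passed as msg_iov to recvmsg(): the
received bytes overwrite each other in the shared buffer, which is
fine as they are being discarded anyway.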

Signed-off-by: Xun Gu 
[sbrivio: Drop stray whitespace after BUF_DISCARD_SIZE define]
Signed-off-by: Stefano Brivio

tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data

2025-09-11T15:03:47+00:00

For some reason, tcp_vu_data_from_sock() already takes care of this,
but the non-vhost-user version ignores this possibility and just sends
out a FIN segment whenever we infer we received one host-side,
regardless of the fact that we might have unacknowledged data still to
send.

Somewhat surprisingly, this didn't cause any reported issue
until 6.17-rc1 and 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
came around, leading to the following report from Paul, who hit this
running Podman tests:

  439   0.033032  169.254.1.2 → 192.168.122.100 65540 TCP 56602 → 5789 [PSH, ACK] Seq=10336361 Ack=1 Win=65536 Len=65480
  440   0.033055  169.254.1.2 → 192.168.122.100 30324 TCP [TCP Window Full] 56602 → 5789 [PSH, ACK] Seq=10401841 Ack=1 Win=65536 Len=30264

we're sending data to the container, up to the edge of the window

  441   0.033059 192.168.122.100 → 169.254.1.2  60 TCP 5789 → 56602 [ACK] Seq=1 Ack=10401841 Win=83968 Len=0

and the container acknowledges it

  442   0.033091  169.254.1.2 → 192.168.122.100 53716 TCP 56602 → 5789 [PSH, ACK] Seq=10432105 Ack=1 Win=65536 Len=53656

we send more data, all we possibly can, in window

  443   0.033126 192.168.122.100 → 169.254.1.2  60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0

and the container shrinks the window due to the issue introduced
by kernel commit e2142825c120 ("net: tcp: send zero-window ACK when no
memory"). With a previous patch from this series, we rewind the
sequence, meaning that we assign conn->seq_to_tap from
conn->seq_ack_from_tap, so that we'll retransmit this segment, by
reading again from the socket, and increasing conn->seq_to_tap
once more.

However:

  444   0.033144  169.254.1.2 → 192.168.122.100 60 TCP 56602 → 5789 [FIN, PSH, ACK] Seq=10485761 Ack=1 Win=65536 Len=0

we eventually get a zero-length read from the socket and we miss the
fact that conn->seq_to_tap != conn->seq_ack_from_tap, so we send a
FIN flag with the most recent sequence. The kernel insists:

  445   0.033147 192.168.122.100 → 169.254.1.2  60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0

with its buggy zero-window update, but:

  446   0.033152 192.168.122.100 → 169.254.1.2  60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=69120 Len=0
  447   0.033202 192.168.122.100 → 169.254.1.2  60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=142848 Len=0

we don't reset the TAP_FIN_SENT flag anymore, and don't resend
the FIN segment (nor data), as we already rewound the sequence
earlier.

To solve this, hold off the FIN segment until we get a zero-length
read from the socket *and* we know that there's no unacknowledged
pending data, also without vhost-user, in tcp_buf_data_from_sock().
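
The condition can be sketched as follows. This is a simplified
illustration, not the code from the patch: struct fin_sketch and
fin_can_be_sent() are hypothetical, and only the two sequence fields
named above are modelled.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical, reduced view of a connection */
struct fin_sketch {
	uint32_t seq_to_tap;		/* next sequence to send to guest */
	uint32_t seq_ack_from_tap;	/* last sequence acked by guest */
};

/* Return true only if the socket read returned zero bytes (end of
 * stream) and everything sent so far was acknowledged by the guest.
 */
static bool fin_can_be_sent(const struct fin_sketch *c, ssize_t read_len)
{
	if (read_len != 0)
		return false;

	return c->seq_to_tap == c->seq_ack_from_tap;
}
```

With seq_to_tap ahead of seq_ack_from_tap, as in the capture above, a
zero-length read alone would not be enough to send the FIN segment.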

Reported-by: Paul Holzinger 
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger 
Reviewed-by: Jon Maloy

tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent

2025-09-11T15:03:46+00:00

We currently have a number of discrepancies in the tcp_tap_handler()
path between the half-closed connection path and the regular one, and
they are mostly a result of code duplication, which comes in turn from
the fact that tcp_data_from_tap() deals with data transfers as well as
general connection bookkeeping, so we can't use it for half-closed
connections.

This suggests that we should probably rework it into two or more
functions, in the long term, but for the time being I'm just fixing
one obvious issue, which is the lack of fast retransmissions in the
TAP_FIN_RCVD path, and a potential one, which is the fact we don't
handle socket flush failures.

Add fast re-transmit for half-closed connections, and handle the case
of socket flush (tcp_sock_consume()) failure in the same way as
tcp_data_from_tap() handles it.

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger 
Reviewed-by: Jon Maloy

tcp: Cast operands of sequence comparison macros to uint32_t before using them

2025-09-11T15:03:45+00:00

Otherwise, passing signed types causes automatic promotion of the
result of the subtractions as well, which is not what we want, as
these macros rely on unsigned 32-bit arithmetic.

The next patch introduces a ssize_t operand for SEQ_LE, illustrating
the issue.
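
A minimal sketch of the promotion problem, assuming an LP64 platform
where a 64-bit signed operand is wider than uint32_t. The macro names
and the window constant below are hypothetical stand-ins, not the
definitions from the tree.

```c
#include <stdint.h>

/* Hypothetical window bound, only for this illustration */
#define SKETCH_MAX_WINDOW	(1U << 24)

/* Without casts: a 64-bit signed operand promotes the subtraction to
 * 64 bits, so the result no longer wraps modulo 2^32.
 */
#define SKETCH_SEQ_LE_NOCAST(a, b)	((b) - (a) < SKETCH_MAX_WINDOW)

/* With casts: the subtraction is always done in unsigned 32-bit
 * arithmetic, regardless of the operand types.
 */
#define SKETCH_SEQ_LE_CAST(a, b)					\
	((uint32_t)(b) - (uint32_t)(a) < SKETCH_MAX_WINDOW)
```

For example, with a uint32_t a = 100 and a 64-bit signed b = 50, the
uncast version computes -50, which compares as less than the bound,
so "100 <= 50" wrongly evaluates as true. The cast version computes
0xffffffce, which is not below the bound.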

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger 
Reviewed-by: Jon Maloy

tcp: Don't try to transmit right after the peer shrank the window to zero

2025-09-11T15:03:44+00:00

If the peer shrinks the window to zero, we'll skip storing the new
window, as a convenient way to cause window probes (which exceed any
zero-sized window, strictly speaking) if we don't get window updates
in a while.

As we do so, though, we need to ensure we don't try to queue more data
from the socket right after we process this window update, as the
entire point of a zero-window advertisement is to keep us from sending
more data.
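
Roughly, the idea is that the window update step tells its caller
whether it makes sense to go on and queue data. The sketch below is
an assumption about the shape of that logic, with hypothetical names
(struct wnd_sketch, wnd_update_sketch()), not the actual function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of a connection */
struct wnd_sketch {
	uint32_t wnd_from_tap;	/* last non-zero window we stored */
};

/* Process a window advertised by the peer. A zero window is not
 * stored, so that probes can still be sent later, but the caller is
 * told not to queue more data right now. Returns true if sending may
 * proceed.
 */
static bool wnd_update_sketch(struct wnd_sketch *c, uint32_t wnd)
{
	if (!wnd)
		return false;

	c->wnd_from_tap = wnd;
	return true;
}
```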

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger 
Reviewed-by: Jon Maloy

tcp: Fix closing logic for half-closed connections

2025-09-11T15:03:42+00:00

First off, don't close connections half-closed by the guest before
our own FIN is acknowledged by the guest itself.

That is, after we receive a FIN from the guest (TAP_FIN_RCVD), if we
don't have any data left to send from the socket (SOCK_FIN_RCVD, or
EPOLLHUP), we send a FIN segment to the guest (TAP_FIN_SENT), but we
need to actually have it acknowledged (and have no pending
retransmissions) before we can close the connection: check for
TAP_FIN_ACKED, first.

Then, if we set TAP_FIN_SENT, and we receive an ACK segment from the
guest, set TAP_FIN_ACKED. This was entirely missing for the
TAP_FIN_RCVD case, and as we fix the problem described above, this
becomes relevant as well.
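
The two checks can be sketched like this. Flag bit values, the
structure and both helpers are hypothetical, and treating an ACK
equal to seq_to_tap as "no pending retransmissions" is an assumption
made for the sake of the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for the event flags named above */
#define EV_TAP_FIN_RCVD		(1U << 0)
#define EV_TAP_FIN_SENT		(1U << 1)
#define EV_TAP_FIN_ACKED	(1U << 2)

/* Hypothetical, reduced view of a connection */
struct close_sketch {
	uint32_t events;
	uint32_t seq_to_tap;	/* sequence right after our FIN */
};

/* On an ACK from the guest: if we sent our FIN and the ACK covers
 * everything we sent, record that the FIN was acknowledged.
 */
static void guest_ack_sketch(struct close_sketch *c, uint32_t ack)
{
	if ((c->events & EV_TAP_FIN_SENT) && ack == c->seq_to_tap)
		c->events |= EV_TAP_FIN_ACKED;
}

/* A connection half-closed by the guest can be closed only once our
 * own FIN was acknowledged as well.
 */
static bool can_close_sketch(const struct close_sketch *c)
{
	return (c->events & EV_TAP_FIN_RCVD) &&
	       (c->events & EV_TAP_FIN_ACKED);
}
```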

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger 
Reviewed-by: Jon Maloy

tcp: Rewind sequence when guest shrinks window to zero

2025-09-11T15:03:40+00:00

A window shrunk to zero means by definition that anything else that
might be in flight is now out of window. Restart from the currently
acknowledged sequence.

We need to do that both in tcp_tap_window_update(), where we already
check for zero-window updates, as well as in tcp_data_from_tap(),
because we might get one of those updates in a batch of packets that
also contains a non-zero window update.
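
To illustrate the batch case: a zero-window update anywhere in the
batch should trigger the rewind, even if a later packet in the same
batch opens the window again. The sketch below uses hypothetical
names and a plain array of advertised windows in place of real
packet parsing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of a connection */
struct rewind_sketch {
	uint32_t seq_to_tap;		/* next sequence to send to guest */
	uint32_t seq_ack_from_tap;	/* last sequence acked by guest */
};

/* Scan the windows advertised in a batch of n packets. If any of them
 * is zero, restart from the currently acknowledged sequence. Returns
 * true if the sequence was rewound.
 */
static bool batch_zero_window_rewind(struct rewind_sketch *c,
				     const uint16_t *wnd, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!wnd[i]) {
			c->seq_to_tap = c->seq_ack_from_tap;
			return true;
		}
	}

	return false;
}
```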

Suggested-by: Jon Maloy 
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger 
Reviewed-by: Jon Maloy

tcp: Factor sequence rewind for retransmissions into a new function

2025-09-11T15:03:38+00:00

...as I'm going to need a third occurrence of this in the next change.

This introduces a small functional change in tcp_data_from_tap(): the
sequence was previously rewound to the highest ACK number we found in
the current packet batch, and not to the current value of
seq_ack_from_tap.

The two might differ in case tcp_sock_consume() failed, because in
that case we're ignoring that ACK altogether. But if we're ignoring
it, it looks more correct to me to start retransmitting from an
earlier sequence anyway.

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger

tcp: FIN flags have to be retransmitted as well

2025-09-11T15:03:21+00:00

If we're retransmitting any data, and we sent a FIN segment to our
peer, regardless of whether it was received, we obviously have to
retransmit it as well, given that it can only come with the last data
segment, or after it.

Unconditionally clear the internal TAP_FIN_SENT flag whenever we
re-transmit, so that we know we have to send it again if needed.
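
In sketch form, with a hypothetical bit value for the flag and a
hypothetical helper name, and assuming that retransmission restarts
from the last acknowledged sequence:

```c
#include <stdint.h>

/* Hypothetical bit value standing in for TAP_FIN_SENT */
#define EV_FIN_SENT_SKETCH	(1U << 0)

/* Hypothetical, reduced view of a connection */
struct retrans_sketch {
	uint32_t events;
	uint32_t seq_to_tap;		/* next sequence to send to guest */
	uint32_t seq_ack_from_tap;	/* last sequence acked by guest */
};

/* Prepare a retransmission: forget that we sent a FIN, as it follows
 * the data we are about to send again, then restart from the last
 * acknowledged sequence (an assumption for this example).
 */
static void retransmit_prepare_sketch(struct retrans_sketch *c)
{
	c->events &= ~EV_FIN_SENT_SKETCH;
	c->seq_to_tap = c->seq_ack_from_tap;
}
```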

Reported-by: Paul Holzinger 
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger