<feed xmlns='http://www.w3.org/2005/Atom'>
<title>passt, branch 2025_12_10.d04c480</title>
<subtitle>Plug A Simple Socket Transport</subtitle>
<link rel='alternate' type='text/html' href='https://passt.top/passt/'/>
<entry>
<title>pif: Correctly set scope_id for guest-side link local addresses</title>
<updated>2025-12-10T07:37:29+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2025-12-10T07:02:57+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=d04c48032bcf724550d0b8f652fd00efcd2dfad0'/>
<id>d04c48032bcf724550d0b8f652fd00efcd2dfad0</id>
<content type='text'>
pif_sockaddr() is supposed to generate a suitable socket address, either
for the host, or for the guest, depending on the 'pif' parameter.  When
given a link-local address, this means it needs to generate a suitable
scope_id to specify which link.  It does this for the host, but not for the
guest.

I think this was done on the assumption that we won't ever generate
guest-side link-local addresses when forwarding connections.  That,
however, is not the case, at least with the recent extensions to
"local mode".  Fix the problem by properly populating the scope_id
field for guest addresses.
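
The idea can be sketched as follows, with a hypothetical sa6_for()
helper standing in for the actual pif_sockaddr(), and a made-up
interface index:

```c
#include "assert.h"
#include "netinet/in.h"

/* Hypothetical helper, not passt's actual pif_sockaddr(): build an
 * IPv6 socket address and, for a link-local (fe80::/10) address,
 * record the intended link in sin6_scope_id.  Leaving scope_id at 0
 * for such addresses makes later bind()/connect() calls fail, as the
 * kernel can't tell which interface is meant.
 */
static struct sockaddr_in6 sa6_for(struct in6_addr a, unsigned int ifindex)
{
	struct sockaddr_in6 sa6 = { .sin6_family = AF_INET6 };

	sa6.sin6_addr = a;
	if (a.s6_addr[0] == 0xfe)
		if (a.s6_addr[1] / 64 == 2)	/* top two bits 10: fe80::/10 */
			sa6.sin6_scope_id = ifindex;

	return sa6;
}
```

The scope check applies the same way for guest-side and host-side
addresses, which is the point of the fix.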

Link: https://bugs.passt.top/show_bug.cgi?id=181
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
pif_sockaddr() is supposed to generate a suitable socket address, either
for the host, or for the guest, depending on the 'pif' parameter.  When
given a link-local address, this means it needs to generate a suitable
scope_id to specify which link.  It does this for the host, but not for the
guest.

I think this was done on the assumption that we won't ever generate
guest-side link-local addresses when forwarding connections.  That,
however, is not the case, at least with the recent extensions to
"local mode".  Fix the problem by properly populating the scope_id
field for guest addresses.

Link: https://bugs.passt.top/show_bug.cgi?id=181
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Correct timer expiry value in trace message</title>
<updated>2025-12-10T07:37:06+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2025-12-10T07:02:56+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=696709d74b240088ffeda7f2c72b16e75879c689'/>
<id>696709d74b240088ffeda7f2c72b16e75879c689</id>
<content type='text'>
000601ba8 ("tcp: Adaptive interval based on RTT for socket-side
acknowledgement checks") added (amongst other things) a new trace message
showing the expiry time for the TCP timer in ms rather than s.

Unfortunately, there were some arithmetic errors in the message,
meaning it would print incorrect numbers.  Correct them.
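
The conversion in question is of this general shape; a hypothetical
illustration, not the actual trace code:

```c
#include "assert.h"
#include "time.h"

/* Hypothetical illustration of a s -> ms conversion: to express a
 * struct timespec in milliseconds, seconds scale up by 1000 and
 * nanoseconds scale down by 1000000.  Getting either factor wrong
 * prints bogus expiry times.
 */
static long timespec_to_ms(struct timespec ts)
{
	return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}
```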

Fixes: 000601ba86da ("tcp: Adaptive interval based on RTT for socket-side acknowledgement checks")
Link: https://bugs.passt.top/show_bug.cgi?id=182
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
000601ba8 ("tcp: Adaptive interval based on RTT for socket-side
acknowledgement checks") added (amongst other things) a new trace message
showing the expiry time for the TCP timer in ms rather than s.

Unfortunately, there were some arithmetic errors in the message,
meaning it would print incorrect numbers.  Correct them.

Fixes: 000601ba86da ("tcp: Adaptive interval based on RTT for socket-side acknowledgement checks")
Link: https://bugs.passt.top/show_bug.cgi?id=182
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp_splice, flow: Add socket to epoll set before connect(), drop assert</title>
<updated>2025-12-09T00:27:24+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-12-08T21:18:01+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=c3f1ba70237a9e66822aff3aa5765d0adf6f6307'/>
<id>c3f1ba70237a9e66822aff3aa5765d0adf6f6307</id>
<content type='text'>
...otherwise, if we have a real error on connect() (that is, not
EINPROGRESS), we'll return early from tcp_splice_connect() and later
try to fetch the epoll file descriptor:

  ASSERTION FAILED in flow_epollfd (flow.c:362): f-&gt;epollid &lt; ((1 &lt;&lt; 8) - 1)

which is still (correctly) EPOLLFD_ID_INVALID.

Replace the ASSERT() in flow_epollfd() with a warning, as it looks
like there might be harmless cases where the socket is not in the
epoll set yet, and we'll just crash for nothing. We can turn this back
to an ASSERT() once we audit these paths in more detail.
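
The ordering the patch establishes can be sketched like this, with a
hypothetical helper name (the real code lives in tcp_splice_connect()):

```c
#include "assert.h"
#include "sys/epoll.h"
#include "sys/socket.h"
#include "unistd.h"

/* Hypothetical sketch, not the actual tcp_splice_connect() code: the
 * socket joins the epoll set before connect() is ever called, so that
 * a connect() failing outright, rather than returning EINPROGRESS,
 * still leaves a descriptor that error-path lookups can find.
 */
static int sock_epoll_first(int epfd)
{
	int s = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);

	if (0 > s)
		return -1;

	if (0 > epoll_ctl(epfd, EPOLL_CTL_ADD, s,
			  (struct epoll_event []){ { .events = EPOLLOUT } })) {
		close(s);
		return -1;
	}

	return s;	/* caller would connect() next, expecting EINPROGRESS */
}
```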

Link: https://bodhi.fedoraproject.org/updates/FEDORA-2025-93b4eb64c3#comment-4473411
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
...otherwise, if we have a real error on connect() (that is, not
EINPROGRESS), we'll return early from tcp_splice_connect() and later
try to fetch the epoll file descriptor:

  ASSERTION FAILED in flow_epollfd (flow.c:362): f-&gt;epollid &lt; ((1 &lt;&lt; 8) - 1)

which is still (correctly) EPOLLFD_ID_INVALID.

Replace the ASSERT() in flow_epollfd() with a warning, as it looks
like there might be harmless cases where the socket is not in the
epoll set yet, and we'll just crash for nothing. We can turn this back
to an ASSERT() once we audit these paths in more detail.

Link: https://bodhi.fedoraproject.org/updates/FEDORA-2025-93b4eb64c3#comment-4473411
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fedora: Fix build on Fedora 43, selinux_requires_min not available on Copr builders</title>
<updated>2025-12-08T10:17:14+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-12-08T10:17:14+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=e8b56a3d2456a62eed5ce4297134b26427c2e5b6'/>
<id>e8b56a3d2456a62eed5ce4297134b26427c2e5b6</id>
<content type='text'>
For some reason, on Copr:

  Building target platforms: aarch64
  Building for target aarch64
  error: line 42: Unknown tag: %selinux_requires_min
  Child return code was: 1

Use %selinux_requires_min only starting from current Rawhide /
Fedora 44, where it works.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
For some reason, on Copr:

  Building target platforms: aarch64
  Building for target aarch64
  error: line 42: Unknown tag: %selinux_requires_min
  Child return code was: 1

Use %selinux_requires_min only starting from current Rawhide /
Fedora 44, where it works.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Skip redundant ACK on partial sendmsg() failure</title>
<updated>2025-12-08T08:15:36+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-12-04T06:17:00+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=c93515c1bf93f270327cab0082dc03acb9dda625'/>
<id>c93515c1bf93f270327cab0082dc03acb9dda625</id>
<content type='text'>
...we'll send a duplicate ACK right away in this case, and this
redundant, earlier check is not just useless: it might actually be
harmful, as we would now send a triple ACK, which might cause two
retransmissions.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
...we'll send a duplicate ACK right away in this case, and this
redundant, earlier check is not just useless: it might actually be
harmful, as we would now send a triple ACK, which might cause two
retransmissions.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Send a duplicate ACK also on complete sendmsg() failure</title>
<updated>2025-12-08T08:15:36+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-12-04T06:13:44+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=e0f1330287fad6c09bc140ba47914ba4aed90f98'/>
<id>e0f1330287fad6c09bc140ba47914ba4aed90f98</id>
<content type='text'>
...in order to trigger a fast retransmit as soon as possible. There's
no benefit in forcing the sender to wait for a longer time than that.

We already do this on partial failures (short socket writes), but for
historical reasons not on complete failures. Make these two cases
consistent with each other.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
...in order to trigger a fast retransmit as soon as possible. There's
no benefit in forcing the sender to wait for a longer time than that.

We already do this on partial failures (short socket writes), but for
historical reasons not on complete failures. Make these two cases
consistent with each other.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Allow exceeding the available sending buffer size in window advertisements</title>
<updated>2025-12-08T08:15:36+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-12-04T06:13:32+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=2b5c9064dbf9f9840d8c44ee7b0bdaebd27b6a8e'/>
<id>2b5c9064dbf9f9840d8c44ee7b0bdaebd27b6a8e</id>
<content type='text'>
If the remote peer is advertising a bigger value than our current
sending buffer, it means that a bigger sending buffer is likely to
benefit throughput.

We can get a bigger sending buffer by means of the buffer size
auto-tuning performed by the Linux kernel, which is triggered by
aggressively filling the sending buffer.

Use an adaptive boost factor, up to 150%, depending on:

- how much data we sent so far: we don't want to risk retransmissions
  for short-lived connections, as the latency cost would be
  unacceptable, and

- the current RTT value, as we need a bigger buffer for higher
  transmission delays

The factor we use is not quite a bandwidth-delay product, as we're
missing the time component of the bandwidth, which is not interesting
here: we are trying to make the buffer grow at the beginning of a
connection, progressively, as more data is sent.

The amount of boost we apply was tuned somewhat empirically, but it
appears to yield the available throughput in rather different
scenarios (from ~ 10 Gbps bandwidth with 500 ns RTT to ~ 1 Gbps with
300 ms RTT), and it allows getting there rather quickly, within a few
seconds for the 300 ms case.

Note that we want to apply this factor only if the window advertised
by the peer is bigger than the current sending buffer, as we only need
this for auto-tuning, and we absolutely don't want to incur
unnecessary retransmissions otherwise.

The related condition in tcp_update_seqack_wnd() is not redundant, as
there's a subtractive factor, sendq, in the calculation of the window
limit. If the sending buffer is smaller than the peer's advertised
window, the limit we apply might be lower than it would otherwise be.

Assuming that the sending buffer is reported as 100k, sendq is
20k, we could have these example cases:

1. tinfo-&gt;tcpi_snd_wnd is 120k, which is bigger than the sending
   buffer, so we boost its size to 150k, and we limit the window
   to 120k

2. tinfo-&gt;tcpi_snd_wnd is 90k, which is smaller than the sending
   buffer, so we aren't trying to trigger buffer auto-tuning and
   we'll stick to the existing, more conservative calculation,
   by limiting the window to 100 - 20 = 80k

If we omitted the new condition, we would always use the boosted
value, that is, 120k, even if potentially causing unnecessary
retransmissions.
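
The two worked cases above can be condensed into a short sketch, with
a hypothetical wnd_limit() helper and the boost factor fixed at 150%
for brevity (the actual factor is adaptive):

```c
#include "assert.h"

#define MIN(a, b)	((b) > (a) ? (a) : (b))

/* Hypothetical sketch of the limit calculation, with the adaptive
 * boost simplified to a constant 150%: only when the peer advertises
 * more than the sending buffer do we boost, to trigger kernel buffer
 * auto-tuning, capping at the peer's window; otherwise we keep the
 * conservative sndbuf - sendq limit and avoid needless
 * retransmissions.
 */
static unsigned long wnd_limit(unsigned long sndbuf, unsigned long sendq,
			       unsigned long peer_wnd)
{
	if (peer_wnd > sndbuf)
		return MIN(sndbuf * 3 / 2, peer_wnd);

	return sndbuf - sendq;
}
```

With sndbuf 100k and sendq 20k, a 120k peer window yields a 120k
limit (case 1), while a 90k peer window yields 80k (case 2).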

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If the remote peer is advertising a bigger value than our current
sending buffer, it means that a bigger sending buffer is likely to
benefit throughput.

We can get a bigger sending buffer by means of the buffer size
auto-tuning performed by the Linux kernel, which is triggered by
aggressively filling the sending buffer.

Use an adaptive boost factor, up to 150%, depending on:

- how much data we sent so far: we don't want to risk retransmissions
  for short-lived connections, as the latency cost would be
  unacceptable, and

- the current RTT value, as we need a bigger buffer for higher
  transmission delays

The factor we use is not quite a bandwidth-delay product, as we're
missing the time component of the bandwidth, which is not interesting
here: we are trying to make the buffer grow at the beginning of a
connection, progressively, as more data is sent.

The amount of boost we apply was tuned somewhat empirically, but it
appears to yield the available throughput in rather different
scenarios (from ~ 10 Gbps bandwidth with 500 ns RTT to ~ 1 Gbps with
300 ms RTT), and it allows getting there rather quickly, within a few
seconds for the 300 ms case.

Note that we want to apply this factor only if the window advertised
by the peer is bigger than the current sending buffer, as we only need
this for auto-tuning, and we absolutely don't want to incur
unnecessary retransmissions otherwise.

The related condition in tcp_update_seqack_wnd() is not redundant, as
there's a subtractive factor, sendq, in the calculation of the window
limit. If the sending buffer is smaller than the peer's advertised
window, the limit we apply might be lower than it would otherwise be.

Assuming that the sending buffer is reported as 100k, sendq is
20k, we could have these example cases:

1. tinfo-&gt;tcpi_snd_wnd is 120k, which is bigger than the sending
   buffer, so we boost its size to 150k, and we limit the window
   to 120k

2. tinfo-&gt;tcpi_snd_wnd is 90k, which is smaller than the sending
   buffer, so we aren't trying to trigger buffer auto-tuning and
   we'll stick to the existing, more conservative calculation,
   by limiting the window to 100 - 20 = 80k

If we omitted the new condition, we would always use the boosted
value, that is, 120k, even if potentially causing unnecessary
retransmissions.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Don't limit window to less-than-MSS values, use zero instead</title>
<updated>2025-12-08T08:15:36+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-12-04T06:12:54+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=cf1925fb7b777d1b4dae4816f30af1fa5c6ebfa5'/>
<id>cf1925fb7b777d1b4dae4816f30af1fa5c6ebfa5</id>
<content type='text'>
If the sender uses data clumping (including Nagle's algorithm) for
Silly Window Syndrome (SWS) avoidance, advertising less than an MSS
means the sender might stop sending altogether, and window updates
after a low-window condition are just as important as they are in
a zero-window condition.

For simplicity, approximate that limit to zero, as we have an
implementation forcing window updates after zero-sized windows.
This matches the suggestion from RFC 813, section 4.
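
The approximation amounts to a clamp, sketched here with a
hypothetical helper name:

```c
#include "assert.h"

/* Hypothetical sketch: clamp sub-MSS windows to zero, per the RFC 813
 * section 4 suggestion, so the forced-update path for zero-sized
 * windows also covers low-window conditions.
 */
static unsigned long adv_wnd(unsigned long wnd, unsigned long mss)
{
	return mss > wnd ? 0 : wnd;
}
```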

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If the sender uses data clumping (including Nagle's algorithm) for
Silly Window Syndrome (SWS) avoidance, advertising less than an MSS
means the sender might stop sending altogether, and window updates
after a low-window condition are just as important as they are in
a zero-window condition.

For simplicity, approximate that limit to zero, as we have an
implementation forcing window updates after zero-sized windows.
This matches the suggestion from RFC 813, section 4.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Acknowledge everything if it looks like bulk traffic, not interactive</title>
<updated>2025-12-08T08:15:36+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-12-04T05:43:16+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=9139e60fd455fafb753c838e554732aed5ecbcd3'/>
<id>9139e60fd455fafb753c838e554732aed5ecbcd3</id>
<content type='text'>
...instead of checking if the current sending buffer is less than
SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
ACK segments: we rely on having enough data at once from the sender
to make the buffer grow by means of TCP buffer size tuning
implemented in the Linux kernel.

This is important if we're trying to maximise throughput, but not
desirable for interactive traffic, where we want to be as transparent
as possible and avoid introducing unnecessary latency.

Use the tcpi_delivery_rate field reported by the Linux kernel, if
available, to calculate the current bandwidth-delay product: if it's
significantly smaller than the available sending buffer, conclude that
we're not bandwidth-bound and this is likely to be interactive
traffic, so acknowledge data only as it's acknowledged by the peer.

Conversely, if the bandwidth-delay product is comparable to the size
of the sending buffer (more than 5%), we're probably bandwidth-bound
or... bound to be: acknowledge everything in that case.
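
The heuristic can be sketched as follows, with a hypothetical helper
(struct tcp_info reports the delivery rate in bytes per second and
the RTT in microseconds):

```c
#include "assert.h"

/* Hypothetical sketch of the bulk-vs-interactive guess, using the
 * units struct tcp_info reports: delivery rate in bytes per second,
 * RTT in microseconds.  If the bandwidth-delay product exceeds 5% of
 * the sending buffer, treat the flow as bulk and acknowledge
 * everything; otherwise mirror the peer's acknowledgements.
 */
static int looks_bulk(unsigned long long rate_bytes_s, unsigned long rtt_us,
		      unsigned long sndbuf)
{
	unsigned long long bdp = rate_bytes_s * rtt_us / 1000000ULL;

	return bdp > sndbuf / 20;
}
```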

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
...instead of checking if the current sending buffer is less than
SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
ACK segments: we rely on having enough data at once from the sender
to make the buffer grow by means of TCP buffer size tuning
implemented in the Linux kernel.

This is important if we're trying to maximise throughput, but not
desirable for interactive traffic, where we want to be as transparent
as possible and avoid introducing unnecessary latency.

Use the tcpi_delivery_rate field reported by the Linux kernel, if
available, to calculate the current bandwidth-delay product: if it's
significantly smaller than the available sending buffer, conclude that
we're not bandwidth-bound and this is likely to be interactive
traffic, so acknowledge data only as it's acknowledged by the peer.

Conversely, if the bandwidth-delay product is comparable to the size
of the sending buffer (more than 5%), we're probably bandwidth-bound
or... bound to be: acknowledge everything in that case.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window</title>
<updated>2025-12-08T08:15:36+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-12-03T19:13:05+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=28f413d0332c923f1a4a7a05359d90116cbcb4a3'/>
<id>28f413d0332c923f1a4a7a05359d90116cbcb4a3</id>
<content type='text'>
We correctly avoid doing that at the beginning of tcp_prepare_flags(),
but we might clear the flag later on if we actually end up sending a
"flag" segment.

Make sure we don't, otherwise we might significantly delay window
updates after a zero-window condition, and noticeably affect
throughput.

In some cases, we're forcing peers to send zero-window probes or
keep-alive segments.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We correctly avoid doing that at the beginning of tcp_prepare_flags(),
but we might clear the flag later on if we actually end up sending a
"flag" segment.

Make sure we don't, otherwise we might significantly delay window
updates after a zero-window condition, and noticeably affect
throughput.

In some cases, we're forcing peers to send zero-window probes or
keep-alive segments.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</pre>
</div>
</content>
</entry>
</feed>
