<feed xmlns='http://www.w3.org/2005/Atom'>
<title>passt/tcp.c, branch 2026_05_07.1afd4ed</title>
<subtitle>Plug A Simple Socket Transport</subtitle>
<link rel='alternate' type='text/html' href='https://passt.top/passt/'/>
<entry>
<title>tcp: Use SO_MEMINFO for accurate send buffer overhead accounting</title>
<updated>2026-05-07T06:03:14+00:00</updated>
<author>
<name>Jon Maloy</name>
<email>jmaloy@redhat.com</email>
</author>
<published>2026-04-25T19:58:18+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=a458719f01838f0d5867e817d84a497637999ee1'/>
<id>a458719f01838f0d5867e817d84a497637999ee1</id>
<content type='text'>
The TCP window advertised to the guest/container must balance two
competing needs: large enough to trigger kernel socket buffer
auto-tuning, but not so large that sendmsg() partially fails, causing
retransmissions.

The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
SNDBUF_GET() returns a scaled value that only roughly accounts for
per-skb overhead. The clamped_scale approximation doesn't accurately
track the actual per-segment overhead, which can lead to both excessive
retransmissions and reduced throughput.

We now introduce the use of SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and
SK_MEMINFO_WMEM_QUEUED from the kernel. The latter is presented in the
kernel's own accounting units, i.e. including the sk_buff overhead,
and matches exactly what the kernel's own sk_stream_memory_free()
function is using.

When data is queued and the overhead ratio is observable, we calculate
the per-segment overhead as (wmem_queued - sendq) / num_segments, then
determine how many additional segments should fit in the remaining
buffer space, considering the calculated per-MSS overhead. This approach
treats segments as discrete quantities, and produces a more accurate
estimate of available buffer space than a linear scaling factor does.

When the ratio cannot be observed, e.g. because the queue is empty or
we are in a transient state, we fall back to the existing clamped_scale
calculation (scaling between 100% and 75% of buffer capacity).

When SO_MEMINFO succeeds, we also use SK_MEMINFO_SNDBUF directly to
set SNDBUF, avoiding a separate SO_SNDBUF getsockopt() call.

If SO_MEMINFO is unavailable, we fall back to the pre-existing
SNDBUF_GET() - SIOCOUTQ calculation.
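
The arithmetic above can be sketched as follows. This is a hedged
illustration, not passt's actual code: tcp_segs_left() and its
parameter names are invented; the inputs correspond to
SK_MEMINFO_SNDBUF, SK_MEMINFO_WMEM_QUEUED (both from getsockopt() with
SO_MEMINFO) and the SIOCOUTQ payload byte count.

```c
/* Sketch of the SO_MEMINFO-based estimate described above: given the
 * send buffer size, the queued memory in kernel accounting units
 * (sk_buff overhead included) and the queued payload bytes, derive the
 * per-segment overhead and count whole segments that still fit.
 * Names and types are illustrative, not passt's actual code. */
static long tcp_segs_left(unsigned sndbuf, unsigned wmem_queued,
			  unsigned sendq, unsigned mss)
{
	unsigned num_segs = (sendq + mss - 1) / mss;	/* round up */
	unsigned overhead;

	/* Ratio not observable (empty queue or transient state):
	 * caller falls back to the clamped_scale estimate */
	if (num_segs == 0 || sendq >= wmem_queued)
		return -1;

	/* Per-segment sk_buff overhead in the kernel's own units */
	overhead = (wmem_queued - sendq) / num_segs;

	if (wmem_queued >= sndbuf)
		return 0;

	/* Whole additional segments, each costing mss plus overhead */
	return (sndbuf - wmem_queued) / (mss + overhead);
}
```

A negative return here tells the caller to fall back to the
clamped_scale path described above.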

Link: https://bugs.passt.top/show_bug.cgi?id=138
Link: https://github.com/containers/podman/issues/28219
Analysed-by: Yumei Huang &lt;yuhuang@redhat.com&gt;
Signed-off-by: Jon Maloy &lt;jmaloy@redhat.com&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Handle errors from tcp_send_flag()</title>
<updated>2026-04-24T21:35:43+00:00</updated>
<author>
<name>Anshu Kumari</name>
<email>anskuma@redhat.com</email>
</author>
<published>2026-04-23T06:23:14+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=ec96f0124282338cd2b2e65ff1aa3def8882ae23'/>
<id>ec96f0124282338cd2b2e65ff1aa3def8882ae23</id>
<content type='text'>
tcp_send_flag() can fail in two different ways:
   - tcp_prepare_flags() returns -ECONNRESET when getsockopt(TCP_INFO)
     fails: the socket is broken and the connection must be reset.
   - tcp_vu_send_flag() returns -EAGAIN when vu_collect() finds no
     available vhost-user buffers: this is a transient condition
     equivalent to a dropped packet on the wire.

Have tcp_vu_send_flag() return -EAGAIN instead of a bare -1 for the
buffer-unavailable case. Absorb -EAGAIN in the tcp_send_flag()
dispatcher so that callers only see fatal errors.

Check the return value at each call site and handle fatal errors:
   - in tcp_data_from_tap(), return -1 so the caller resets
   - in tcp_tap_handler(), goto reset
   - in tcp_timer_handler()/tcp_sock_handler()/tcp_conn_from_sock_finish(),
     call tcp_rst() and return
   - in tcp_tap_conn_from_sock(), set CLOSING flag, call
     FLOW_ACTIVATE() to avoid leaving the flow in TYPED state, and
     return
   - in tcp_connect_finish(), call tcp_rst() and return
   - in tcp_keepalive(), call tcp_rst() and continue the loop
   - in tcp_flow_migrate_target_ext(), goto fail

The call in tcp_rst_do() is left unchecked: we are already
resetting, and tcp_sock_rst() still needs to run regardless.
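
The transient/fatal split can be sketched as below. Function bodies are
placeholders, not passt's code; EAGAIN and ECONNRESET are spelled out
as their Linux values so the sketch needs no includes.

```c
/* Illustrative sketch of the error dispatch described above.
 * EAGAIN and ECONNRESET normally come from errno.h; their Linux
 * values are spelled out here to keep the sketch self-contained. */
#define SKETCH_EAGAIN		11	/* Linux EAGAIN */
#define SKETCH_ECONNRESET	104	/* Linux ECONNRESET */

/* Transient failure: vu_collect() found no vhost-user buffers */
static int tcp_vu_send_flag_sketch(void)
{
	return -SKETCH_EAGAIN;
}

/* Dispatcher: absorb -EAGAIN, which is equivalent to a packet dropped
 * on the wire, so callers only ever see fatal errors */
static int tcp_send_flag_sketch(void)
{
	int rc = tcp_vu_send_flag_sketch();

	if (rc == -SKETCH_EAGAIN)
		return 0;

	return rc;	/* e.g. -ECONNRESET: caller resets */
}
```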

Link: https://bugs.passt.top/show_bug.cgi?id=194
Signed-off-by: Anshu Kumari &lt;anskuma@redhat.com&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Replace send buffer boost with EPOLLOUT monitoring</title>
<updated>2026-04-20T21:20:41+00:00</updated>
<author>
<name>Yumei Huang</name>
<email>yuhuang@redhat.com</email>
</author>
<published>2026-03-20T10:32:14+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=831857e9b547ac27f868b6c24049c4da435b63fe'/>
<id>831857e9b547ac27f868b6c24049c4da435b63fe</id>
<content type='text'>
Currently we use the SNDBUF boost mechanism to force TCP auto-tuning.
However, it doesn't always work, and sometimes causes a lot of
retransmissions. As a result, the throughput suffers.

This patch replaces it by monitoring EPOLLOUT when sendmsg() fails
(with EAGAIN or EWOULDBLOCK) or completes only partially.
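
The trigger condition can be sketched as follows; tcp_wants_epollout()
is an invented name for illustration, deciding after each sendmsg()
result whether to arm EPOLLOUT on the socket.

```c
/* Sketch of the condition that arms EPOLLOUT in the scheme described
 * above: a sendmsg() that fails with EAGAIN/EWOULDBLOCK, or that
 * writes fewer bytes than requested, means the socket buffer is full
 * and we want a writability event before retrying. */
#define SKETCH_EAGAIN	11	/* Linux EAGAIN; EWOULDBLOCK is the same */

/* Return 1 if EPOLLOUT should be enabled after this sendmsg() result */
static int tcp_wants_epollout(long sent, int err, unsigned long requested)
{
	if (sent == -1)
		return err == SKETCH_EAGAIN;	/* full buffer, no progress */

	return (unsigned long)sent != requested;	/* partial send */
}
```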

Tested with iperf3 inside pasta: throughput is now comparable to running
iperf3 directly on the host without pasta. However, retransmissions can
still be elevated when RTT &gt;= 50ms. For example, when RTT is between
200ms and 500ms, retransmission count varies from 30 to 120 in roughly
80% of test runs.

Link: https://bugs.passt.top/show_bug.cgi?id=138
Link: https://github.com/containers/podman/issues/28219
Suggested-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Signed-off-by: Yumei Huang &lt;yuhuang@redhat.com&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tap, tcp, udp: Use rate-limited logging</title>
<updated>2026-04-15T18:59:35+00:00</updated>
<author>
<name>Anshu Kumari</name>
<email>anskuma@redhat.com</email>
</author>
<published>2026-04-10T10:37:37+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=5ac9cf11a512e6d54c94a2e6a7ddaabc10cb5b9b'/>
<id>5ac9cf11a512e6d54c94a2e6a7ddaabc10cb5b9b</id>
<content type='text'>
Now that rate-limited logging macros are available, promote several
debug messages to higher severity levels.  These messages were
previously kept at debug to prevent guests from flooding host
logs, but with rate limiting they can safely be made visible in
normal operation.

In tap.c, refactor tap4_is_fragment() to use warn_ratelimit() instead
of its ad-hoc rate limiting, and promote the guest MAC address change
message to info level.

In tcp.c, promote the invalid TCP SYN endpoint message to warn level.

In udp.c, promote dropped datagram messages to warn level, and
rate-limit the unrecoverable socket error message.

In udp_flow.c, promote flow allocation failures to err_ratelimit.
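
The gating logic behind macros such as warn_ratelimit() can be
sketched as a tiny helper; the interval and names are illustrative,
and passt's real macros differ in detail.

```c
/* Hedged sketch of rate-limited logging: allow at most one message
 * per interval, drop the rest.  Interval value is illustrative. */
#define RATELIMIT_INTERVAL	5	/* seconds between messages */

/* Returns 1 when a message may be emitted; *last tracks the time of
 * the previous emitted message for this call site */
static int ratelimit_ok(long *last, long now)
{
	if (now - *last >= RATELIMIT_INTERVAL) {
		*last = now;	/* window restarts at this message */
		return 1;
	}

	return 0;	/* within the window: drop this message */
}
```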

Link: https://bugs.passt.top/show_bug.cgi?id=134
Signed-off-by: Anshu Kumari &lt;anskuma@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>serialise: Split functions used for serialisation from util.c</title>
<updated>2026-03-28T13:35:41+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-03-27T04:34:16+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=8081aa597be4007f3ef15fe8ce11ec8ad0ab8b42'/>
<id>8081aa597be4007f3ef15fe8ce11ec8ad0ab8b42</id>
<content type='text'>
The read_all_buf() and write_all_buf() functions in util.c are
primarily used for serialising data structures to a stream during
migration.  We're going to have further use for such serialisation
when we add dynamic configuration updates, where we'll want to share
the code with the client program.

To make that easier, move the functions into a new serialise.c
file, and rename thematically.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>treewide: Spell ASSERT() as assert()</title>
<updated>2026-03-20T20:05:29+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-03-19T06:11:43+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=bc872d91765dfd6ff34b0e9a34bce410fac1cef3'/>
<id>bc872d91765dfd6ff34b0e9a34bce410fac1cef3</id>
<content type='text'>
The standard library assert(3), at least with glibc, hits our seccomp
filter and dies with SIGSYS before it's able to print a message, making it
near useless.  Therefore, since 7a8ed9459dfe ("Make assertions actually
useful") we've instead used our own implementation, named ASSERT().

This makes our code look slightly odd though - ASSERT() has the same
overall effect as assert(), it's just a different implementation.  More
importantly this makes it awkward to share code between passt/pasta proper
and things that compile in a more typical environment.  We're going to want
that for our upcoming dynamic configuration tool.

Address this by overriding the standard library's assert() implementation
with our own, instead of giving ours its own name.

The standard assert() is supposed to be omitted if NDEBUG is defined,
which ours doesn't do.  Implement that as well, so ours doesn't
unexpectedly differ.  For the -DNDEBUG case we do this by *not* overriding
assert(), since it will be a no-op anyway.  This requires a few places to
add a #include &lt;assert.h&gt; to let us compile (albeit with warnings) when
-DNDEBUG.
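
A minimal sketch of the override, simplified compared to passt's
actual macro (which logs through its own path rather than plain
stderr):

```c
/* When NDEBUG is not defined, shadow the standard assert() with a
 * version that prints and abort()s, instead of going through glibc's
 * __assert_fail() path, which trips the seccomp filter and dies with
 * SIGSYS before printing.  Simplified for illustration. */
#ifndef NDEBUG
#undef assert
#define assert(expr)						\
	do {							\
		if (!(expr)) {					\
			fprintf(stderr,				\
				"%s:%d: assertion %s failed\n",	\
				__FILE__, __LINE__, #expr);	\
			abort();				\
		}						\
	} while (0)
#endif
```

With -DNDEBUG the override is skipped entirely, matching the standard
behaviour of assert() compiling to a no-op.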

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
[sbrivio: Fix some conflicts and missing conversions as a result of
 applying "vu_common: Move iovec management into vu_collect()" first]
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>fwd: Unify TCP and UDP forwarding tables</title>
<updated>2026-03-11T21:11:30+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-03-11T12:03:11+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=d460ca3236bafa724686a5ad7f585d70962f7373'/>
<id>d460ca3236bafa724686a5ad7f585d70962f7373</id>
<content type='text'>
Currently TCP and UDP each have their own forwarding tables.  This is
awkward in a few places, where we need switch statements to select the
correct table.  More importantly, it would make things awkward and messy to
extend to other protocols in future, which we're likely to want to do.

Merge the TCP and UDP tables into a single table per (source) pif, with the
protocol given in each rule entry.
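
The merged table might look roughly like this; struct and field names
are invented for illustration, and passt's actual definitions differ.

```c
/* Hypothetical shape of a unified per-pif forwarding table: one rule
 * array with the protocol stored per entry, instead of one table per
 * protocol.  Protocol numbers are the IANA/Linux IPPROTO_TCP and
 * IPPROTO_UDP values. */
struct fwd_rule {
	unsigned char proto;		/* 6 for TCP, 17 for UDP */
	unsigned short orig_port;	/* port matched on the source pif */
	unsigned short fwd_port;	/* port used on the target side */
};

/* One lookup covers both protocols: no switch statement needed */
static int fwd_rule_match(const struct fwd_rule *r, unsigned char proto,
			  unsigned short port)
{
	if (r->proto != proto)
		return 0;

	return r->orig_port == port;
}
```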

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>fwd: Split forwarding table from port scanning state</title>
<updated>2026-03-11T21:11:30+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-03-11T12:03:10+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=bb2e4dda0f7c9b92195ab84920430659425afbc0'/>
<id>bb2e4dda0f7c9b92195ab84920430659425afbc0</id>
<content type='text'>
For historical reasons, struct fwd_ports contained both the new forwarding
table and some older state related to port scanning / auto-forwarding
detection.  They are related, but keeping them together prevents some
future reworks we want to do.

Separate them into struct fwd_table (for the table) and struct fwd_scan
for the scanning state.  Adjusting all the users makes for a logically
straightforward, but fairly extensive patch.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: Avoid comparison of expressions with different signedness in tcp_timer_handler()</title>
<updated>2026-03-06T07:24:01+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2026-03-04T16:32:03+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=ab77097d264b61c7c8937163435001ba90a431c8'/>
<id>ab77097d264b61c7c8937163435001ba90a431c8</id>
<content type='text'>
With gcc 14.2, building against musl 1.2.5 (slightly outdated Alpine
on x86_64):

tcp.c: In function 'tcp_timer_handler':
util.h:40:39: warning: comparison of integer expressions of different signedness: 'unsigned int' and 'int' [-Wsign-compare]
   40 | #define MIN(x, y)               (((x) &lt; (y)) ? (x) : (y))
      |                                       ^
tcp.c:2593:31: note: in expansion of macro 'MIN'
 2593 |                         max = MIN(TCP_MAX_RETRIES, max);
      |                               ^~~
util.h:40:54: warning: operand of '?:' changes signedness from 'int' to 'unsigned int' due to unsignedness of other operand [-Wsign-compare]
   40 | #define MIN(x, y)               (((x) &lt; (y)) ? (x) : (y))
      |                                                      ^~~
tcp.c:2593:31: note: in expansion of macro 'MIN'
 2593 |                         max = MIN(TCP_MAX_RETRIES, max);
      |                               ^~~

For some reason, this is not reported by gcc when building against glibc.

Make the temporary 'max' variable unsigned, as we know it can't be
negative anyway.

While at it, add the customary blank line between variable
declarations and code.
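
In isolation, the fix amounts to keeping both operands of the
comparison unsigned; TCP_MAX_RETRIES_SKETCH below is an illustrative
stand-in for the real constant.

```c
/* With the limit being an unsigned constant, making 'max' unsigned
 * means both operands of the comparison share a type, and the
 * -Wsign-compare warning from MIN() goes away. */
#define TCP_MAX_RETRIES_SKETCH	8u	/* illustrative value */

static unsigned int clamp_retries(unsigned int max)	/* was int */
{
	if (max > TCP_MAX_RETRIES_SKETCH)	/* unsigned vs unsigned */
		max = TCP_MAX_RETRIES_SKETCH;

	return max;
}
```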

Fixes: 3dde0e07804e ("tcp: Update data retransmission timeout")
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
</entry>
<entry>
<title>fwd, pif: Replace pif_sock_l4() with pif_listen()</title>
<updated>2026-03-04T16:52:55+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-03-02T04:31:34+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=9ee780567310f2486e1510db502a96e5b1a81a1c'/>
<id>9ee780567310f2486e1510db502a96e5b1a81a1c</id>
<content type='text'>
It turns out all users of pif_sock_l4() use it for "listening" sockets,
which now all have a common epoll reference format.  We can take advantage
of that to pass the necessary epoll information into pif_sock_l4() in a
more natural way, rather than as an opaque u32.

That in turn allows union fwd_listen_ref to become a struct, since the
union only existed to allow the meaningful fields to be coerced into a u32
for pif_sock_l4().

Rename pif_sock_l4() to pif_listen() to reflect the new semantics.  While
we're there, remove the static_assert() on the fwd_listen_ref's size.  We
do still need it to fit into 32 bits, but that constraint is imposed only
by the fact that it needs to fit into the whole epoll_ref structure,
which we already check with a static_assert() in epoll_ctl.h.
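
The pattern being removed can be sketched like this; field names are
illustrative.

```c
/* Sketch of the old arrangement: a reference union whose fields were
 * packed only so they could be coerced into the 32-bit value that
 * pif_sock_l4() took.  As a plain struct, the fields are passed
 * naturally instead. */
union fwd_listen_ref_sketch {
	struct {
		unsigned char pif;	/* source interface */
		unsigned char proto;	/* L4 protocol */
		unsigned short port;	/* bound port */
	};
	unsigned int u32;	/* opaque form handed to pif_sock_l4() */
};
```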

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Reported-by: Peter Foley &lt;pefoley@google.com&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
</entry>
</feed>
