passt/tcp_splice.c, branch 2025_06_11.0293c6f

tcp_splice: Don't clobber errno before checking for EAGAIN

2025-04-09T20:57:27+00:00

Like many places, tcp_splice_sock_handler() needs to handle EAGAIN
specially, in this case for both of its splice() calls.  Unfortunately it
tests for EAGAIN some time after those calls.  In between there has been
at least a flow_trace() which could have clobbered errno.  Move the test on
errno closer to the relevant system calls to avoid this problem.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_splice: Don't double count bytes read on EINTR

2025-04-09T20:57:16+00:00

In tcp_splice_sock_handler(), if we get an EINTR on our second splice()
(pipe to output socket) we - as we should - go back and retry it.  However,
we do so *after* we've already updated our byte counters.  That does no
harm for the conn->written[] counter - since the second splice() returned
an error it will be advanced by 0.  However we also advance the
conn->read[] counter, and then do so again when the splice() succeeds.
This results in the counters being out of sync, and us thinking we have
remaining data in the pipe when we don't, which can leave us in an
infinite loop once the stream finishes.

Fix this by moving the EINTR handling to directly next to the splice()
call (which is what we usually do for EINTR).  As a bonus this removes one
mildly confusing goto.

For symmetry, also rework the EINTR handling on the first splice() the same
way, although that doesn't (as far as I can tell) have buggy side effects.

Link: https://github.com/containers/podman/issues/23686#issuecomment-2779347687
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

flow: Add flow_perror() helper

2025-02-18T12:33:12+00:00

Our general logging helpers include a number of _perror() variants which,
like perror(3) include the description of the current errno.  We didn't
have those for our flow specific logging helpers, though.  Fill this gap
with flow_perror() and flow_dbg_perror(), and use them where it's useful.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_splice: A typo three years ago and SO_RCVLOWAT is gone

2025-02-17T07:28:45+00:00

In commit e5eefe77435a ("tcp: Refactor to use events instead of
states, split out spliced implementation"), this:

			if (!bitmap_isset(rcvlowat_set, conn - ts) &&
			    readlen > (long)c->tcp.pipe_size / 10) {

(note the !) became:

			if (conn->flags & lowat_set_flag &&
			    readlen > (long)c->tcp.pipe_size / 10) {

in the new tcp_splice_sock_handler().

We want to check, there, if we should set SO_RCVLOWAT, only if we
haven't set it already.

But, instead, we're checking if it's already set before we set it, so
we'll never set it, of course.

Fix the check and re-enable the functionality, which should give us
improved CPU utilisation in non-interactive cases where we are not
transferring at full pipe capacity.

Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_splice: Don't wake up on input data if we can't write it anywhere

2025-02-17T07:27:30+00:00

If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
given flow, it means that we're blocked, waiting for the receiver to
actually receive data, with a full pipe.

In that case, if we keep EPOLLIN set for the socket on the other side
(our receiving side), we'll get into a loop such as:

  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
  41.0230:          pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from read-side call
  41.0230:          Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
  41.0230:          Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577

leading to 100% CPU usage, of course.

Drop EPOLLIN on our receiving side as long when we're waiting for
output readiness on the other side.

Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values

2025-02-14T11:02:55+00:00

I added this a long long time ago because it dramatically improved
throughput back then: with rmem_max and wmem_max >= 4 MiB, we would
force send and receive buffer sizes for TCP sockets to the maximum
allowed value.

This effectively disables TCP auto-tuning, which would otherwise allow
us to exceed those limits, as crazy as it might sound. But in any
case, it made sense.

Now that we have zero (internal) copies on every path, plus vhost-user
support, it turns out that these settings are entirely obsolete. I get
substantially the same throughput in every test we perform, even with
very short durations (one second).

The settings are not just useless: they actually cause us quite some
trouble on guest state migration, because they lead to huge queues
that need to be moved as well.

Drop those settings.

Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_splice, udp_flow: fcntl64() support on PPC64 depends on glibc version

2025-02-03T21:43:35+00:00

I explicitly added fcntl64() to the list of allowed system calls for
PPC64 a while ago, and now it turns out it's not available in recent
Debian builds. The warning from seccomp.sh is harmless because we
unconditionally try to enable fcntl() anyway, but take care of it
anyway.

Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=ppc64&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477147&raw=0
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

util: Rename and make global vu_remove_watch()

2025-02-03T06:32:51+00:00

vu_remove_watch() is used in vhost_user.c to remove an fd from the global
epoll set.  There's nothing really vhost user specific about it though,
so rename, move to util.c and use it in a bunch of places outside
vhost_user.c where it makes things marginally more readable.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp_splice: Set (again) TCP_NODELAY on both sides

2025-01-06T09:10:29+00:00

In commit 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced
sockets") I just assumed that we wouldn't benefit from disabling
Nagle's algorithm once we drop TCP_CORK (and its 200ms fixed delay).

It turns out that with some patterns, such as a PostgreSQL server
in a container receiving parameterised, short queries, for which pasta
sees several short inbound messages (Parse, Bind, Describe, Execute
and Sync commands getting each one their own packet, 5 to 49 bytes TCP
payload each), we'll read them usually in two batches, and send them
in matching batches, for example:

  9165.2467:          pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
  9165.2468:          Flow 0 (TCP connection (spliced)): 76 from read-side call
  9165.2468:          Flow 0 (TCP connection (spliced)): 76 from write-side call (passed 524288)
  9165.2469:          pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
  9165.2470:          Flow 0 (TCP connection (spliced)): 15 from read-side call
  9165.2470:          Flow 0 (TCP connection (spliced)): 15 from write-side call (passed 524288)
  9165.2944:          pasta: epoll event on connected spliced TCP socket 118 (events: 0x00000001)

and the kernel delivers the first one, waits for acknowledgement from
the receiver, then delivers the second one. This adds very substantial
and unnecessary delay. It's usually a fixed ~40ms between the two
batches, which is clearly unacceptable for loopback connections.

In this example, the delay is shown by the timestamp of the response
from socket 118. The peer (server) doesn't actually take that long
(less than a millisecond), but it takes that long for the kernel to
deliver our request.

To avoid batching and delays, disable Nagle's algorithm by setting
TCP_NODELAY on both internal and external sockets: this way, we get
one inbound packet for each original message, we transfer them right
away, and the kernel delivers them to the process in the container as
they are, without delay.

We can do this safely as we don't care much about network utilisation
when there's in fact pretty much no network (loopback connections).

This is unfortunately not visible in the TCP request-response tests
from the test suite because, with smaller messages (we use one byte),
Nagle's algorithm doesn't even kick in. It's probably not trivial to
implement a universal test covering this case.

Fixes: 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced sockets")
Signed-off-by: Stefano Brivio

treewide: Dodge dynamic memory allocation in strerror() from glibc > 2.40

2024-12-11T11:21:23+00:00

With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.

This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).

Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).

Reported-by: Paul Holzinger 
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger 
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson 
Tested-by: Paul Holzinger