<feed xmlns='http://www.w3.org/2005/Atom'>
<title>passt/tcp_splice.c, branch podman23739</title>
<subtitle>Plug A Simple Socket Transport</subtitle>
<link rel='alternate' type='text/html' href='https://passt.top/passt/'/>
<entry>
<title>flow, treewide: Promote priority of selected flow-linked messages</title>
<updated>2026-06-09T02:28:20+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-06-05T12:30:40+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=dd8923b8adb9ab1e1ad79727ee0a912131f6e2cb'/>
<id>dd8923b8adb9ab1e1ad79727ee0a912131f6e2cb</id>
<content type='text'>
Most of our flow specific log messages are debug level for fear of flooding
the logs, even when they report real error conditions that might be of
significance.

Now that we have the mechanisms for log message rate limiting, we can do
better.  Promote many flow related messages to warning or error level, with
rate limiting.  While we're there add ratelimiting to a handful of existing
warning or error level messages.

The general heuristic is to promote messages that report a failure which
is not something that should be triggered by the guest doing something
weird.  This mostly means failures from socket operations we expect to be
legitimate.

Adding the ratelimiting means plumbing the 'now' timestamp through much
more of the code, hence the large churn.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Most of our flow specific log messages are debug level for fear of flooding
the logs, even when they report real error conditions that might be of
significance.

Now that we have the mechanisms for log message rate limiting, we can do
better.  Promote many flow related messages to warning or error level, with
rate limiting.  While we're there add ratelimiting to a handful of existing
warning or error level messages.

The general heuristic is to promote messages that report a failure which
is not something that should be triggered by the guest doing something
weird.  This mostly means failures from socket operations we expect to be
legitimate.

Adding the ratelimiting means plumbing the 'now' timestamp through much
more of the code, hence the large churn.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp_splice: Improve EOF and read stall exit conditions</title>
<updated>2026-06-05T07:46:52+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-06-05T00:34:16+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=21f4d13c4cd4db24b65926265c98d5f41f0c6a9b'/>
<id>21f4d13c4cd4db24b65926265c98d5f41f0c6a9b</id>
<content type='text'>
At the end of our loop we have a conditional 'break' that exits if we're
at EOF on the read side and have nothing left in the pipe.  This makes
sense: at EOF there's nothing left to do read-side and with nothing in the
pipe there's nothing to do write side either.

The same is true if the read side hit an EAGAIN and the pipe is empty:
there's nothing we can do (for now) read side, and with an empty pipe
nothing write side either.  So, generalise the condition to exit on either
EOF or EAGAIN read side.

Furthermore, if the read side is at EOF or EAGAIN and there's already
nothing in the pipe before the write-side splice(), then that write side
splice() can't accomplish anything, so exit the loop early in that case
avoiding a harmless but unnecessary write-splice().

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
[sbrivio: Minor comment fix]
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
At the end of our loop we have a conditional 'break' that exits if we're
at EOF on the read side and have nothing left in the pipe.  This makes
sense: at EOF there's nothing left to do read-side and with nothing in the
pipe there's nothing to do write side either.

The same is true if the read side hit an EAGAIN and the pipe is empty:
there's nothing we can do (for now) read side, and with an empty pipe
nothing write side either.  So, generalise the condition to exit on either
EOF or EAGAIN read side.

Furthermore, if the read side is at EOF or EAGAIN and there's already
nothing in the pipe before the write-side splice(), then that write side
splice() can't accomplish anything, so exit the loop early in that case
avoiding a harmless but unnecessary write-splice().

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
[sbrivio: Minor comment fix]
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp_splice: Remove questionable "optimisation" of pending bytes tracking</title>
<updated>2026-06-04T04:35:29+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-05-28T05:02:12+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=2420ad2f8f86494f9318ceaeea391502ee947095'/>
<id>2420ad2f8f86494f9318ceaeea391502ee947095</id>
<content type='text'>
We have a special path that avoids updating conn-&gt;pending when the amounts
read and written are equal.  This has a conceptual complexity cost, in
particular, it means that conn-&gt;pending[] is not accurate to its normal
meaning for a section of the loop body.

conn-&gt;pending[] shares a cacheline with conn-&gt;pipe[] and conn-&gt;s[], so it's
almost certainly cache-hot.  It's questionable that avoiding the update
of pending even outweighs the extra conditional branch, let alone saves
anything of significance.  Remove it.

This allows us to move the updates to conn-&gt;pending closer to the actual
splice() calls, making it easier to reason about its value.  It also lets
us move the conn-&gt;pending updates so they can piggy back on existing tests
rather than needing a conditional expression to avoid clobbering it when
splice() returns -1 (EAGAIN).

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We have a special path that avoids updating conn-&gt;pending when the amounts
read and written are equal.  This has a conceptual complexity cost, in
particular, it means that conn-&gt;pending[] is not accurate to its normal
meaning for a section of the loop body.

conn-&gt;pending[] shares a cacheline with conn-&gt;pipe[] and conn-&gt;s[], so it's
almost certainly cache-hot.  It's questionable that avoiding the update
of pending even outweighs the extra conditional branch, let alone saves
anything of significance.  Remove it.

This allows us to move the updates to conn-&gt;pending closer to the actual
splice() calls, making it easier to reason about its value.  It also lets
us move the conn-&gt;pending updates so they can piggy back on existing tests
rather than needing a conditional expression to avoid clobbering it when
splice() returns -1 (EAGAIN).

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp_splice: Simplify / correct OUT_WAIT flag handling</title>
<updated>2026-06-04T04:35:26+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-05-28T05:02:11+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=4ccb2eebaa024a42cc4d5ba0112ade67666a3446'/>
<id>4ccb2eebaa024a42cc4d5ba0112ade67666a3446</id>
<content type='text'>
We set the OUT_WAIT flag if we stop forwarding due to EAGAIN, but there's
still data in the pipe.  That ensures we wake up when the output socket has
room to drain the pipe into.

We clear the OUT_WAIT flag when we complete forwarding on an EPOLLOUT
event, but that's not quite right.  Even though it's called on an EPOLLOUT,
tcp_splice_forward() could, in principle, empty the pipe, but also read
enough new data from the other side to fill it again.  That would set
OUT_WAIT internally, but it would be cleared after returning meaning
we could miss a necessary wakeup.

The condition on whether we need write side wakeups is actually fairly
simple: we need them if and only if we return to the main loop with data
in the pipe.  Maintain that in a single place - right after we exit the
forwarding loop in tcp_splice_forward().

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We set the OUT_WAIT flag if we stop forwarding due to EAGAIN, but there's
still data in the pipe.  That ensures we wake up when the output socket has
room to drain the pipe into.

We clear the OUT_WAIT flag when we complete forwarding on an EPOLLOUT
event, but that's not quite right.  Even though it's called on an EPOLLOUT,
tcp_splice_forward() could, in principle, empty the pipe, but also read
enough new data from the other side to fill it again.  That would set
OUT_WAIT internally, but it would be cleared after returning meaning
we could miss a necessary wakeup.

The condition on whether we need write side wakeups is actually fairly
simple: we need them if and only if we return to the main loop with data
in the pipe.  Maintain that in a single place - right after we exit the
forwarding loop in tcp_splice_forward().

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp_splice: Simplify shutdown(2) handling</title>
<updated>2026-06-04T04:35:23+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-05-28T05:02:10+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=cd61ad02d4181e8fa34e2bcb8f8436fd9a64714b'/>
<id>cd61ad02d4181e8fa34e2bcb8f8436fd9a64714b</id>
<content type='text'>
At the end of tcp_splice_forward(), we check for half-closed connections
in either direction and propagate the FIN to the other side with a
shutdown(2).

However, it's unnecessary to check both directions: a FIN from side X will
cause an EPOLLRDHUP on side X's socket, which will trigger
tcp_splice_forward() from side X to side !X.  Likewise for the other side.
So we only need to check for "forward" FIN propagation.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
At the end of tcp_splice_forward(), we check for half-closed connections
in either direction and propagate the FIN to the other side with a
shutdown(2).

However, it's unnecessary to check both directions: a FIN from side X will
cause an EPOLLRDHUP on side X's socket, which will trigger
tcp_splice_forward() from side X to side !X.  Likewise for the other side.
So we only need to check for "forward" FIN propagation.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp_splice: Remove goto from forwarding loop</title>
<updated>2026-06-04T04:35:21+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-05-28T05:02:09+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=987ac99098480d403b8aa922736c239a4aa6de1b'/>
<id>987ac99098480d403b8aa922736c239a4aa6de1b</id>
<content type='text'>
The forwarding loop in tcp_splice_forward() has a retry label that we goto
in some cases.  However, the only difference between a 'goto retry' and
a 'continue' is that the 'continue' will reset the 'more' variable to 0.

The first goto retry only occurs if never_read is set, which can only be
the case if we never changed 'more' in the first place, so is strictly
equivalent to a continue.  In the second case, 'more' can be set though.

'more' is set by a heuristic that if we're able to read most of a pipe's
worth of data at once, there's probably more coming, so we should prepare
the write-side for that.  However, on a goto retry we have a new read side
splice.  If this time we *don't* get most of a pipe's worth of data, that
suggests that contrary to expectations from the previous loop we have now
temporarily run out of input data and so SPLICE_F_MORE is no longer
a good guess for the next write side splice().  In other words, the second
read-splice() gives us better data for the heuristic than keeping our guess
from the first one, so resetting 'more' is valuable.

So, we could replace both gotos with continues.  But they're already at the
end of the loop body, so a continue is a no-op.  Just remove them.  That, in
turn removes the need for the never_read variable.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The forwarding loop in tcp_splice_forward() has a retry label that we goto
in some cases.  However, the only difference between a 'goto retry' and
a 'continue' is that the 'continue' will reset the 'more' variable to 0.

The first goto retry only occurs if never_read is set, which can only be
the case if we never changed 'more' in the first place, so is strictly
equivalent to a continue.  In the second case, 'more' can be set though.

'more' is set by a heuristic that if we're able to read most of a pipe's
worth of data at once, there's probably more coming, so we should prepare
the write-side for that.  However, on a goto retry we have a new read side
splice.  If this time we *don't* get most of a pipe's worth of data, that
suggests that contrary to expectations from the previous loop we have now
temporarily run out of input data and so SPLICE_F_MORE is no longer
a good guess for the next write side splice().  In other words, the second
read-splice() gives us better data for the heuristic than keeping our guess
from the first one, so resetting 'more' is valuable.

So, we could replace both gotos with continues.  But they're already at the
end of the loop body, so a continue is a no-op.  Just remove them.  That, in
turn removes the need for the never_read variable.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp_splice: Improve EOF exit condition for the loop</title>
<updated>2026-06-04T04:35:18+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-05-28T05:02:08+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=4a6187008f1ac3db2e221b3b21151f4d72fa8821'/>
<id>4a6187008f1ac3db2e221b3b21151f4d72fa8821</id>
<content type='text'>
In tcp_splice_forward() we exit the forwarding loop if we have an EOF on
the read side.  However, this potentially leaves data in the pipe, even if
the write side hasn't yet blocked.  It's not clear to me whether this could
leave data indefinitely in the pipe with no events to keep it moving,
but it's not clear to me that it couldn't either.

Stay in the loop until either the write side blocks or we've emptied
the pipe.

Secondly, this test is after several tests on how much we wrote which
might also cause a retry.  However, if we've reached EOF and the pipe is
empty, there's nothing more to do, regardless of how much we wrote, so
we should exit, regardless of those conditions.  So move this exit test
above the retry conditions.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In tcp_splice_forward() we exit the forwarding loop if we have an EOF on
the read side.  However, this potentially leaves data in the pipe, even if
the write side hasn't yet blocked.  It's not clear to me whether this could
leave data indefinitely in the pipe with no events to keep it moving,
but it's not clear to me that it couldn't either.

Stay in the loop until either the write side blocks or we've emptied
the pipe.

Secondly, this test is after several tests on how much we wrote which
might also cause a retry.  However, if we've reached EOF and the pipe is
empty, there's nothing more to do, regardless of how much we wrote, so
we should exit, regardless of those conditions.  So move this exit test
above the retry conditions.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling</title>
<updated>2026-06-04T04:35:16+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-05-28T05:02:07+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=1d475b1dbf22dac4a9fda3ebb444b0343392c5c3'/>
<id>1d475b1dbf22dac4a9fda3ebb444b0343392c5c3</id>
<content type='text'>
There are two ways we can tell one of our sockets has received a FIN.  We
can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
(EOF) on the socket.  We currently use both, in a mildly confusing way:
we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
some other close out logic is based on seeing an EOF.

Simplify this by setting the flag based on only the EOF.  To make sure we
don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
forwarding path for EPOLLRDHUP as well as EPOLLIN.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There are two ways we can tell one of our sockets has received a FIN.  We
can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
(EOF) on the socket.  We currently use both, in a mildly confusing way:
we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
some other close out logic is based on seeing an EOF.

Simplify this by setting the flag based on only the EOF.  To make sure we
don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
forwarding path for EPOLLRDHUP as well as EPOLLIN.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp_splice: Remove never-invoked SO_RCVLOWAT logic</title>
<updated>2026-06-04T04:35:02+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-05-28T05:02:06+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=630e9bf1decf94618a036a020b7c920c8ab6126c'/>
<id>630e9bf1decf94618a036a020b7c920c8ab6126c</id>
<content type='text'>
tcp_splice_forward() contains some logic to use the SO_RCVLOWAT
setsockopt().  This appears to be aimed at interrupt (epoll) mitigation, so
that we're not always waking for a socket that's getting frequent small
amounts of data.

However, the logic is never invoked, and hasn't been since at least
2022_07_14.b86cd00:  it's conditional on
    readlen &gt; (long)c-&gt;tcp.pipe_size / 10
However, immediately before that we've invoked 'continue' if:
    readlen &gt;= (long)c-&gt;tcp.pipe_size * 10 / 100
which is a strictly weaker condition.

While it's possible we want to restore a working version of that interrupt
mitigation at some point, for the time being this logic just confuses the
picture and makes some other cleanups more awkward.  We haven't had it
for over 3 years, so it's clearly not vital.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
tcp_splice_forward() contains some logic to use the SO_RCVLOWAT
setsockopt().  This appears to be aimed at interrupt (epoll) mitigation, so
that we're not always waking for a socket that's getting frequent small
amounts of data.

However, the logic is never invoked, and hasn't been since at least
2022_07_14.b86cd00:  it's conditional on
    readlen &gt; (long)c-&gt;tcp.pipe_size / 10
However, immediately before that we've invoked 'continue' if:
    readlen &gt;= (long)c-&gt;tcp.pipe_size * 10 / 100
which is a strictly weaker condition.

While it's possible we want to restore a working version of that interrupt
mitigation at some point, for the time being this logic just confuses the
picture and makes some other cleanups more awkward.  We haven't had it
for over 3 years, so it's clearly not vital.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp, tcp_splice: Make helper for setting SO_LINGER socket option</title>
<updated>2026-05-27T08:17:18+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2026-05-13T07:18:20+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=98e3c015b3791ff55381e5ee687f541721d1695e'/>
<id>98e3c015b3791ff55381e5ee687f541721d1695e</id>
<content type='text'>
Both spliced and non-spliced TCP in some cases set the SO_LINGER socket
option in order to force a TCP RST on a socket side connection.  In each
case we open code the setsockopt() logic.  We're shortly going to add
another place that needs this, so move the setsockopt() and error handling
logic into a shared helper.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Both spliced and non-spliced TCP in some cases set the SO_LINGER socket
option in order to force a TCP RST on a socket side connection.  In each
case we open code the setsockopt() logic.  We're shortly going to add
another place that needs this, so move the setsockopt() and error handling
logic into a shared helper.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
