<feed xmlns='http://www.w3.org/2005/Atom'>
<title>passt, branch 2023_11_19.4f1709d</title>
<subtitle>Plug A Simple Socket Transport</subtitle>
<link rel='alternate' type='text/html' href='https://passt.top/passt/'/>
<entry>
<title>valgrind: Don't disable optimizations for valgrind builds</title>
<updated>2023-11-19T08:10:30+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-11-16T09:15:59+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=4f1709db1b61c14729a6313d860323ec65772a37'/>
<id>4f1709db1b61c14729a6313d860323ec65772a37</id>
<content type='text'>
When we plan to use valgrind, we need to build passt a bit differently:
  * We need debug symbols so that valgrind can match problems it finds to
    meaningful locations
  * We need to allow additional syscalls in the seccomp filter, because
    valgrind's wrappers need them

Currently we also disable optimization (-O0).  This is unfortunate, because
it will make valgrind tests even slower than they'd otherwise be.  Worse,
it's possible that the asm behaviour without optimization might be
different enough that valgrind could miss a use of an uninitialized
variable or other fault it would otherwise detect.

I suspect this was originally done because without it inlining could mean
that suppressions we use don't reliably match the places we want them to.
Alas, it turns out this is true even with -O0.  We've now implemented a
more robust workaround for this (explicit ((noinline)) attributes when
compiled with -DVALGRIND).  So, we can re-enable optimization for valgrind
builds, making them closer to regular builds.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When we plan to use valgrind, we need to build passt a bit differently:
  * We need debug symbols so that valgrind can match problems it finds to
    meaningful locations
  * We need to allow additional syscalls in the seccomp filter, because
    valgrind's wrappers need them

Currently we also disable optimization (-O0).  This is unfortunate, because
it will make valgrind tests even slower than they'd otherwise be.  Worse,
it's possible that the asm behaviour without optimization might be
different enough that valgrind could miss a use of an uninitialized
variable or other fault it would otherwise detect.

I suspect this was originally done because without it inlining could mean
that suppressions we use don't reliably match the places we want them to.
Alas, it turns out this is true even with -O0.  We've now implemented a
more robust workaround for this (explicit ((noinline)) attributes when
compiled with -DVALGRIND).  So, we can re-enable optimization for valgrind
builds, making them closer to regular builds.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>valgrind: Adjust suppression for MSG_TRUNC with NULL buffer</title>
<updated>2023-11-19T08:10:12+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-11-16T09:15:58+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=f7724647b19e0e20d6a11e0405f15e4ff169394e'/>
<id>f7724647b19e0e20d6a11e0405f15e4ff169394e</id>
<content type='text'>
valgrind complains if we pass a NULL buffer to recv(), even if we use
MSG_TRUNC, in which case it's actually safe.  For a long time we've had
a valgrind suppression for this.  It singles out the recv() in
tcp_sock_consume(), the only place we use MSG_TRUNC.

However, tcp_sock_consume() only has a single caller, which makes it a
prime candidate for inlining.  If inlined, it won't appear on the stack and
valgrind won't match the correct suppression.

It appears that certain compiler versions (for example gcc-13.2.1 in
Fedora 39) will inline this function even with the -O0 we use for valgrind
builds.  This breaks the suppression leading to a spurious failure in the
tests.

There's not really any way to adjust the suppression itself without making
it overly broad (we don't want to match other recv() calls).  So, as a hack,
explicitly prevent inlining of this function when we're making a valgrind
build.  To accomplish this, add an explicit -DVALGRIND when making such a
build.
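
For illustration only, the workaround might look roughly like the sketch
below.  This is not the actual passt code: NOINLINE_FOR_VALGRIND and
discard_queued() are hypothetical names, and the sketch assumes Linux TCP
semantics, where recv() with MSG_TRUNC discards queued data instead of
copying it, so a NULL buffer is acceptable.

```c
/* <![CDATA[ guard: keeps this listing well-formed inside the XML feed */
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical macro: only valgrind builds (-DVALGRIND) forbid inlining,
 * so the function keeps its own stack frame for suppressions to match.
 */
#ifdef VALGRIND
#define NOINLINE_FOR_VALGRIND __attribute__((noinline))
#else
#define NOINLINE_FOR_VALGRIND
#endif

/* Hypothetical helper: discard up to len bytes queued on TCP socket s.
 * Returns the recv() result: bytes discarded, or -1 with errno set.
 */
NOINLINE_FOR_VALGRIND
ssize_t discard_queued(int s, size_t len)
{
	return recv(s, NULL, len, MSG_DONTWAIT | MSG_TRUNC);
}
/* ]]> */
```

In a regular build the macro expands to nothing, so the compiler remains
free to inline the function.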

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
valgrind complains if we pass a NULL buffer to recv(), even if we use
MSG_TRUNC, in which case it's actually safe.  For a long time we've had
a valgrind suppression for this.  It singles out the recv() in
tcp_sock_consume(), the only place we use MSG_TRUNC.

However, tcp_sock_consume() only has a single caller, which makes it a
prime candidate for inlining.  If inlined, it won't appear on the stack and
valgrind won't match the correct suppression.

It appears that certain compiler versions (for example gcc-13.2.1 in
Fedora 39) will inline this function even with the -O0 we use for valgrind
builds.  This breaks the suppression leading to a spurious failure in the
tests.

There's not really any way to adjust the suppression itself without making
it overly broad (we don't want to match other recv() calls).  So, as a hack,
explicitly prevent inlining of this function when we're making a valgrind
build.  To accomplish this, add an explicit -DVALGRIND when making such a
build.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>udp,pasta: Periodically scan for ports to automatically forward</title>
<updated>2023-11-19T08:08:39+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-11-15T05:25:34+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=457ff122e33cf6a6e559b073f41c530e42d9c597'/>
<id>457ff122e33cf6a6e559b073f41c530e42d9c597</id>
<content type='text'>
pasta supports automatic port forwarding, where we look for listening
sockets in /proc/net (in both namespace and outside) and establish port
forwarding to match.

For TCP we do this scan both at initial startup and periodically
thereafter.  For UDP, however, we currently only scan at startup.  So,
unlike TCP, we won't update forwarding to handle services that start after
pasta has begun.

There's no particular reason for that, other than that we didn't implement
it.  So, remove that difference, by scanning for new UDP forwards
periodically too.  The logic is basically identical to that for TCP, but it
needs some changes to handle the mildly different data structures in the
UDP case.
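
A rough sketch of the per-line parsing such a scan involves, under stated
assumptions: it relies on the usual /proc/net/udp line layout (slot number,
then hexadecimal local address and port), it ignores the socket state, and
udp_line_port(), port_map_set() and port_map_isset() are hypothetical
names, not passt functions.

```c
/* <![CDATA[ guard: keeps this listing well-formed inside the XML feed */
#include <stdint.h>
#include <stdio.h>

#define NUM_PORTS 65536

/* Return the local port from one /proc/net/udp line, or -1 if the line
 * doesn't parse (for example the header line).
 */
int udp_line_port(const char *line)
{
	unsigned int port;

	/* Skip the slot number and hex local address, then read the hex port */
	if (sscanf(line, " %*d: %*[0-9A-Fa-f]:%x", &port) != 1)
		return -1;

	return port < NUM_PORTS ? (int)port : -1;
}

/* Mark a port in a bitmap holding one bit per port number */
void port_map_set(uint8_t *map, int port)
{
	map[port / 8] |= 1 << (port % 8);
}

/* Return 1 if the port is marked in the bitmap, 0 otherwise */
int port_map_isset(const uint8_t *map, int port)
{
	return !!(map[port / 8] & (1 << (port % 8)));
}
/* ]]> */
```

A periodic rescan could rebuild such a map from every line of the file on
each timer tick, then compare it with the previous map to decide which
forwards to add or remove.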

Link: https://bugs.passt.top/show_bug.cgi?id=45
Link: https://github.com/rootless-containers/rootlesskit/issues/383
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
pasta supports automatic port forwarding, where we look for listening
sockets in /proc/net (in both namespace and outside) and establish port
forwarding to match.

For TCP we do this scan both at initial startup and periodically
thereafter.  For UDP, however, we currently only scan at startup.  So,
unlike TCP, we won't update forwarding to handle services that start after
pasta has begun.

There's no particular reason for that, other than that we didn't implement
it.  So, remove that difference, by scanning for new UDP forwards
periodically too.  The logic is basically identical to that for TCP, but it
needs some changes to handle the mildly different data structures in the
UDP case.

Link: https://bugs.passt.top/show_bug.cgi?id=45
Link: https://github.com/rootless-containers/rootlesskit/issues/383
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Simplify away tcp_port_rebind()</title>
<updated>2023-11-19T08:08:37+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-11-15T05:25:33+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=4ccdeecb744d48e0c70386d561d34ced860bfacd'/>
<id>4ccdeecb744d48e0c70386d561d34ced860bfacd</id>
<content type='text'>
tcp_port_rebind() is designed to be called from NS_CALL() and has two
disjoint cases: one where it enters the namespace (outbound forwards) and
one where it doesn't (inbound forwards).

We only actually need the NS_CALL() framing for the outbound case; for
inbound we can just call tcp_port_do_rebind() directly.  So simplify
tcp_port_rebind() to tcp_port_rebind_outbound(), allowing us to eliminate
an awkward parameters structure.

With that done we can safely rename tcp_port_do_rebind() to
tcp_port_rebind() for brevity.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
tcp_port_rebind() is designed to be called from NS_CALL() and has two
disjoint cases: one where it enters the namespace (outbound forwards) and
one where it doesn't (inbound forwards).

We only actually need the NS_CALL() framing for the outbound case; for
inbound we can just call tcp_port_do_rebind() directly.  So simplify
tcp_port_rebind() to tcp_port_rebind_outbound(), allowing us to eliminate
an awkward parameters structure.

With that done we can safely rename tcp_port_do_rebind() to
tcp_port_rebind() for brevity.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Use common helper for rebinding inbound and outbound ports</title>
<updated>2023-11-19T08:08:32+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-11-15T05:25:32+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=1776e7af9b25cfbb1e28eca94d9052563dad3724'/>
<id>1776e7af9b25cfbb1e28eca94d9052563dad3724</id>
<content type='text'>
tcp_port_rebind() has two cases with almost but not quite identical code.
Simplify things a bit by factoring this out into a single parameterised
helper, tcp_port_do_rebind().

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
tcp_port_rebind() has two cases with almost but not quite identical code.
Simplify things a bit by factoring this out into a single parameterised
helper, tcp_port_do_rebind().

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>clang-tidy: Suppress silly misc-include-cleaner warnings</title>
<updated>2023-11-19T08:08:18+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-11-15T02:59:45+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=3be9e0010ea7329ae0f3707f67ac4cf0bac13d54'/>
<id>3be9e0010ea7329ae0f3707f67ac4cf0bac13d54</id>
<content type='text'>
clang-tidy from LLVM 17.0.3 (which is in Fedora 39) includes a new
"misc-include-cleaner" warning that tries to make sure that headers
*directly* provide the things that are used in the .c file.  That sounds
great in theory but is in practice unusable:

Quite a few common things in the standard library are ultimately provided
by OS-specific system headers, but for portability should be accessed via
closer-to-standardised library headers.  This will warn constantly about
such cases: e.g. it will want you to include &lt;linux/limits.h&gt; instead of
&lt;limits.h&gt; to get PATH_MAX.

So, suppress this warning globally in the Makefile.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
clang-tidy from LLVM 17.0.3 (which is in Fedora 39) includes a new
"misc-include-cleaner" warning that tries to make sure that headers
*directly* provide the things that are used in the .c file.  That sounds
great in theory but is in practice unusable:

Quite a few common things in the standard library are ultimately provided
by OS-specific system headers, but for portability should be accessed via
closer-to-standardised library headers.  This will warn constantly about
such cases: e.g. it will want you to include &lt;linux/limits.h&gt; instead of
&lt;limits.h&gt; to get PATH_MAX.

So, suppress this warning globally in the Makefile.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tap, pasta: Handle short writes to /dev/tap</title>
<updated>2023-11-10T15:51:33+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-11-08T03:17:54+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=5ec3634b07215337c2e69d88f9b1d74711897d7d'/>
<id>5ec3634b07215337c2e69d88f9b1d74711897d7d</id>
<content type='text'>
tap_send_frames_pasta() sends frames to the namespace by sending them to
our /dev/tap device.  If that write() returns an error, we already
handle it.  However we don't handle the case where the write() returns
short, meaning we haven't successfully transmitted the whole frame.

I don't know if this can ever happen with the kernel tap device, but we
should at least report the case so we don't get a cryptic failure.  For
the purposes of the return value for tap_send_frames_pasta() we treat this
case as though it was an error (on the grounds that a partial frame is no
use to the namespace).
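
As a minimal sketch of the distinction (write_frame() is a hypothetical
helper, not the actual passt function, and the messages are made up), a
write() result can be classified into error, short write and complete
write like this:

```c
/* <![CDATA[ guard: keeps this listing well-formed inside the XML feed */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

/* Write one frame to fd.  Returns 0 if the whole frame was written, -1 on
 * error or short write.  Both failures are reported on stderr.
 */
int write_frame(int fd, const void *frame, size_t len)
{
	ssize_t rc = write(fd, frame, len);

	if (rc < 0) {
		fprintf(stderr, "tap write: %s\n", strerror(errno));
		return -1;
	}

	if ((size_t)rc < len) {
		/* A partial frame is no use to the receiver: treat as error */
		fprintf(stderr, "short write on tap: %zd of %zu bytes\n",
			rc, len);
		return -1;
	}

	return 0;
}
/* ]]> */
```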

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
tap_send_frames_pasta() sends frames to the namespace by sending them to
our /dev/tap device.  If that write() returns an error, we already
handle it.  However we don't handle the case where the write() returns
short, meaning we haven't successfully transmitted the whole frame.

I don't know if this can ever happen with the kernel tap device, but we
should at least report the case so we don't get a cryptic failure.  For
the purposes of the return value for tap_send_frames_pasta() we treat this
case as though it was an error (on the grounds that a partial frame is no
use to the namespace).

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tap, pasta: Handle incomplete tap sends for pasta too</title>
<updated>2023-11-10T15:51:33+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-11-08T03:17:53+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=f0776eac07cfae76a51b5f55d5f95c8a5c62640f'/>
<id>f0776eac07cfae76a51b5f55d5f95c8a5c62640f</id>
<content type='text'>
Since a469fc39 ("tcp, tap: Don't increase tap-side sequence counter for
dropped frames") we've handled more gracefully the case where we get data
from the socket side, but are temporarily unable to send it all to the tap
side (e.g. due to full buffers).

That code relies on tap_send_frames() returning the number of frames it
successfully sent, which in turn gets it from tap_send_frames_passt() or
tap_send_frames_pasta().

While tap_send_frames_passt() has returned that information since b62ed9ca
("tap: Don't pcap frames that didn't get sent"), tap_send_frames_pasta()
always returns as though it successfully sent every frame.  However there
certainly are cases where it will return early without sending all frames.
Update it to report that properly, so that the calling functions can handle it
properly.
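
The kind of reporting this relies on can be sketched as follows.  It is
only an illustration: send_frames() and the use of one write() per struct
iovec are assumptions, not the real tap_send_frames_pasta().

```c
/* <![CDATA[ guard: keeps this listing well-formed inside the XML feed */
#include <stddef.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send up to n frames, one write() each.  Returns the number of frames
 * completely written; stops at the first error or short write.
 */
size_t send_frames(int fd, const struct iovec *iov, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		ssize_t rc = write(fd, iov[i].iov_base, iov[i].iov_len);

		if (rc < 0 || (size_t)rc < iov[i].iov_len)
			break;
	}

	return i;
}
/* ]]> */
```

A caller can compare the return value with n to learn how many frames were
left unsent, instead of assuming that all of them went out.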

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Since a469fc39 ("tcp, tap: Don't increase tap-side sequence counter for
dropped frames") we've handled more gracefully the case where we get data
from the socket side, but are temporarily unable to send it all to the tap
side (e.g. due to full buffers).

That code relies on tap_send_frames() returning the number of frames it
successfully sent, which in turn gets it from tap_send_frames_passt() or
tap_send_frames_pasta().

While tap_send_frames_passt() has returned that information since b62ed9ca
("tap: Don't pcap frames that didn't get sent"), tap_send_frames_pasta()
always returns as though it successfully sent every frame.  However there
certainly are cases where it will return early without sending all frames.
Update it to report that properly, so that the calling functions can handle it
properly.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Don't use TCP_WINDOW_CLAMP</title>
<updated>2023-11-10T05:42:19+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-11-09T09:54:00+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=cf3eeba6c0d7c7b33824b6e1c53f14dcb90437ae'/>
<id>cf3eeba6c0d7c7b33824b6e1c53f14dcb90437ae</id>
<content type='text'>
On the L2 tap side, we see TCP headers and know the TCP window that the
ultimate receiver is advertising.  In order to avoid unnecessary buffering
within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt
to advertise that window back to the original sock-side sender using
TCP_WINDOW_CLAMP.

However, TCP_WINDOW_CLAMP just doesn't work like this.  Prior to kernel
commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply
had no effect on established sockets.  After that commit, it does affect
established sockets but doesn't behave the way we need:
  * It appears to be designed only to shrink the window, not to allow it to
    re-expand.
  * More importantly, that commit has a serious bug where if the
    setsockopt() is made when the existing kernel advertised window for the
    socket happens to be zero, it will now become locked at zero, stopping
    any further data from being received on the socket.

Since this has never worked as intended, simply remove it.  It might be
possible to re-implement the intended behaviour by manipulating SO_RCVBUF,
so we leave a comment to that effect.
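
Purely as a speculative sketch of the SO_RCVBUF idea mentioned above, and
not something this commit implements: the receive buffer of a socket can be
capped as shown below.  limit_rcvbuf() is a hypothetical name, and whether
this would actually give the intended window behaviour is untested here.
Note that Linux doubles the requested value internally to allow for
bookkeeping overhead.

```c
/* <![CDATA[ guard: keeps this listing well-formed inside the XML feed */
#include <sys/socket.h>

/* Ask the kernel to cap the receive buffer of socket s at roughly the
 * given number of bytes.  Returns 0 on success, -1 with errno set.
 */
int limit_rcvbuf(int s, int bytes)
{
	return setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}
/* ]]> */
```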

This kernel bug is the underlying cause of both the linked passt bug and
the linked podman bug.  We attempted to fix this before with passt commit
d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag").
However while that commit masked the bug for some cases, it didn't really
address the problem.

Fixes: d3192f67c492 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag")
Link: https://github.com/containers/podman/issues/20170
Link: https://bugs.passt.top/show_bug.cgi?id=74
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
On the L2 tap side, we see TCP headers and know the TCP window that the
ultimate receiver is advertising.  In order to avoid unnecessary buffering
within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt
to advertise that window back to the original sock-side sender using
TCP_WINDOW_CLAMP.

However, TCP_WINDOW_CLAMP just doesn't work like this.  Prior to kernel
commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply
had no effect on established sockets.  After that commit, it does affect
established sockets but doesn't behave the way we need:
  * It appears to be designed only to shrink the window, not to allow it to
    re-expand.
  * More importantly, that commit has a serious bug where if the
    setsockopt() is made when the existing kernel advertised window for the
    socket happens to be zero, it will now become locked at zero, stopping
    any further data from being received on the socket.

Since this has never worked as intended, simply remove it.  It might be
possible to re-implement the intended behaviour by manipulating SO_RCVBUF,
so we leave a comment to that effect.

This kernel bug is the underlying cause of both the linked passt bug and
the linked podman bug.  We attempted to fix this before with passt commit
d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag").
However while that commit masked the bug for some cases, it didn't really
address the problem.

Fixes: d3192f67c492 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag")
Link: https://github.com/containers/podman/issues/20170
Link: https://bugs.passt.top/show_bug.cgi?id=74
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Rename and small cleanup to tcp_clamp_window()</title>
<updated>2023-11-10T05:42:10+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-11-09T09:53:59+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=930bc3b0f2d2078747172e960807c84a8cd97495'/>
<id>930bc3b0f2d2078747172e960807c84a8cd97495</id>
<content type='text'>
tcp_clamp_window() is _mostly_ about using TCP_WINDOW_CLAMP to control the
sock side advertised window, but it is also responsible for actually
updating the conn-&gt;wnd_from_tap value.

Rename to tcp_tap_window_update() to reflect that broader purpose, and pull
the logic that's not TCP_WINDOW_CLAMP related out to the front.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
tcp_clamp_window() is _mostly_ about using TCP_WINDOW_CLAMP to control the
sock side advertised window, but it is also responsible for actually
updating the conn-&gt;wnd_from_tap value.

Rename to tcp_tap_window_update() to reflect that broader purpose, and pull
the logic that's not TCP_WINDOW_CLAMP related out to the front.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
