passt, branch bug165c

bug165 debug 3

2026-01-09T02:45:24+00:00

Instrumentation and possible workaround for bug 165.

tcp: Update EPOLL_TYPE_TCP_TIMER fd

2025-12-23T14:25:14+00:00

For consistency with other epoll events, set the fd field to the file
descriptor actually added by epoll_ctl() (conn->timer), rather than
conn->sock.

This is a no-op change as ref.fd is not currently used in
tcp_timer_handler().

Signed-off-by: Laurent Vivier 
Reviewed-by: David Gibson 
Signed-off-by: Stefano Brivio

udp: Rename udp_sock_init() to udp_listen() with small cleanups

2025-12-23T14:25:11+00:00

Despite the name, this function is specifically for creating
"listening" sockets, not any others.  While we're there, remove a redundant
check for (s > FD_REF_MAX).  pif_sock_l4() already checks for this (and
must, in order to properly populate the epoll reference).

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Combine tcp_sock_init_one() and tcp_sock_init() into tcp_listen()

2025-12-23T14:24:56+00:00

Despite the name, these two functions are specifically for creating
listening sockets, not any others.  Recent changes mean that there's
always exactly one call to tcp_sock_init_one() per call to
tcp_sock_init().  So combine them into tcp_listen().

While we're there, remove a redundant check for (s > FD_REF_MAX).
pif_sock_l4() already checks for this (and must, in order to properly
populate the epoll reference).

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

pasta: Warn, disable matching IP version if not supported, in local mode

2025-12-23T14:10:02+00:00

...instead of exiting, but only if local mode is enabled, that is, if
we couldn't find a template interface or if the user didn't specify
one.

With IPv4, we always try to set or copy an address, so check if that
fails.

With IPv6, in local mode, we rely on the link-local address that's
automatically generated inside the target namespace, and only fail
later, as we try to set up routes. Check if that fails, instead.

Otherwise, we'll fail to start if IPv6 support is not built in or
disabled by the kernel ("ipv6.disable=1" on the command line),
because, in that case, we'll try to enable local mode by default, and
then fail to set any address or route.

It would probably be more elegant to check for IP version support in
conf_ip4_local() and conf_ip6_local(), and not even try to enable
connectivity for unsupported versions, but it looks less robust than
trying and failing, as there might be other ways to disable a given
IP version.

Note that there's currently no way to disable IPv4 support on the
kernel command line, that is, there's no such thing as an
ipv4.disable boot parameter. But I guess that's due to be eventually
implemented, one day, so let's cover that case as well, also for
consistency.

Reported-by: Iyan 
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2424192
Fixes: 4ddd59bc6085 ("conf: Separate local mode for each IP version, don't enable disabled IP version")
Signed-off-by: Stefano Brivio
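
The "try, then warn and disable" policy described above can be sketched as
follows. This is an illustrative outline only: enum setup_result and
handle_setup_rc() are hypothetical names, and the real code deals with
addresses and routes rather than a bare return code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcomes of trying to configure one IP version */
enum setup_result {
	SETUP_OK,		/* address or route was set */
	SETUP_VERSION_DISABLED,	/* local mode: warn and carry on without it */
	SETUP_FATAL		/* not local mode: caller should exit */
};

/* Decide what to do with the return code of an address or route setup */
enum setup_result handle_setup_rc(int rc, bool local_mode, const char *version)
{
	if (rc >= 0)
		return SETUP_OK;

	if (!local_mode)
		return SETUP_FATAL;

	fprintf(stderr, "Warning: can't configure %s, disabling it\n", version);
	return SETUP_VERSION_DISABLED;
}
```

With this shape, handle_setup_rc(-1, true, "IPv6") reports that the version
should be disabled, while the same failure outside local mode stays fatal.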

selinux: Enable read and watch permissions on netns directory as well

2025-12-23T00:59:34+00:00

With commit 7aeda16a7818 ("selinux: Transition to pasta_t in
containers"), we need to make sure that pasta can access the target
namespace directory passed by Podman, and, in a general case, we have
all the permissions we need.

But if we now start a container without the Podman changes referenced
by commit fd1bcc30af07 ("selinux: add container_var_run_t type
transition"), or with them, but with the container being created
before those and without a reboot in between, we'll additionally need
'read' and 'watch' permissions on user_tmp_t directory as well, as
user_tmp_t is still the (inconsistent) context of the namespace entry.

Otherwise, on a container start/restart, we'll get SELinux denials:

  type=AVC msg=audit(1766451401.296:184): avc:  denied  { read } for  pid=2159 comm="pasta.avx2" name="netns" dev="tmpfs" ino=60 scontext=unconfined_u:unconfined_r:pasta_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=dir permissive=1
  type=AVC msg=audit(1766451401.298:185): avc:  denied  { watch } for  pid=2159 comm="pasta.avx2" path="/run/user/1001/netns" dev="tmpfs" ino=60 scontext=unconfined_u:unconfined_r:pasta_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=dir permissive=1

This can be reproduced quite simply:

  $ podman create -q --name hello hello
  6c4eaf15a03edf799673a97d84d0331f3a3f34a11015b58c69318101a3232770

  [upgrade passt's SELinux policy to a version including 7aeda16a7818]

  $ podman start hello
  Error: unable to start container "6c4eaf15a03edf799673a97d84d0331f3a3f34a11015b58c69318101a3232770": pasta failed with exit code 1:
  netns dir open: Permission denied, exiting

Reported-by: Tuomo Soini 
Fixes: 7aeda16a7818 ("selinux: Transition to pasta_t in containers")
Signed-off-by: Stefano Brivio

tcp: Use less-than-MSS window on no queued data, or no data sent recently

2025-12-15T07:11:54+00:00

We limit the advertised window to guests and containers to the
available length of the sending buffer, and if it's less than the MSS,
since commit cf1925fb7b77 ("tcp: Don't limit window to less-than-MSS
values, use zero instead"), we approximate that limit to zero.

This way, we'll trigger a window update as soon as we realise that we
can advertise a larger value, just like we do in all other cases where
we advertise a zero-sized window.

By doing that, we don't wait for the peer to send us data before we
update the window. This matters because the guest or container might
be trying to aggregate more data and won't send us anything at all if
the advertised window is too small.

However, this might be problematic in two situations:

1. one, reported by Tyler, where the remote (receiving) peer
   advertises a window that's smaller than what we usually get and
   very close to the MSS, causing the kernel to give us a starting
   size of the buffer that's less than the MSS we advertise to the
   guest or container.

   If this happens, we'll never advertise a non-zero window after
   the handshake, and the container or guest will never send us any
   data at all.

   With a simple 'curl https://cloudflare.com/', we get, with default
   TCP memory parameters, a 65535-byte window from the peer, and 46080
   bytes of initial sending buffer from the kernel. But we advertised
   a 65480-byte MSS, and we'll never actually receive the client
   request.

   This seems to be specific to Cloudflare for some reason, probably
   deriving from a particular tuning of TCP parameters on their
   servers.

2. another one, hypothesised by David, where the peer might only be
   willing to process (and acknowledge) data in batches.

   We might have queued outbound data which is, at the same time, not
   enough to fill one of these batches and be acknowledged and removed
   from the sending queue, but enough to make our available buffer
   smaller than the MSS, and the connection will hang.

Take care of both cases by:

a. not approximating the sending buffer to zero if we have no outbound
   queued data at all, because in that case we don't expect the
   available buffer to increase if we don't send any data, so there's
   no point in waiting for it to grow larger than the MSS.

   This fixes problem 1. above.

b. also using the full sending buffer size if we haven't sent data to
   the socket for a while (reported by tcpi_last_data_sent). This part
   was already suggested by David in:

     https://archives.passt.top/passt-dev/aTZzgtcKWLb28zrf@zatzit/

   and I'm now picking ten times the RTT as a somewhat arbitrary
   threshold.

   This is meant to take care of potential problem 2. above, but it
   also happens to fix 1.

Reported-by: Tyler Cloud 
Link: https://bugs.passt.top/show_bug.cgi?id=183
Suggested-by: David Gibson 
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson
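
The decision described in a. and b. above can be sketched as a pure function.
This is a minimal sketch under stated assumptions, not the actual patch:
window_limit() and its parameter names are hypothetical, and it assumes the
time since the last send is given in milliseconds (as tcpi_last_data_sent is)
and the RTT in microseconds (as tcpi_rtt is).

```c
#include <stdint.h>

/*
 * Hypothetical helper: limit on the window we advertise, given the
 * available sending buffer. All parameter names are illustrative.
 */
uint32_t window_limit(uint32_t sndbuf_avail, uint32_t mss,
		      uint32_t queued_bytes, uint32_t last_data_sent_ms,
		      uint32_t rtt_us)
{
	uint64_t idle_us = (uint64_t)last_data_sent_ms * 1000;

	/* Enough room for a full segment: no approximation needed */
	if (sndbuf_avail >= mss)
		return sndbuf_avail;

	/* a. nothing queued: the buffer won't grow by waiting */
	if (!queued_bytes)
		return sndbuf_avail;

	/* b. nothing sent for more than ten times the RTT */
	if (idle_us > 10 * (uint64_t)rtt_us)
		return sndbuf_avail;

	/* Otherwise, approximate to zero and rely on a later window update */
	return 0;
}
```

With the numbers from problem 1. (46080 bytes available, a 65480-byte MSS,
nothing queued), this returns 46080 instead of zero, so the guest or
container can send its request.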

conf, fwd: Move initialisation of auto port scanning out of conf()

2025-12-12T21:38:56+00:00

We call fwd_scan_ports_init() at (almost) the end of conf().  It's a bit
odd to do actual work from a function that's ostensibly about getting our
configuration.  It's not the only instance of this, but to make things a
bit clearer move the call to main(), right after flow_init().

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tcp: Remove extra space from TCP_INFO debug messages (trivial)

2025-12-12T21:38:53+00:00

Debug messages about which tcp_info fields are supported contained an
extra space, always ending with "  supported".

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

pasta: Clean up waiting pasta child on failures

2025-12-12T21:23:14+00:00

When pasta is invoked with a command rather than an existing namespace to
attach to, it spawns a child process to run a shell or other command.  We
create that process during conf(), since we need the namespace to exist for
much of our setup.  However, we don't want the specified command to run
until the pasta network interface is ready for use.  Therefore,
pasta_spawn_cmd() executing in the child waits before exec()ing.  main()
signals the child to continue with SIGUSR1 shortly before entering the
main forwarding loop.

This has the downside that if we exit due to any kind of failure between
conf() and the SIGUSR1, the child process will be around waiting
indefinitely.  The user must manually clean this up.

Make this cleaner, by having the child use PR_SET_PDEATHSIG to have
itself killed if the parent dies during this window.  Technically
speaking this is racy: if the parent dies before the child can call
the prctl() it will be left zombie-like as before.  However, as long
as the parent completes pasta_wait_for_ns() before dying, I wasn't
able to trigger the race.  Since the consequences of this going wrong
are merely a bit ugly, I think that's good enough.

Suggested-by: Paul Holzinger 
Signed-off-by: David Gibson 
Reviewed-by: Paul Holzinger 
Signed-off-by: Stefano Brivio
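
As an illustration of the mechanism, a child could arm the parent-death
signal roughly as below. This is a sketch, not the code from the commit:
die_with_parent() is a hypothetical name, the signal to use is left to the
caller, and the getppid() check is an extra step, not mentioned in the
message, that narrows the race window described above without closing it.

```c
#include <signal.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/types.h>

/*
 * Hypothetical helper, to be called in the child right after fork().
 * 'parent' is the parent's PID as recorded before forking. Returns 0 if
 * the signal is armed and the parent is still ours, -1 otherwise.
 */
int die_with_parent(pid_t parent, int sig)
{
	/* Ask the kernel to send us 'sig' when our parent exits */
	if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig) < 0)
		return -1;

	/* If the parent died before prctl(), we were already reparented */
	if (getppid() != parent)
		return -1;

	return 0;
}
```

A child would typically call something like die_with_parent(parent_pid,
SIGKILL) before it starts waiting for SIGUSR1, and exit on failure. SIGKILL
here is only an example value.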