passt, branch 2024_02_19.ff22a78

pasta: Don't try to watch namespaces in procfs with inotify, use timer instead

2024-02-19T18:58:50+00:00

We watch network namespace entries to detect when we should quit
(unless --no-netns-quit is passed), and these might stored in a tmpfs
typically mounted at /run/user/UID or /var/run/user/UID, or found in
procfs at /proc/PID/ns/.

Currently, we try to use inotify for any possible location of those
entries, but inotify, of course, doesn't work on pseudo-filesystems
(see inotify(7)).

The man page reflects this: the description of --no-netns-quit
implies that we won't quit anyway if the namespace is not "bound to
the filesystem".

Well, we won't quit, but, since commit 9e0dbc894813 ("More
deterministic detection of whether argument is a PID, PATH or NAME"),
we try. And, indeed, this is harmless, as the caveat from that
commit message states.

Now, it turns out that Buildah, a tool to create container images,
sharing its codebase with Podman, passes a procfs entry to pasta, and
expects pasta to exit once the network namespace is not needed
anymore, that is, once the original container process, also spawned
by Buildah, terminates.

Get this to work by using the timer fallback mechanism if the
namespace name is passed as a path belonging to a pseudo-filesystem.
This is expected to be procfs, but I covered sysfs and devpts
pseudo-filesystems as well, because nothing actually prevents
creating this kind of directory structure and links there.

Note that fstatfs(), according to some versions of man pages, was
apparently "deprecated" by the LSB. My reasoning for using it is
essentially this:
  https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u

...that is, there was no such thing as an LSB deprecation, and
anyway there's no other way to get the filesystem type.

Also note that, while it might sound more obvious to detect the
filesystem type using fstatfs() on the file descriptor itself
(c->pasta_netns_fd), the reported filesystem type for it is nsfs, no
matter what path was given to pasta. If we use the parent directory,
we'll typically have either tmpfs or procfs reported.

If the target namespace is given as a PID, or as a PID-based procfs
entry, we don't risk races if this PID is recycled: our handle on
/proc/PID/ns will always refer to the original namespace associated
with that PID, and we don't re-open this entry from procfs to check
it.

There's, however, a remaining race possibility if the parent process
is not the one associated to the network namespace we operate on: in
that case, the parent might pass a procfs entry associated to a PID
that was recycled by the time we parse it. This can't happen if the
namespace PID matches the one of the parent, because we detach from
the controlling terminal after parsing the namespace reference.

To avoid this type of race, if desired, we could add the option for
the parent to pass a PID file descriptor, that the parent obtained
via pidfd_open(). This is beyond the scope of this change.

Update the man page to reflect that, even if the target network
namespace is passed as a procfs path or a PID, we'll now quit when
the procfs entry is gone.

Reported-by: Paul Holzinger 
Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214
Signed-off-by: Stefano Brivio

selinux: Allow pasta to remount procfs

2024-02-16T08:43:12+00:00

Partially equivalent to commit abf5ef6c22d2 ("apparmor: Allow pasta
to remount /proc, access entries under its own copy"): we should
allow pasta to remount /proc. It still works otherwise, but further
UID remapping in nested user namespaces (e.g. pasta in pasta) won't.

Reported-by: Laurent Jacquot 
Link: https://bugs.passt.top/show_bug.cgi?id=79#c3
Signed-off-by: Stefano Brivio

conf: No routable interface for IPv4 or IPv6 is informational, not a warning

2024-02-16T07:47:14+00:00

...Podman users might get confused by the fact that if we can't
find a default route for a given IP version, we'll report that as a
warning message and possibly just before actual error messages.

However, a lack of routable interface for IPv4 or IPv6 can be a
normal circumstance: don't warn about it, just state that as
informational message, if those are displayed (they're not in
non-error paths in Podman, for example).

While at it, make it clear that we're disabling IPv4 or IPv6 if
there's no routable interface for the corresponding IP version.

Reported-by: Paul Holzinger 
Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642
Signed-off-by: Stefano Brivio

pasta: Add fallback timer mechanism to check if namespace is gone

2024-02-16T07:47:14+00:00

We don't know how frequently this happens, but hitting
fs.inotify.max_user_watches or similar sysctl limits is definitely
not out of question, and Paul mentioned that, for example, Podman's
CI environments hit similar issues in the past.

Introduce a fallback mechanism based on a timer file descriptor: we
grab the directory handle at startup, and we can then use openat(),
triggered periodically, to check if the (network) namespace directory
still exists. If openat() fails at some point, exit.

Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707
Reported-by: Paul Holzinger 
Signed-off-by: Stefano Brivio

conf, passt.1: Exit if we can't bind a forwarded port, except for -[tu] all

2024-02-16T07:47:14+00:00

...or similar, that is, if only excluded ranges are given (implying
we'll forward any other available port). In that case, we'll usually
forward large sets of ports, and it might be inconvenient for the
user to skip excluding single ports that are already taken.

The existing behaviour, that is, exiting only if we fail to bind all
the ports for one given forwarding option, turns out to be
problematic for several aspects raised by Paul:

- Podman merges ranges anyway, so we might fail to bind all the ports
  from a specific range given by the user, but we'll not fail anyway
  because Podman merges it with another one where we succeed to bind
  at least one port. At the same time, there should be no semantic
  difference between multiple ranges given by a single option and
  multiple ranges given as multiple options: it's unexpected and
  not documented

- the user might actually rely on a given port to be forwarded to a
  given container or a virtual machine, and if connections are
  forwarded to an unrelated process, this might raise security
  concerns

- given that we can try and fail to bind multiple ports before
  exiting (in case we can't bind any), we don't have a specific error
  code we can return to the user, so we don't give the user helpful
  indication as to why we couldn't bind ports.

Exit as soon as we fail to create or bind a socket for a given
forwarded port, and report the actual error.

Keep the current behaviour, however, in case the user wants to
forward all the (available) ports for a given protocol, or all the
ports with excluded ranges only. There, it's more reasonable that
the user is expecting partial failures, and it's probably convenient
that we continue with the ports we could forward.

Update the manual page to reflect the new behaviour, and the old
behaviour too in the cases where we keep it.

Suggested-by: Paul Holzinger 
Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642
Signed-off-by: Stefano Brivio 
Tested-by: Paul Holzinger 
Reviewed-by: David Gibson

udp: udp_sock_init_ns() partially duplicats udp_port_rebind_outbound()

2024-02-14T02:24:23+00:00

Usually automatically forwarded UDP outbound ports are set up by
udp_port_rebind_outbound() called from udp_timer().  However, the very
first time they're created and bound is by udp_sock_init_ns() called from
udp_init().  udp_sock_init_ns() is essentially an unnecessary cut down
version of udp_port_rebind_outbound(), so we can jusat remove it.

Doing so does require moving udp_init() below udp_port_rebind_outbound()'s
definition.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

udp: Don't prematurely (and incorrectly) set up automatic inbound forwards

2024-02-14T02:24:01+00:00

For automated inbound port forwarding in pasta mode we scan bound ports
within the guest namespace via /proc and bind matching ports on the host to
listen for packets.  For UDP this is usually handled by udp_timer() which
calls port_fwd_scan_udp() followed by udp_port_rebind().  However there's
one initial scan before the the UDP timer is started: we call
port_fwd_scan_udp() from port_fwd_init(), and actually bind the resulting
ports in udp_sock_init_init() called from udp_init().

Unfortunately, the version in udp_sock_init_init() isn't correct.  It
unconditionally opens a new socket for every forwarded port, even if a
socket has already been explicit created with the -u option.  If the
explicitly forwarded ports have particular configuration (such as a
specific bound address address, or one implied by the -o option) those will
not be replicated in the new socket.  We essentially leak the original
correctly configured socket, replacing it with one which might not be
right.

We could make udp_sock_init_init() use udp_port_rebind() to get that right,
but there's actually no point doing so:
 * The initial bind was introduced by ccf6d2a7b48d ("udp: Actually bind
   detected namespace ports in init namespace") at which time we didn't
   periodically scan for bound UDP ports.  Periodic scanning was introduced
   in 457ff122e ("udp,pasta: Periodically scan for ports to automatically
   forward") making the bind from udp_init() redundant.
 * At the time of udp_init(), programs in the guest namespace are likely
   not to have started yet (unless attaching a pre-existing namespace) so
   there's likely not anything to scan for anyway.

So, simply remove the initial, broken socket create/bind, allowing
automatic port forwards to be created the first time udp_timer() runs.

Reported-by: Laurent Jacquot 
Suggested-by: Laurent Jacquot 
Link: https://bugs.passt.top/show_bug.cgi?id=79
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

netlink: Use const rtnh pointer

2024-02-14T00:10:47+00:00

6c7623d07 ("netlink: Add support to fetch default gateway from multipath
routes") inadvertently introduced a new cppcheck warning for a variable
which could be a const pointer but isn't.  This occurs with
cppcheck-2.13.0-1.fc39.x86_64 in Fedora 39 at least.

Fixes: 6c7623d07bbd ("netlink: Add support to fetch default gateway from multipath routes")
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

log: setlogmask(0) can actually result in a system call, don't use it

2024-02-14T00:10:11+00:00

Before commit 32d07f5e59f2 ("passt, pasta: Completely avoid dynamic
memory allocation"), we didn't store the current log mask in a
variable, and we fetched it using setlogmask(0) wherever needed.

But after that commit, we can use our log_mask copy instead. And we
should: with recent glibc versions, setlogmask(0) actually results in
a system call, which causes a substantial overhead with high transfer
rates: we use setlogmask(0) even to decide we don't want to print
debug messages.

Now that we rely on log_mask in early stages, before setlogmask() is
called, we need to initialise that variable to the special LOG_EMERG
mask value right away: define LOG_EARLY to make this clearer, and,
while at it, group conditions in vlogmsg() into something more terse.

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

tcp: Fix subtle bug in fast re-transmit path

2024-02-11T17:30:01+00:00

When a duplicate ack from the tap side triggers a fast re-transmit, we set
both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of
the duplicate ack.  Setting seq_to_tap is correct: this is what triggers
the retransmit from this point onwards.  Setting seq_ack_from_tap is
not correct, though.

In most cases setting seq_ack_from_tap will be redundant but harmless:
it will have already been updated to the same value by
tcp_update_seqack_from_tap() a few lines above.  However that call can
be skipped if tcp_sock_consume() fails, which is rare but possible.  In
that case this update will cause problems.

We use seq_ack_from_tap to track two logically distinct things: how much of
the stream has been acked by the guest, and how much of the stream from the
socket has been read and discarded (as opposed to MSG_PEEKed).  We attempt
to keep those values the same, because we discard data exactly when it is
acked by the guest.  However tcp_sock_consume() failing means we weren't
able to disard the acked data.  To handle that case, we skip the usual
update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get
one which supersedes it soon enough.  Setting seq_ack_from_tap in the
fast retransmit path, however, means we now really will have the
read/discard point in the stream out of sync with seq_ack_from_tap.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio