passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	pasta, util: Align stack area for clones to maximum natural alignment	Stefano Brivio	2024-04-19	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Given that we use this stack pointer as a location to store arbitrary data types from the cloned process, we need to guarantee that its alignment matches any of those possible data types. runsisi reports that pasta gets a SIGBUS in pasta_open_ns() on aarch64, where the alignment requirement for stack pointers is a 16 bytes (same as the size of a long double), and similar requirements actually apply to most architectures we run on. Reported-by: runsisi <runsisi@hust.edu.cn> Link: https://bugs.passt.top/show_bug.cgi?id=85 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Compilers' name for armv6l and armv7l is "arm"	Stefano Brivio	2024-04-11	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When I switched from 'uname -m' to 'gcc -dumpmachine' to fetch the architecture name for, among others, seccomp.sh, I didn't realise that "armv6l" and "armv7l" are just Linux kernel names -- compilers just call that "arm". Fix the "syscalls" annotation we use to define seccomp profiles accordingly, otherwise pasta will be terminated on sigreturn() on armv6l and armv7l. Fixes: 213c397492bd ("passt, pasta: Run-time selection of AVX2 build") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Don't try to watch namespaces in procfs with inotify, use timer instead2024_02_19.ff22a78	Stefano Brivio	2024-02-19	1	-5/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We watch network namespace entries to detect when we should quit (unless --no-netns-quit is passed), and these might stored in a tmpfs typically mounted at /run/user/UID or /var/run/user/UID, or found in procfs at /proc/PID/ns/. Currently, we try to use inotify for any possible location of those entries, but inotify, of course, doesn't work on pseudo-filesystems (see inotify(7)). The man page reflects this: the description of --no-netns-quit implies that we won't quit anyway if the namespace is not "bound to the filesystem". Well, we won't quit, but, since commit 9e0dbc894813 ("More deterministic detection of whether argument is a PID, PATH or NAME"), we try. And, indeed, this is harmless, as the caveat from that commit message states. Now, it turns out that Buildah, a tool to create container images, sharing its codebase with Podman, passes a procfs entry to pasta, and expects pasta to exit once the network namespace is not needed anymore, that is, once the original container process, also spawned by Buildah, terminates. Get this to work by using the timer fallback mechanism if the namespace name is passed as a path belonging to a pseudo-filesystem. This is expected to be procfs, but I covered sysfs and devpts pseudo-filesystems as well, because nothing actually prevents creating this kind of directory structure and links there. Note that fstatfs(), according to some versions of man pages, was apparently "deprecated" by the LSB. My reasoning for using it is essentially this: https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u ...that is, there was no such thing as an LSB deprecation, and anyway there's no other way to get the filesystem type. Also note that, while it might sound more obvious to detect the filesystem type using fstatfs() on the file descriptor itself (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no matter what path was given to pasta. If we use the parent directory, we'll typically have either tmpfs or procfs reported. If the target namespace is given as a PID, or as a PID-based procfs entry, we don't risk races if this PID is recycled: our handle on /proc/PID/ns will always refer to the original namespace associated with that PID, and we don't re-open this entry from procfs to check it. There's, however, a remaining race possibility if the parent process is not the one associated to the network namespace we operate on: in that case, the parent might pass a procfs entry associated to a PID that was recycled by the time we parse it. This can't happen if the namespace PID matches the one of the parent, because we detach from the controlling terminal after parsing the namespace reference. To avoid this type of race, if desired, we could add the option for the parent to pass a PID file descriptor, that the parent obtained via pidfd_open(). This is beyond the scope of this change. Update the man page to reflect that, even if the target network namespace is passed as a procfs path or a PID, we'll now quit when the procfs entry is gone. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Add fallback timer mechanism to check if namespace is gone	Stefano Brivio	2024-02-16	1	-21/+80
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We don't know how frequently this happens, but hitting fs.inotify.max_user_watches or similar sysctl limits is definitely not out of question, and Paul mentioned that, for example, Podman's CI environments hit similar issues in the past. Introduce a fallback mechanism based on a timer file descriptor: we grab the directory handle at startup, and we can then use openat(), triggered periodically, to check if the (network) namespace directory still exists. If openat() fails at some point, exit. Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707 Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Enable format warnings	David Gibson	2023-11-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	logmsg() takes printf like arguments, but because it's not a built in, the compiler won't generate warnings if the format string and parameters don't match. Enable those by using the format attribute. Strictly speaking this is a gcc extension, but I believe it is also supported by some other common compilers. We already use some other attributes in various places. For now, just use it and we can worry about compilers that don't support it if it comes up. This exposes some warnings from existing callers, both in gcc and in clang-tidy: - Some are straight out bugs, which we correct - It's occasionally useful to invoke the logging functions with an empty string, which gcc objects to, so disable that specific warning in the Makefile - Strictly speaking the C standard requires that the parameter for a %p be a (void *), not some other pointer type. That's only likely to cause problems in practice on weird architectures with different sized representations for pointers to different types. Nonetheless add the casts to make it happy. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Make many pointers const	David Gibson	2023-10-04	1	-2/+2
\| \| \| \| \| \| \| \| \|	Newer versions of cppcheck (as of 2.12.0, at least) added a warning for pointers which could be declared to point at const data, but aren't. Based on that, make many pointers throughout the codebase const. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Don't pre-fill IPv4 destination address in headers	David Gibson	2023-08-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because packets sent on the tap interface will always be going to the guest/namespace, we more-or-less know what address they'll be going to. So we pre-fill this destination address in our header buffers for IPv4. We can't do the same for IPv6 because we could need either the global or link-local address for the guest. In future we're going to want more flexibility for the destination address, so this pre-filling will get in the way. Change the flow so we always fill in the IPv4 destination address for each packet, rather than prefilling it from proto_update_l2_buf(). In fact for TCP we already redundantly filled the destination for each packet anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Always use epoll_ref for the epoll data variable	David Gibson	2023-08-13	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	epoll_ref contains a variety of information useful when handling epoll events on our sockets, and we place it in the epoll_event data field returned by epoll. However, for a few other things we use the 'fd' field in the standard union of types for that data field. This actually introduces a bug which is vanishingly unlikely to hit in practice, but very nasty if it ever did: theoretically if we had a very large file descriptor number for fd_tap or fd_tap_listen it could overflow into bits that overlap with the 'proto' field in epoll_ref. With some very bad luck this could mean that we mistakenly think an event on a regular socket is an event on fd_tap or fd_tap_listen. More practically, using different (but overlapping) fields of the epoll_data means we can't unify dispatch for the various different objects in the epoll. Therefore use the same epoll_ref as the data for the tap fds and the netns quit fd, adding new fd type values to describe them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Propagate errors for "dup" operations	David Gibson	2023-08-04	1	-10/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We now detect errors on netlink "set" operations while configuring the pasta namespace with --config-net. However in many cases rather than a simple "set" we use a more complex "dup" function to copy configuration from the host to the namespace. We're not yet properly detecting and reporting netlink errors for that case. Change the "dup" operations to propagate netlink errors to their caller, pasta_ns_conf() and report them there. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Propagate errors for "set" operations	David Gibson	2023-08-04	1	-10/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if anything goes wrong while we're configuring the namespace network with --config-net, we'll just ignore it and carry on. This might lead to a silently unconfigured or misconfigured namespace environment. For simple "set" operations based on nl_do() we can now detect failures reported via netlink. Propagate those errors up to pasta_ns_conf() and report them usefully. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Explicitly pass netlink sockets to operations	David Gibson	2023-08-04	1	-19/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	All the netlink operations currently implicitly use one of the two global netlink sockets, sometimes depending on an 'ns' parameter. Change them all to explicitly take the socket to use (or two sockets to use in the case of the *_dup() functions). As well as making these functions strictly more general, it makes the callers easier to follow because we're passing a socket variable with a name rather than an unexplained '0' or '1' for the ns parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_route() into separate operation functions	David Gibson	2023-08-04	1	-6/+10
\| \| \| \| \| \| \| \| \| \|	nl_route() can perform 3 quite different operations based on the 'op' parameter. Split this into separate functions for each one. This requires more lines of code, but makes the internal logic of each operation much easier to follow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_addr() into separate operation functions	David Gibson	2023-08-04	1	-6/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_addr() can perform three quite different operations based on the 'op' parameter, each of which uses a different subset of the parameters. Split them up into a function for each operation. This does use more lines of code, but the overlap wasn't that great, and the separated logic is much easier to follow. It's also clearer in the callers what we expect the netlink operations to do, and what information it uses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting fixes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split up functionality of nl_link()	David Gibson	2023-08-04	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_link() performs a number of functions: it can bring links up, set MAC address and MTU and also retrieve the existing MAC. This makes for a small number of lines of code, but high conceptual complexity: it's quite hard to follow what's going on both in nl_link() itself and it's also not very obvious which function its callers are intending to use. Clarify this, by splitting nl_link() into nl_link_up(), nl_link_set_mac(), and nl_link_get_mac(). The first brings up a link, optionally setting the MTU, the others get or set the MAC address. This fixes an arguable bug in pasta_ns_conf(): it looks as though that was intended to retrieve the guest MAC whether or not c->pasta_conf_ns is set. However, it only actually does so in the !c->pasta_conf_ns case: the fact that we set up==1 means we would only ever set, never get, the MAC in the nl_link() call in the other path. We get away with this because the MAC will quickly be discovered once we receive packets on the tap interface. Still, it's neater to always get the MAC address here. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: include errno in error message2023_06_25.32660ce	Paul Holzinger	2023-06-25	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \|	When the open() or setns() calls fails pasta exits early and prints an error. However it did not include the errno so it was impossible to know why the syscall failed. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Split print to fit 80 columns in pasta_open_ns()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, pasta: With --config-net, copy all addresses by default	Stefano Brivio	2023-05-23	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the newly-introduced NL_DUP mode for nl_addr() to copy all the addresses associated to the template interface in the outer namespace, unless --no-copy-addrs (also implied by -a) is given. This option is introduced as deprecated right away: it's not expected to be of any use, but it's helpful to keep it around for a while to debug any suspected issue with this change. This is done mostly for consistency with routes. It might partially cover the issue at: https://bugs.passt.top/show_bug.cgi?id=47 Support multiple addresses per address family for some use cases, but not the originally intended one: we'll still use a single outbound address (unless the routing table specifies different preferred source addresses depending on the destination), regardless of the address used in the target namespace. Link: https://bugs.passt.top/show_bug.cgi?id=47 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Add functionality to copy addresses from outer namespace	Stefano Brivio	2023-05-23	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Similarly to what we've just done with routes, support NL_DUP for addresses (currently not exposed): nl_addr() can optionally copy mulitple addresses to the target namespace, by fixing up data from the dump with appropriate flags and interface index, and repeating it back to the kernel on the socket opened in the target namespace. Link-local addresses are not copied: the family is set to AF_UNSPEC, which means the kernel will ignore them. Same for addresses from a mismatching address (pre-4.19 kernels without support for NETLINK_GET_STRICT_CHK). Ignore IFA_LABEL attributes by changing their type to IFA_UNSPEC, because in general they will report mismatching names, and we don't really need to use labels as we already know the interface index. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, pasta: With --config-net, copy all routes by default	Stefano Brivio	2023-05-23	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the newly-introduced NL_DUP mode for nl_route() to copy all the routes associated to the template interface in the outer namespace, unless --no-copy-routes (also implied by -g) is given. This option is introduced as deprecated right away: it's not expected to be of any use, but it's helpful to keep it around for a while to debug any suspected issue with this change. Otherwise, we can't use default gateways which are not, address-wise, on the same subnet as the container, as reported by Callum. Reported-by: Callum Parsey <callum@neoninteger.au> Link: https://github.com/containers/podman/issues/18539 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Add functionality to copy routes from outer namespace	Stefano Brivio	2023-05-23	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of just fetching the default gateway and configuring a single equivalent route in the target namespace, on 'pasta --config-net', it might be desirable in some cases to copy the whole set of routes corresponding to a given output interface. For instance, in: https://github.com/containers/podman/issues/18539 IPv4 Default Route Does Not Propagate to Pasta Containers on Hetzner VPSes configuring the default gateway won't work without a gateway-less route (specifying the output interface only), because the default gateway is, somewhat dubiously, not on the same subnet as the container. This is a similar case to the one covered by commit 7656a6f88882 ("conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway"), and I'm not exactly proud of that workaround. We also have: https://bugs.passt.top/show_bug.cgi?id=49 pasta does not work with tap-style interface for which, eventually, we should be able to configure a gateway-less route in the target namespace. Introduce different operation modes for nl_route(), including a new NL_DUP one, not exposed yet, which simply parrots back to the kernel the route dump for a given interface from the outer namespace, fixing up flags and interface indices on the way, and requesting to add the same routes in the target namespace, on the interface we manage. For n routes we want to duplicate, send n identical netlink requests including the full dump: routes might depend on each other and the kernel processes RTM_NEWROUTE messages sequentially, not atomically, and repeating the full dump naturally resolves dependencies without the need to actually calculate them. I'm not kidding, it actually works pretty well. Link: https://github.com/containers/podman/issues/18539 Link: https://bugs.passt.top/show_bug.cgi?id=49 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	pasta: Improve error handling on failure to join network namespace	Stefano Brivio	2023-05-23	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pasta_wait_for_ns(), open() failing with ENOENT is expected: we're busy-looping until the network namespace appears. But any other failure is not something we're going to recover from: return right away if we don't get either success or ENOENT. Now that pasta_wait_for_ns() can actually fail, handle that in pasta_start_ns() by reporting the issue and exiting. Looping on EPERM, when pasta doesn't actually have the permissions to join a given namespace, isn't exactly a productive thing to do. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Detach mount namespace, (re)mount procfs before spawning command	Stefano Brivio	2023-05-23	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we want /proc contents to be consistent after pasta spawns a child process in a new PID namespace (only for operation without a pre-existing namespace), we need to mount /proc after the clone(2) call with CLONE_NEWPID, and we enable the child to do that by passing, in the same call, the CLONE_NEWNS flag, as described by pid_namespaces(7). This is not really a remount: in fact, passing MS_REMOUNT to mount(2) would make the call fail. We're in another mount namespace now, so it's a fresh mount that has the effect of hiding the existing one. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	passt: Relicense to GPL 2.0, or any later version	Stefano Brivio	2023-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	convert all remaining err() followed by exit() to die()	Laine Stump	2023-02-16	1	-13/+7
\| \| \| \| \| \| \| \|	This actually leaves us with 0 uses of err(), but someone could want to use it in the future, so we may as well leave it around. Signed-off-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: propagate exit code from child command	Paul Holzinger	2023-02-12	1	-2/+10
\| \| \| \| \| \| \| \| \| \| \| \|	Exits codes are very useful for scripts, when the pasta child execvp() call fails with ENOENT that parent should also exit with > 0. In short the parent should always exit with the code from the child to make it useful in scripts. It is easy to test with: `pasta -- bash -c "exit 3"; echo $?` Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: correctly exit when execvp() fails	Paul Holzinger	2023-02-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	By default clone() will create a child that does not send SIGCHLD when the child exits. The caller has to specifiy the SIGNAL it should get in the flag bitmask. see clone(2) under "The child termination signal" This fixes the problem where pasta would not exit when the execvp() call failed, i.e. when the command does not exists. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Wait for tap to be set up before spawning command	Stefano Brivio	2023-02-12	1	-1/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adapted from a patch by Paul Holzinger: when pasta spawns a command, operating without a pre-existing user and network namespace, it needs to wait for the tap device to be configured and its handler ready, before the command is actually executed. Otherwise, something like: pasta --config-net nslookup passt.top usually fails as the nslookup command is issued before the network interface is ready. We can't adopt a simpler approach based on SIGSTOP and SIGCONT here: the child runs in a separate PID namespace, so it can't send SIGSTOP to itself as the kernel sees the child as init process and blocks the delivery of the signal. We could send SIGSTOP from the parent, but this wouldn't avoid the possible condition where the child isn't ready to wait for it when the parent sends it, also raised by Paul -- and SIGSTOP can't be blocked, so it can never be pending. Use SIGUSR1 instead: mask it before clone(), so that the child starts with it blocked, and can safely wait for it. Once the parent is ready, it sends SIGUSR1 to the child. If SIGUSR1 is sent before the child is waiting for it, the kernel will queue it for us, because it's blocked. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 1392bc5ca002 ("Allow pasta to take a command to execute") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util, pasta: Add do_clone() wrapper around __clone2() and clone()	Stefano Brivio	2022-11-16	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Spotted in Debian's buildd logs: on ia64, clone(2) is not available: the glibc wrapper is named __clone2() and it takes, additionally, the size of the stack area passed by the caller. Add a do_clone() wrapper handling the different cases, and also taking care of pointing the child's stack in the middle of the allocated area: on PA-RISC (hppa), handled by clone(), the stack grows up, and on ia64 the stack grows down, but the register backing store grows up -- and I think it might be actually used here. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Minor improvements to IPv4 netmask handling	David Gibson	2022-11-04	1	-5/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are several minor problems with our parsing of IPv4 netmasks (-n). First, we don't reject nonsensical netmasks like 0.255.0.255. Address this structurally by using prefix length instead of netmask as the primary variable, only converting (and validating) when we need to. This has the added benefit of making some things more uniform with the IPv6 path. Second, when the user specifies a prefix length, we truncate the output from strtol() to an integer, which means we would treat -n 4294967320 as valid (equivalent to 24). Fix types to check for this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Rename pasta_setup_ns() to pasta_spawn_cmd()	David Gibson	2022-10-15	1	-9/+9
\| \| \| \| \| \| \| \| \| \|	pasta_setup_ns() no longer has much to do with setting up a namespace. Instead it's really about starting the shell or other command we want to run with pasta connectivity. Rename it and its argument structure to be less misleading. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Only configure UID/GID mappings in userns when spawning shell	David Gibson	2022-10-15	1	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When in passt mode, or pasta mode spawning a command, we create a userns for ourselves. This is used both to isolate the pasta/passt process itself and to run the spawned command, if any. Since eed17a47 "Handle userns isolation and dropping root at the same time" we've handled both cases the same, configuring the UID and GID mappings in the new userns to map whichever UID we're running as to root within the userns. This mapping is desirable when spawning a shell or other command, so that the user gets a root shell with reasonably clear abilities within the userns and netns. It's not necessarily essential, though. When not spawning a shell, it doesn't really have any purpose: passt itself doesn't need to be root and can operate fine with an unmapped user (using some of the capabilities we get when entering the userns instead). Configuring the uid_map can cause problems if passt is running with any capabilities in the initial namespace, such as CAP_NET_BIND_SERVICE to allow it to forward low ports. In this case the kernel makes files in /proc/pid owned by root rather than the starting user to prevent the user from interfering with the operation of the capability-enhanced process. This includes uid_map meaning we are not able to write to it. Whether this behaviour is correct in the kernel is debatable, but in any case we might as well avoid problems by only initializing the user mappings when we really want them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Replace FWRITE with a function	David Gibson	2022-10-15	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a few places we use the FWRITE() macro to open a file, replace it's contents with a given string and close it again. There's no real reason this needs to be a macro rather than just a function though. Turn it into a function 'write_file()' and make some ancillary cleanups while we're there: - Add a return code so the caller can handle giving a useful error message - Handle the case of short write()s (unlikely, but possible) - Add O_TRUNC, to make sure we replace the existing contents entirely Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Remove unhelpful drop_caps() call in pasta_start_ns()	David Gibson	2022-10-15	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	drop_caps() has a number of bugs which mean it doesn't do what you'd expect. However, even if we fixed those, the call in pasta_start_ns() doesn't do anything useful: * In the common case, we're UID 0 at this point. In this case drop_caps() doesn't accomplish anything, because even with capabilities dropped, we are still privileged. * When attaching to an existing namespace with --userns or --netns-only we might not be UID 0. In this case it's too early to drop all capabilities: we need at least CAP_NET_ADMIN to configure the tap device in the namespace. Remove this call - we will still drop capabilities a little later in sandbox(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta_start_ns() always ends in parent context	David Gibson	2022-10-15	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \|	The end of pasta_start_ns() has a test against pasta_child_pid, testing if we're in the parent or the child. However we started the child running the pasta_setup_ns function which always exec()s or exit()s, so if we return from the clone() we are always in the parent, making that test unnecessary. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: More general way of starting spawned shell as a login shell	David Gibson	2022-10-15	1	-12/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When invoked so as to spawn a shell, pasta checks explicitly for the shell being bash and if so, adds a "-l" option to make it a login shell. This is not ideal, since this is a bash specific option and requires pasta to know about specific shell variants. There's a general convention for starting a login shell, which is to prepend a "-" to argv[0]. Use this approach instead, so we don't need bash specific logic. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Move logging functions to a new file, log.c	Stefano Brivio	2022-10-14	1	-0/+1
\| \| \| \| \| \| \| \|	Logging to file is going to add some further complexity that we don't want to squeeze into util.c. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	clang-tidy: Fix spurious null pointer warning in pasta_start_ns()	David Gibson	2022-09-29	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \|	clang-tidy isn't quite clever enough to figure out that getenv("SHELL") will return the same thing both times here, which makes it conclude that shell could be NULL, causing problems later. It's a bit ugly that we call getenv() twice in any case, so rework this in a way that clang-tidy can figure out shell won't be NULL. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Handle userns isolation and dropping root at the same time	David Gibson	2022-09-13	1	-49/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	passt/pasta can interact with user namespaces in a number of ways: 1) With --netns-only we'll remain in our original user namespace 2) With --userns or a PID option to pasta we'll join either the given user namespace or that of the PID 3) When pasta spawns a shell or command we'll start a new user namespace for the command and then join it 4) With passt we'll create a new user namespace when we sandbox() ourself However (3) and (4) turn out to have essentially the same effect. In both cases we create one new user namespace. The spawned command starts there, and passt/pasta itself will live there from sandbox() onwards. Because of this, we can simplify user namespace handling by moving the userns handling earlier, to the same point we drop root in the original namespace. Extend the drop_user() function to isolate_user() which does both. After switching UID and GID in the original userns, isolate_user() will either join or create the userns we require. When we spawn a command with pasta_start_ns()/pasta_setup_ns() we no longer need to create a userns, because we're already made one. sandbox() likewise no longer needs to create (or join) an userns because we're already in the one we need. We no longer need c->pasta_userns_fd, since the fd is only used locally in isolate_user(). Likewise we can replace c->netns_only with a local in conf(), since it's not used outside there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Correctly handle --netns-only in pasta_start_ns()	David Gibson	2022-09-13	1	-2/+2
\| \| \| \| \| \| \| \| \|	--netns-only is supposed to make pasta use only a network namespace, not a user namespace. However, pasta_start_ns() has this backwards, and if --netns-only is specified it creates a user namespace but not a network namespace. Correct this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Clean up and rename conf_ns_open()	David Gibson	2022-09-13	1	-0/+66
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	conf_ns_open() opens file descriptors for the namespaces pasta needs, but it doesnt really have anything to do with configuration any more. For better clarity, move it to pasta.c and rename it pasta_open_ns(). This makes the symmetry between it and pasta_start_ns() more clear, since these represent the two basic ways that pasta can operate, either attaching to an existing namespace/process or spawning a new one. Since its no longer validating options, the errors it could return shouldn't cause a usage message. Just exit directly with an error instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Move self-isolation code into a separate file	David Gibson	2022-09-13	1	-0/+1
\| \| \| \| \| \| \| \|	passt/pasta contains a number of routines designed to isolate passt from the rest of the system for security. These are spread through util.c and passt.c. Move them together into a new isolation.c file. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Allow pasta to take a command to execute	David Gibson	2022-08-30	1	-10/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When not given an existing PID or network namspace to attach to, pasta spawns a shell. Most commands which can spawn a shell in an altered environment can also run other commands in that same environment, which can be useful in automation. Allow pasta to do the same thing; it can be given an arbitrary command to run in the network and user namespace which pasta creates. If neither a command nor an existing PID or netns to attach to is given, continue to spawn a default shell, as before. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Don't unnecessarily avoid CLOEXEC flags2022_08_24.60ffc5b	David Gibson	2022-08-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are several places in the passt code where we have lint overrides because we're not adding CLOEXEC flags to open or other operations. Comments suggest this is because it's before we fork() into the background but we'll need those file descriptors after we're in the background. However, as the name suggests CLOEXEC closes on exec(), not on fork(). The only place we exec() is either super early invoke the avx2 version of the binary, or when we start a shell in pasta mode, which certainly doesn't require the fds in question. Add the CLOEXEC flag in those places, and remove the lint overrides. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Make substructures for IPv4 and IPv6 specific context information	David Gibson	2022-07-30	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \|	The context structure contains a batch of fields specific to IPv4 and to IPv6 connectivity. Split those out into a sub-structure. This allows the conf_ip4() and conf_ip6() functions, which take the entire context but touch very little of it, to be given more specific parameters, making it clearer what it affects without stepping through the code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Separate IPv4 and IPv6 configuration	David Gibson	2022-07-30	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After recent changes, conf_ip() now has essentially entirely disjoint paths for IPv4 and IPv6 configuration. So, it's cleaner to split them out into different functions conf_ip4() and conf_ip6(). Splitting these out also lets us make the interface a bit nicer, having them return success or failure directly, rather than manipulating c->v4 and c->v6 to indicate success/failure of the two versions. Since these functions may also initialize the interface index for each protocol, it turns out we can then drop c->v4 and c->v6 entirely, replacing tests on those with tests on whether c->ifi4 or c->ifi6 is non-zero (since a 0 interface index is never valid). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Whitespace fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Argument cannot be negative, CWE-687	Stefano Brivio	2022-04-07	1	-17/+8
\| \| \| \| \| \|	Actually harmless. Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Fix android-cloexec-* clang-tidy warnings, re-enable checks	Stefano Brivio	2022-03-29	1	-7/+9
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	seccomp: Adjust list of allowed syscalls for armv6l, armv7l	Stefano Brivio	2022-02-26	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \|	It looks like glibc commonly implements clock_gettime(2) with clock_gettime64(), and uses recv() instead of recvfrom(), send() instead of sendto(), and sigreturn() instead of rt_sigreturn() on armv6l and armv7l. Adjust the list of system calls for armv6l and armv7l accordingly. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: By default, quit if filesystem-bound net namespace goes away	Stefano Brivio	2022-02-21	1	-0/+52
\| \| \| \| \| \| \| \| \| \| \| \|	This should be convenient for users managing filesystem-bound network namespaces: monitor the base directory of the namespace and exit if the namespace given as PATH or NAME target is deleted. We can't add an inotify watch directly on the namespace directory, that won't work with nsfs. Add an option to disable this behaviour, --no-netns-quit. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt, pasta: Namespace-based sandboxing, defer seccomp policy application	Stefano Brivio	2022-02-21	1	-106/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To reach (at least) a conceptually equivalent security level as implemented by --enable-sandbox in slirp4netns, we need to create a new mount namespace and pivot_root() into a new (empty) mountpoint, so that passt and pasta can't access any filesystem resource after initialisation. While at it, also detach IPC, PID (only for passt, to prevent vulnerabilities based on the knowledge of a target PID), and UTS namespaces. With this approach, if we apply the seccomp filters right after the configuration step, the number of allowed syscalls grows further. To prevent this, defer the application of seccomp policies after the initialisation phase, before the main loop, that's where we expect bad things to happen, potentially. This way, we get back to 22 allowed syscalls for passt and 34 for pasta, on x86_64. While at it, move #syscalls notes to specific code paths wherever it conceptually makes sense. We have to open all the file handles we'll ever need before sandboxing: - the packet capture file can only be opened once, drop instance numbers from the default path and use the (pre-sandbox) PID instead - /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection of bound ports in pasta mode, are now opened only once, before sandboxing, and their handles are stored in the execution context - the UNIX domain socket for passt is also bound only once, before sandboxing: to reject clients after the first one, instead of closing the listening socket, keep it open, accept and immediately discard new connection if we already have a valid one Clarify the (unchanged) behaviour for --netns-only in the man page. To actually make passt and pasta processes run in a separate PID namespace, we need to unshare(CLONE_NEWPID) before forking to background (if configured to do so). Introduce a small daemon() implementation, __daemon(), that additionally saves the PID file before forking. While running in foreground, the process itself can't move to a new PID namespace (a process can't change the notion of its own PID): mention that in the man page. For some reason, fork() in a detached PID namespace causes SIGTERM and SIGQUIT to be ignored, even if the handler is still reported as SIG_DFL: add a signal handler that just exits. We can now drop most of the pasta_child_handler() implementation, that took care of terminating all processes running in the same namespace, if pasta started a shell: the shell itself is now the init process in that namespace, and all children will terminate once the init process exits. Issuing 'echo $$' in a detached PID namespace won't return the actual namespace PID as seen from the init namespace: adapt demo and test setup scripts to reflect that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	seccomp: Add a number of alternate and per-arch syscalls	Stefano Brivio	2022-01-26	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Depending on the C library, but not necessarily in all the functions we use, statx() might be used instead of stat(), getdents() instead of getdents64(), readlinkat() instead of readlink(), openat() instead of open(). On aarch64, it's clone() and not fork(), and dup3() instead of dup2() -- just allow the existing alternative instead of dealing with per-arch selections. Since glibc commit 9a7565403758 ("posix: Consolidate fork implementation"), we need to allow set_robust_list() for fork()/clone(), even in a single-threaded context. On some architectures, epoll_pwait() is provided instead of epoll_wait(), but never both. Same with newfstat() and fstat(), sigreturn() and rt_sigreturn(), getdents64() and getdents(), readlink() and readlinkat(), unlink() and unlinkat(), whereas pipe() might not be available, but pipe2() always is, exclusively or not. Seen on Fedora 34: newfstatat() is used on top of fstat(). syslog() is an actual system call on some glibc/arch combinations, instead of a connect()/send() implementation. On ppc64 and ppc64le, _llseek(), recv(), send() and getuid() are used. For ppc64 only: ugetrlimit() for the getrlimit() implementation, plus sigreturn() and fcntl64(). On s390x, additionally, we need to allow socketcall() (on top of socket()), and sigreturn() also for passt (not just for pasta). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>