passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings	David Gibson	2024-09-06	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cppcheck-2.15.0 has apparently broadened when it throws a warning about redundant initialization to include some cases where we have an initializer for some fields, but then set other fields in the function body. This is arguably a false positive: although we are technically overwriting the zero-initialization the compiler supplies for fields not explicitly initialized, this sort of construct makes sense when there are some fields we know at the top of the function where the initializer is, but others that require more complex calculation. That said, in the two places this shows up, it's pretty easy to work around. The results are arguably slightly clearer than what we had, since they move the parts of the initialization closer together. So do that rather than having ugly suppressions or dealing with the tedious process of reporting a cppcheck false positive. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Allow additional system calls for i386/i686	Stefano Brivio	2024-08-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I haven't tested i386 for a long time (after playing with some openSUSE i586 image a couple of years ago). It turns out that a number of system calls we actually need were denied by the seccomp filter, and not even basic functionality works. Add some system calls that glibc started using with the 64-bit time ("t64") transition, see also: https://wiki.debian.org/ReleaseGoals/64bit-time that is: clock_gettime64, timerfd_gettime64, fcntl64, and recvmmsg_time64. Add further system calls that are needed regardless of time_t width, that is, mmap2 (valgrind profile only), _llseek and sigreturn (common outside x86_64), and socketcall (same as s390x). I validated this against an almost full run of the test suite, with just a few selected tests skipped. Fixes needed to run most tests on i386/i686, and other assorted fixes for tests, are included in upcoming patches. Reported-by: Uroš Knupleš <uros@knuples.net> Analysed-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, fwd: Split notion of gateway/router from guest-visible host address	David Gibson	2024-08-21	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default gateway. We use this for two quite distinct things: advertising the gateway that the guest should use (via DHCP, NDP and/or --config-net) and for a limited form of NAT. So that the guest can access services on the host, we map the gateway address within the guest to the loopback address on the host. Using the gateway address for this isn't necessarily the best choice for this purpose, certainly not for all circumstances. So, start off by splitting the notion of these into two different values: @guest_gw which is the gateway address the guest should use and @nat_host_loopback, which is the guest visible address to remap to the host's loopback. Usually nat_host_loopback will have the same value as guest_gw. However when --no-map-gw is specified we leave them unspecified instead. This means when we use nat_host_loopback, we don't need to separately check c->no_map_gw to see if it's relevant. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Rename MAC address fields for clarity	David Gibson	2024-08-21	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Disable neighbour solicitations on device up to prevent DAD	Stefano Brivio	2024-08-18	1	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As soon as we the kernel notifier for IPv6 address configuration (addrconf_notify()) sees that we bring the target interface up (NETDEV_UP), it will schedule duplicate address detection, so, by itself, setting the nodad flag later is useless, because that won't stop a detection that's already in progress. However, if we disable neighbour solicitations with IFF_NOARP (which is a misnomer for IPv6 interfaces, but there's no possibility of mixing things up), the notifier will not trigger DAD, because it can't be done, of course, without neighbour solicitations. Set IFF_NOARP as we bring up the device, and drop it after we had a chance to set the nodad attribute on the link. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink, pasta: Fetch link-local address from namespace interface once it's up	Stefano Brivio	2024-08-18	1	-0/+7
\| \| \| \| \| \| \| \| \| \|	As soon as we bring up the interface, the Linux kernel will set up a link-local address for it, so we can fetch it and start using right away, if we need a link-local address to communicate to the container before we see any traffic coming from it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink, pasta: Disable DAD for link-local addresses on namespace interface	Stefano Brivio	2024-08-18	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It makes no sense for a container or a guest to try and perform duplicate address detection for their link-local address, as we'll anyway not relay neighbour solicitations with an unspecified source address. While they perform duplicate address detection, the link-local address is not usable, which prevents us from bringing up especially containers and communicate with them right away via IPv6. This is not enough to prevent DAD and reach the container right away: we'll need a couple more patches. As we send NLM_F_REPLACE requests right away, while we still have to read out other addresses on the same socket, we can't use nl_do(): keep track of the last sequence we sent (last address we changed), and deal with the answers to those NLM_F_REPLACE requests in a separate loop, later. Link: https://github.com/containers/podman/pull/23561#discussion_r1711639663 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink, pasta: Turn nl_link_up() into a generic function to set link flags	Stefano Brivio	2024-08-15	1	-2/+2
\| \| \| \| \| \| \|	In the next patches, we'll reuse it to set flags other than IFF_UP. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink, pasta: Split MTU setting functionality out of nl_link_up()	Stefano Brivio	2024-08-15	1	-2/+5
\| \| \| \| \| \| \| \| \|	As we'll use nl_link_up() for more than just bringing up devices, it will become awkward to carry empty MTU values around whenever we call it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, pasta: Make -g and -a skip route/addresses copy for matching IP ↵	Stefano Brivio	2024-08-07	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	version only Paul reports that setting IPv4 address and gateway manually, using --address and --gateway, causes pasta to fail inserting IPv6 routes in a setup where multiple, inter-dependent IPv6 routes are present on the host. That's because, currently, any -g option implies --no-copy-routes altogether, and any -a implies --no-copy-addrs. Limit this implication to the matching IP version, instead, by having two copies of no_copy_routes and no_copy_addrs in the context structure, separately for IPv4 and IPv6. While at it, change them to 'bool': we had them as 'int' because getopt_long() used to set them directly, but it hasn't been the case for a while already. Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	pasta: Save errno on signal handler entry, restore on return when needed	Stefano Brivio	2024-08-05	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ed reported this: # Error: pasta failed with exit code 1: # Couldn't drop cap 3 from bounding set # : No child processes in a Podman CI run with tests being run in parallel. The error message itself, by the way, is fixed by commit 1cd773081f12 ("log: Drop newlines in the middle of the perror()-like messages"), but how can we possibly get ECHILD as failure code for prctl()? Well, we don't, but if we exit early enough, pasta_child_handler() might run before we're even done with isolation steps, and it calls waitid(), which sets errno. We need to restore it before returning from the signal handler (if we return after calling functions that might set it), as signal-safety(7) also implies: Fetching and setting the value of errno is async-signal-safe provided that the signal handler saves errno on entry and restores its value before returning. Eventually, we'll probably need to switch to signalfd(2) the day we want to implement multithreading, but this will do for the moment. Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/23478 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	pasta: modify hostname when detaching new namespace	Danish Prakash	2024-07-30	1	-0/+11
\| \| \| \| \| \| \| \| \| \| \|	When invoking pasta without any arguments, it's difficult to tell whether we are in the new namespace or not leaving users a bit confused. This change modifies the host namespace to add a prefix "pasta-" to make it a bit more obvious. Signed-off-by: Danish Prakash <contact@danishpraka.sh> [sbrivio: coding style fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Replace strerror() calls	Stefano Brivio	2024-06-21	1	-10/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we have logging functions embedding perror() functionality, we can make _some_ calls more terse by using them. In many places, the strerror() calls are still more convenient because, for example, they are used in flow debugging functions, or because the return code variable of interest is not 'errno'. While at it, convert a few error messages from a scant perror style to proper failure descriptions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Replace perror() calls with calls to logging functions	Stefano Brivio	2024-06-21	1	-6/+3
\| \| \| \| \| \| \| \| \| \| \| \|	perror() prints directly to standard error, but in many cases standard error might be already closed, or we might want to skip logging, based on configuration. Our logging functions provide all that. While at it, make errors more descriptive, replacing some of the existing basic perror-style messages. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	cppcheck: Suppress constParameterCallback errors	David Gibson	2024-06-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several functions which are used as callbacks for NS_CALL() which only read their void * parameter, they don't write it. The constParameterCallback warning in cppcheck 2.14.1 complains that this parameter could be const void , also pointing out that that would require casting the function pointer when used as a callback. Casting the function pointers seems substantially uglier than using a non-const void as the parameter, especially since in each case we cast the void * to a const pointer of specific type immediately. So, suppress these errors. I think it would make logical sense to suppress this globally, but that would cause unmatchedSuppression errors on earlier cppcheck versions. So, instead individually suppress it, along with unmatchedSuppression in the relevant places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta, util: Align stack area for clones to maximum natural alignment	Stefano Brivio	2024-04-19	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Given that we use this stack pointer as a location to store arbitrary data types from the cloned process, we need to guarantee that its alignment matches any of those possible data types. runsisi reports that pasta gets a SIGBUS in pasta_open_ns() on aarch64, where the alignment requirement for stack pointers is a 16 bytes (same as the size of a long double), and similar requirements actually apply to most architectures we run on. Reported-by: runsisi <runsisi@hust.edu.cn> Link: https://bugs.passt.top/show_bug.cgi?id=85 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Compilers' name for armv6l and armv7l is "arm"	Stefano Brivio	2024-04-11	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When I switched from 'uname -m' to 'gcc -dumpmachine' to fetch the architecture name for, among others, seccomp.sh, I didn't realise that "armv6l" and "armv7l" are just Linux kernel names -- compilers just call that "arm". Fix the "syscalls" annotation we use to define seccomp profiles accordingly, otherwise pasta will be terminated on sigreturn() on armv6l and armv7l. Fixes: 213c397492bd ("passt, pasta: Run-time selection of AVX2 build") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Don't try to watch namespaces in procfs with inotify, use timer instead2024_02_19.ff22a78	Stefano Brivio	2024-02-19	1	-5/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We watch network namespace entries to detect when we should quit (unless --no-netns-quit is passed), and these might stored in a tmpfs typically mounted at /run/user/UID or /var/run/user/UID, or found in procfs at /proc/PID/ns/. Currently, we try to use inotify for any possible location of those entries, but inotify, of course, doesn't work on pseudo-filesystems (see inotify(7)). The man page reflects this: the description of --no-netns-quit implies that we won't quit anyway if the namespace is not "bound to the filesystem". Well, we won't quit, but, since commit 9e0dbc894813 ("More deterministic detection of whether argument is a PID, PATH or NAME"), we try. And, indeed, this is harmless, as the caveat from that commit message states. Now, it turns out that Buildah, a tool to create container images, sharing its codebase with Podman, passes a procfs entry to pasta, and expects pasta to exit once the network namespace is not needed anymore, that is, once the original container process, also spawned by Buildah, terminates. Get this to work by using the timer fallback mechanism if the namespace name is passed as a path belonging to a pseudo-filesystem. This is expected to be procfs, but I covered sysfs and devpts pseudo-filesystems as well, because nothing actually prevents creating this kind of directory structure and links there. Note that fstatfs(), according to some versions of man pages, was apparently "deprecated" by the LSB. My reasoning for using it is essentially this: https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u ...that is, there was no such thing as an LSB deprecation, and anyway there's no other way to get the filesystem type. Also note that, while it might sound more obvious to detect the filesystem type using fstatfs() on the file descriptor itself (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no matter what path was given to pasta. If we use the parent directory, we'll typically have either tmpfs or procfs reported. If the target namespace is given as a PID, or as a PID-based procfs entry, we don't risk races if this PID is recycled: our handle on /proc/PID/ns will always refer to the original namespace associated with that PID, and we don't re-open this entry from procfs to check it. There's, however, a remaining race possibility if the parent process is not the one associated to the network namespace we operate on: in that case, the parent might pass a procfs entry associated to a PID that was recycled by the time we parse it. This can't happen if the namespace PID matches the one of the parent, because we detach from the controlling terminal after parsing the namespace reference. To avoid this type of race, if desired, we could add the option for the parent to pass a PID file descriptor, that the parent obtained via pidfd_open(). This is beyond the scope of this change. Update the man page to reflect that, even if the target network namespace is passed as a procfs path or a PID, we'll now quit when the procfs entry is gone. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Add fallback timer mechanism to check if namespace is gone	Stefano Brivio	2024-02-16	1	-21/+80
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We don't know how frequently this happens, but hitting fs.inotify.max_user_watches or similar sysctl limits is definitely not out of question, and Paul mentioned that, for example, Podman's CI environments hit similar issues in the past. Introduce a fallback mechanism based on a timer file descriptor: we grab the directory handle at startup, and we can then use openat(), triggered periodically, to check if the (network) namespace directory still exists. If openat() fails at some point, exit. Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707 Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Enable format warnings	David Gibson	2023-11-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	logmsg() takes printf like arguments, but because it's not a built in, the compiler won't generate warnings if the format string and parameters don't match. Enable those by using the format attribute. Strictly speaking this is a gcc extension, but I believe it is also supported by some other common compilers. We already use some other attributes in various places. For now, just use it and we can worry about compilers that don't support it if it comes up. This exposes some warnings from existing callers, both in gcc and in clang-tidy: - Some are straight out bugs, which we correct - It's occasionally useful to invoke the logging functions with an empty string, which gcc objects to, so disable that specific warning in the Makefile - Strictly speaking the C standard requires that the parameter for a %p be a (void *), not some other pointer type. That's only likely to cause problems in practice on weird architectures with different sized representations for pointers to different types. Nonetheless add the casts to make it happy. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Make many pointers const	David Gibson	2023-10-04	1	-2/+2
\| \| \| \| \| \| \| \| \|	Newer versions of cppcheck (as of 2.12.0, at least) added a warning for pointers which could be declared to point at const data, but aren't. Based on that, make many pointers throughout the codebase const. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Don't pre-fill IPv4 destination address in headers	David Gibson	2023-08-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because packets sent on the tap interface will always be going to the guest/namespace, we more-or-less know what address they'll be going to. So we pre-fill this destination address in our header buffers for IPv4. We can't do the same for IPv6 because we could need either the global or link-local address for the guest. In future we're going to want more flexibility for the destination address, so this pre-filling will get in the way. Change the flow so we always fill in the IPv4 destination address for each packet, rather than prefilling it from proto_update_l2_buf(). In fact for TCP we already redundantly filled the destination for each packet anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Always use epoll_ref for the epoll data variable	David Gibson	2023-08-13	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	epoll_ref contains a variety of information useful when handling epoll events on our sockets, and we place it in the epoll_event data field returned by epoll. However, for a few other things we use the 'fd' field in the standard union of types for that data field. This actually introduces a bug which is vanishingly unlikely to hit in practice, but very nasty if it ever did: theoretically if we had a very large file descriptor number for fd_tap or fd_tap_listen it could overflow into bits that overlap with the 'proto' field in epoll_ref. With some very bad luck this could mean that we mistakenly think an event on a regular socket is an event on fd_tap or fd_tap_listen. More practically, using different (but overlapping) fields of the epoll_data means we can't unify dispatch for the various different objects in the epoll. Therefore use the same epoll_ref as the data for the tap fds and the netns quit fd, adding new fd type values to describe them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Propagate errors for "dup" operations	David Gibson	2023-08-04	1	-10/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We now detect errors on netlink "set" operations while configuring the pasta namespace with --config-net. However in many cases rather than a simple "set" we use a more complex "dup" function to copy configuration from the host to the namespace. We're not yet properly detecting and reporting netlink errors for that case. Change the "dup" operations to propagate netlink errors to their caller, pasta_ns_conf() and report them there. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Propagate errors for "set" operations	David Gibson	2023-08-04	1	-10/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if anything goes wrong while we're configuring the namespace network with --config-net, we'll just ignore it and carry on. This might lead to a silently unconfigured or misconfigured namespace environment. For simple "set" operations based on nl_do() we can now detect failures reported via netlink. Propagate those errors up to pasta_ns_conf() and report them usefully. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Explicitly pass netlink sockets to operations	David Gibson	2023-08-04	1	-19/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	All the netlink operations currently implicitly use one of the two global netlink sockets, sometimes depending on an 'ns' parameter. Change them all to explicitly take the socket to use (or two sockets to use in the case of the *_dup() functions). As well as making these functions strictly more general, it makes the callers easier to follow because we're passing a socket variable with a name rather than an unexplained '0' or '1' for the ns parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_route() into separate operation functions	David Gibson	2023-08-04	1	-6/+10
\| \| \| \| \| \| \| \| \| \|	nl_route() can perform 3 quite different operations based on the 'op' parameter. Split this into separate functions for each one. This requires more lines of code, but makes the internal logic of each operation much easier to follow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_addr() into separate operation functions	David Gibson	2023-08-04	1	-6/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_addr() can perform three quite different operations based on the 'op' parameter, each of which uses a different subset of the parameters. Split them up into a function for each operation. This does use more lines of code, but the overlap wasn't that great, and the separated logic is much easier to follow. It's also clearer in the callers what we expect the netlink operations to do, and what information it uses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting fixes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split up functionality of nl_link()	David Gibson	2023-08-04	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_link() performs a number of functions: it can bring links up, set MAC address and MTU and also retrieve the existing MAC. This makes for a small number of lines of code, but high conceptual complexity: it's quite hard to follow what's going on both in nl_link() itself and it's also not very obvious which function its callers are intending to use. Clarify this, by splitting nl_link() into nl_link_up(), nl_link_set_mac(), and nl_link_get_mac(). The first brings up a link, optionally setting the MTU, the others get or set the MAC address. This fixes an arguable bug in pasta_ns_conf(): it looks as though that was intended to retrieve the guest MAC whether or not c->pasta_conf_ns is set. However, it only actually does so in the !c->pasta_conf_ns case: the fact that we set up==1 means we would only ever set, never get, the MAC in the nl_link() call in the other path. We get away with this because the MAC will quickly be discovered once we receive packets on the tap interface. Still, it's neater to always get the MAC address here. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: include errno in error message2023_06_25.32660ce	Paul Holzinger	2023-06-25	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \|	When the open() or setns() calls fails pasta exits early and prints an error. However it did not include the errno so it was impossible to know why the syscall failed. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Split print to fit 80 columns in pasta_open_ns()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, pasta: With --config-net, copy all addresses by default	Stefano Brivio	2023-05-23	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the newly-introduced NL_DUP mode for nl_addr() to copy all the addresses associated to the template interface in the outer namespace, unless --no-copy-addrs (also implied by -a) is given. This option is introduced as deprecated right away: it's not expected to be of any use, but it's helpful to keep it around for a while to debug any suspected issue with this change. This is done mostly for consistency with routes. It might partially cover the issue at: https://bugs.passt.top/show_bug.cgi?id=47 Support multiple addresses per address family for some use cases, but not the originally intended one: we'll still use a single outbound address (unless the routing table specifies different preferred source addresses depending on the destination), regardless of the address used in the target namespace. Link: https://bugs.passt.top/show_bug.cgi?id=47 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Add functionality to copy addresses from outer namespace	Stefano Brivio	2023-05-23	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Similarly to what we've just done with routes, support NL_DUP for addresses (currently not exposed): nl_addr() can optionally copy mulitple addresses to the target namespace, by fixing up data from the dump with appropriate flags and interface index, and repeating it back to the kernel on the socket opened in the target namespace. Link-local addresses are not copied: the family is set to AF_UNSPEC, which means the kernel will ignore them. Same for addresses from a mismatching address (pre-4.19 kernels without support for NETLINK_GET_STRICT_CHK). Ignore IFA_LABEL attributes by changing their type to IFA_UNSPEC, because in general they will report mismatching names, and we don't really need to use labels as we already know the interface index. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, pasta: With --config-net, copy all routes by default	Stefano Brivio	2023-05-23	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the newly-introduced NL_DUP mode for nl_route() to copy all the routes associated to the template interface in the outer namespace, unless --no-copy-routes (also implied by -g) is given. This option is introduced as deprecated right away: it's not expected to be of any use, but it's helpful to keep it around for a while to debug any suspected issue with this change. Otherwise, we can't use default gateways which are not, address-wise, on the same subnet as the container, as reported by Callum. Reported-by: Callum Parsey <callum@neoninteger.au> Link: https://github.com/containers/podman/issues/18539 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Add functionality to copy routes from outer namespace	Stefano Brivio	2023-05-23	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of just fetching the default gateway and configuring a single equivalent route in the target namespace, on 'pasta --config-net', it might be desirable in some cases to copy the whole set of routes corresponding to a given output interface. For instance, in: https://github.com/containers/podman/issues/18539 IPv4 Default Route Does Not Propagate to Pasta Containers on Hetzner VPSes configuring the default gateway won't work without a gateway-less route (specifying the output interface only), because the default gateway is, somewhat dubiously, not on the same subnet as the container. This is a similar case to the one covered by commit 7656a6f88882 ("conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway"), and I'm not exactly proud of that workaround. We also have: https://bugs.passt.top/show_bug.cgi?id=49 pasta does not work with tap-style interface for which, eventually, we should be able to configure a gateway-less route in the target namespace. Introduce different operation modes for nl_route(), including a new NL_DUP one, not exposed yet, which simply parrots back to the kernel the route dump for a given interface from the outer namespace, fixing up flags and interface indices on the way, and requesting to add the same routes in the target namespace, on the interface we manage. For n routes we want to duplicate, send n identical netlink requests including the full dump: routes might depend on each other and the kernel processes RTM_NEWROUTE messages sequentially, not atomically, and repeating the full dump naturally resolves dependencies without the need to actually calculate them. I'm not kidding, it actually works pretty well. Link: https://github.com/containers/podman/issues/18539 Link: https://bugs.passt.top/show_bug.cgi?id=49 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	pasta: Improve error handling on failure to join network namespace	Stefano Brivio	2023-05-23	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pasta_wait_for_ns(), open() failing with ENOENT is expected: we're busy-looping until the network namespace appears. But any other failure is not something we're going to recover from: return right away if we don't get either success or ENOENT. Now that pasta_wait_for_ns() can actually fail, handle that in pasta_start_ns() by reporting the issue and exiting. Looping on EPERM, when pasta doesn't actually have the permissions to join a given namespace, isn't exactly a productive thing to do. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Detach mount namespace, (re)mount procfs before spawning command	Stefano Brivio	2023-05-23	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we want /proc contents to be consistent after pasta spawns a child process in a new PID namespace (only for operation without a pre-existing namespace), we need to mount /proc after the clone(2) call with CLONE_NEWPID, and we enable the child to do that by passing, in the same call, the CLONE_NEWNS flag, as described by pid_namespaces(7). This is not really a remount: in fact, passing MS_REMOUNT to mount(2) would make the call fail. We're in another mount namespace now, so it's a fresh mount that has the effect of hiding the existing one. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	passt: Relicense to GPL 2.0, or any later version	Stefano Brivio	2023-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	convert all remaining err() followed by exit() to die()	Laine Stump	2023-02-16	1	-13/+7
\| \| \| \| \| \| \| \|	This actually leaves us with 0 uses of err(), but someone could want to use it in the future, so we may as well leave it around. Signed-off-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: propagate exit code from child command	Paul Holzinger	2023-02-12	1	-2/+10
\| \| \| \| \| \| \| \| \| \| \| \|	Exits codes are very useful for scripts, when the pasta child execvp() call fails with ENOENT that parent should also exit with > 0. In short the parent should always exit with the code from the child to make it useful in scripts. It is easy to test with: `pasta -- bash -c "exit 3"; echo $?` Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: correctly exit when execvp() fails	Paul Holzinger	2023-02-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	By default clone() will create a child that does not send SIGCHLD when the child exits. The caller has to specifiy the SIGNAL it should get in the flag bitmask. see clone(2) under "The child termination signal" This fixes the problem where pasta would not exit when the execvp() call failed, i.e. when the command does not exists. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Wait for tap to be set up before spawning command	Stefano Brivio	2023-02-12	1	-1/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adapted from a patch by Paul Holzinger: when pasta spawns a command, operating without a pre-existing user and network namespace, it needs to wait for the tap device to be configured and its handler ready, before the command is actually executed. Otherwise, something like: pasta --config-net nslookup passt.top usually fails as the nslookup command is issued before the network interface is ready. We can't adopt a simpler approach based on SIGSTOP and SIGCONT here: the child runs in a separate PID namespace, so it can't send SIGSTOP to itself as the kernel sees the child as init process and blocks the delivery of the signal. We could send SIGSTOP from the parent, but this wouldn't avoid the possible condition where the child isn't ready to wait for it when the parent sends it, also raised by Paul -- and SIGSTOP can't be blocked, so it can never be pending. Use SIGUSR1 instead: mask it before clone(), so that the child starts with it blocked, and can safely wait for it. Once the parent is ready, it sends SIGUSR1 to the child. If SIGUSR1 is sent before the child is waiting for it, the kernel will queue it for us, because it's blocked. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 1392bc5ca002 ("Allow pasta to take a command to execute") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util, pasta: Add do_clone() wrapper around __clone2() and clone()	Stefano Brivio	2022-11-16	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Spotted in Debian's buildd logs: on ia64, clone(2) is not available: the glibc wrapper is named __clone2() and it takes, additionally, the size of the stack area passed by the caller. Add a do_clone() wrapper handling the different cases, and also taking care of pointing the child's stack in the middle of the allocated area: on PA-RISC (hppa), handled by clone(), the stack grows up, and on ia64 the stack grows down, but the register backing store grows up -- and I think it might be actually used here. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Minor improvements to IPv4 netmask handling	David Gibson	2022-11-04	1	-5/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are several minor problems with our parsing of IPv4 netmasks (-n). First, we don't reject nonsensical netmasks like 0.255.0.255. Address this structurally by using prefix length instead of netmask as the primary variable, only converting (and validating) when we need to. This has the added benefit of making some things more uniform with the IPv6 path. Second, when the user specifies a prefix length, we truncate the output from strtol() to an integer, which means we would treat -n 4294967320 as valid (equivalent to 24). Fix types to check for this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Rename pasta_setup_ns() to pasta_spawn_cmd()	David Gibson	2022-10-15	1	-9/+9
\| \| \| \| \| \| \| \| \| \|	pasta_setup_ns() no longer has much to do with setting up a namespace. Instead it's really about starting the shell or other command we want to run with pasta connectivity. Rename it and its argument structure to be less misleading. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Only configure UID/GID mappings in userns when spawning shell	David Gibson	2022-10-15	1	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When in passt mode, or pasta mode spawning a command, we create a userns for ourselves. This is used both to isolate the pasta/passt process itself and to run the spawned command, if any. Since eed17a47 "Handle userns isolation and dropping root at the same time" we've handled both cases the same, configuring the UID and GID mappings in the new userns to map whichever UID we're running as to root within the userns. This mapping is desirable when spawning a shell or other command, so that the user gets a root shell with reasonably clear abilities within the userns and netns. It's not necessarily essential, though. When not spawning a shell, it doesn't really have any purpose: passt itself doesn't need to be root and can operate fine with an unmapped user (using some of the capabilities we get when entering the userns instead). Configuring the uid_map can cause problems if passt is running with any capabilities in the initial namespace, such as CAP_NET_BIND_SERVICE to allow it to forward low ports. In this case the kernel makes files in /proc/pid owned by root rather than the starting user to prevent the user from interfering with the operation of the capability-enhanced process. This includes uid_map meaning we are not able to write to it. Whether this behaviour is correct in the kernel is debatable, but in any case we might as well avoid problems by only initializing the user mappings when we really want them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Replace FWRITE with a function	David Gibson	2022-10-15	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a few places we use the FWRITE() macro to open a file, replace it's contents with a given string and close it again. There's no real reason this needs to be a macro rather than just a function though. Turn it into a function 'write_file()' and make some ancillary cleanups while we're there: - Add a return code so the caller can handle giving a useful error message - Handle the case of short write()s (unlikely, but possible) - Add O_TRUNC, to make sure we replace the existing contents entirely Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Remove unhelpful drop_caps() call in pasta_start_ns()	David Gibson	2022-10-15	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	drop_caps() has a number of bugs which mean it doesn't do what you'd expect. However, even if we fixed those, the call in pasta_start_ns() doesn't do anything useful: * In the common case, we're UID 0 at this point. In this case drop_caps() doesn't accomplish anything, because even with capabilities dropped, we are still privileged. * When attaching to an existing namespace with --userns or --netns-only we might not be UID 0. In this case it's too early to drop all capabilities: we need at least CAP_NET_ADMIN to configure the tap device in the namespace. Remove this call - we will still drop capabilities a little later in sandbox(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta_start_ns() always ends in parent context	David Gibson	2022-10-15	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \|	The end of pasta_start_ns() has a test against pasta_child_pid, testing if we're in the parent or the child. However we started the child running the pasta_setup_ns function which always exec()s or exit()s, so if we return from the clone() we are always in the parent, making that test unnecessary. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: More general way of starting spawned shell as a login shell	David Gibson	2022-10-15	1	-12/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When invoked so as to spawn a shell, pasta checks explicitly for the shell being bash and if so, adds a "-l" option to make it a login shell. This is not ideal, since this is a bash specific option and requires pasta to know about specific shell variants. There's a general convention for starting a login shell, which is to prepend a "-" to argv[0]. Use this approach instead, so we don't need bash specific logic. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Move logging functions to a new file, log.c	Stefano Brivio	2022-10-14	1	-0/+1
\| \| \| \| \| \| \| \|	Logging to file is going to add some further complexity that we don't want to squeeze into util.c. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>