passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	netlink: Propagate errors for "set" operations	David Gibson	2023-08-04	3	-24/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if anything goes wrong while we're configuring the namespace network with --config-net, we'll just ignore it and carry on. This might lead to a silently unconfigured or misconfigured namespace environment. For simple "set" operations based on nl_do() we can now detect failures reported via netlink. Propagate those errors up to pasta_ns_conf() and report them usefully. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Add nl_foreach_oftype to filter response message types	David Gibson	2023-08-04	1	-15/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In most cases where processing response messages, we expect only one type of message (excepting NLMSG_DONE or NLMSG_ERROR), and so we need a test and continue to skip anything else. Add a helper macro to do this. This also fixes a bug in nl_get_ext_if() where we didn't have such a test and if we got a message other than RTM_NEWROUTE we would have parsed its contents as nonsense. Also add a warning message if we get such an unexpected message type, which could be useful for debugging if we ever hit it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_req() to allow processing multiple response datagrams	David Gibson	2023-08-04	1	-68/+113
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently nl_req() sends the request, and receives a single response datagram which we then process. However, a single request can result in multiple response datagrams. That happens nearly all the time for DUMP requests, where the 'DONE' message usually comes in a second datagram after the NEW{LINK\|ADDR\|ROUTE} messages. It can also happen if there are just too many objects to dump in a single datagram. Allow our netlink code to process multiple response datagrams by splitting nl_req() into three different helpers: nl_send() just sends a request, without getting a response. nl_status() checks a single message to see if it indicates the end of the reponses for our request. nl_next() moves onto the next response message, whether it's in a datagram we already received or we need to recv() a new one. We also add a 'for'-style macro to use these to step through every response message to a request across multiple datagrams. While we're at it, be more thourough with checking that our sequence numbers are in sync. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Clearer reasoning about the netlink response buffer size	David Gibson	2023-08-04	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we set NLBUFSIZ large enough for 8192 netlink headers (128kiB in total), and reference netlink(7). However netlink(7) says nothing about reponse buffer sizes, and the documents which do reference 8192 bytes not 8192 headers. Update NLBUFSIZ to 64kiB with a more detailed rationale. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Add nl_do() helper for simple operations with error checking	David Gibson	2023-08-04	1	-12/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	So far we never checked for errors reported on netlink operations via NLMSG_ERROR messages. This has led to several subtle and tricky to debug situations which would have been obvious if we knew that certain netlink operations had failed. Introduce a nl_do() helper that performs netlink "do" operations (that is making a single change without retreiving complex information) with much more thorough error checking. As well as returning an error code if we get an NLMSG_ERROR message, we also check for unexpected behaviour in several places. That way if we've made a mistake in our assumptions about how netlink works it should result in a clear error rather than some subtle misbehaviour. We update those calls to nl_req() that can use the new wrapper to do so. We will extend those to better handle errors in future. We don't touch non-"do" operations for now, those are a bit trickier. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Fill in netlink header fields from nl_req()	David Gibson	2023-08-04	1	-84/+42
\| \| \| \| \| \| \| \| \| \|	Currently netlink functions need to fill in a full netlink header, as well as a payload then call nl_req() to submit that to the kernel. It makes things a bit terser if we just give the relevant header fields as parameters to nl_req() and have it complete the header. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Treat send() or recv() errors as fatal	David Gibson	2023-08-04	1	-19/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Errors on send() or recv() calls on a netlink socket don't indicate errors with the netlink operations we're attempting, but rather that something's gone wrong with the mechanics of netlink itself. We don't really expect this to ever happen, and if it does, it's not clear what we could to to recover. So, treat errors from these calls as fatal, rather than returning the error up the stack. This makes handling failures in the callers of nl_req() simpler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Start sequence number from 1 instead of 0	David Gibson	2023-08-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Netlink messages have a sequence number that's used to match requests to responses. It mostly doesn't matter what it is as long as it monotonically increases, so we just use a global counter which we advance with each request. However, we start this counter at 0, so our very first request has sequence number 0, which is usually reserved for asynchronous messages from the kernel which aren't in response to a specific request. Since we don't (for now) use such async messages, this doesn't really matter, but it's not good practce. So start the sequence at 1 instead. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Make nl_*_dup() use a separate datagram for each request	David Gibson	2023-08-04	1	-23/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_req() is designed to handle a single netlink request message: it only receives a single reply datagram for the request, and only waits for a single NLMSG_DONE or NLMSG_ERROR message at the beginning to clear out things from previous requests. However, in both nl_addr_dup() and nl_route_dup() we can send multiple request messages as a single datagram, with a single nl_req() call. This can easily mean that the replies nl_req() collects get out of sync with requests. We only get away with this because after we call these functions we don't make any netlink calls where we need to parse the replies. This is fragile, so alter nl_*_dup() to make an nl_req() call for each address it is adding in the target namespace. For nl_route_dup() this fixes an additional minor problem: because routes can have dependencies, some of the route add requests might fail on the first attempt, so we need to repeat the requests a number of times. When we did that, we weren't updating the sequence number on each new attempt. This works, but not updating the sequence number for each new request isn't ideal. Now that we're making the requests one at a time, it's easier to make sure we update the sequence number each time. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Explicitly pass netlink sockets to operations	David Gibson	2023-08-04	4	-76/+105
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	All the netlink operations currently implicitly use one of the two global netlink sockets, sometimes depending on an 'ns' parameter. Change them all to explicitly take the socket to use (or two sockets to use in the case of the *_dup() functions). As well as making these functions strictly more general, it makes the callers easier to follow because we're passing a socket variable with a name rather than an unexplained '0' or '1' for the ns parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Use struct in_addr for IPv4 addresses, not bare uint32_t	David Gibson	2023-08-04	1	-6/+6
\| \| \| \| \| \| \| \|	This improves consistency with IPv6 and makes it harder to misuse these as some other sort of value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_route() into separate operation functions	David Gibson	2023-08-04	4	-105/+163
\| \| \| \| \| \| \| \| \| \|	nl_route() can perform 3 quite different operations based on the 'op' parameter. Split this into separate functions for each one. This requires more lines of code, but makes the internal logic of each operation much easier to follow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_addr() into separate operation functions	David Gibson	2023-08-04	4	-108/+162
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_addr() can perform three quite different operations based on the 'op' parameter, each of which uses a different subset of the parameters. Split them up into a function for each operation. This does use more lines of code, but the overlap wasn't that great, and the separated logic is much easier to follow. It's also clearer in the callers what we expect the netlink operations to do, and what information it uses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting fixes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split up functionality of nl_link()	David Gibson	2023-08-04	4	-66/+95
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_link() performs a number of functions: it can bring links up, set MAC address and MTU and also retrieve the existing MAC. This makes for a small number of lines of code, but high conceptual complexity: it's quite hard to follow what's going on both in nl_link() itself and it's also not very obvious which function its callers are intending to use. Clarify this, by splitting nl_link() into nl_link_up(), nl_link_set_mac(), and nl_link_get_mac(). The first brings up a link, optionally setting the MTU, the others get or set the MAC address. This fixes an arguable bug in pasta_ns_conf(): it looks as though that was intended to retrieve the guest MAC whether or not c->pasta_conf_ns is set. However, it only actually does so in the !c->pasta_conf_ns case: the fact that we set up==1 means we would only ever set, never get, the MAC in the nl_link() call in the other path. We get away with this because the MAC will quickly be discovered once we receive packets on the tap interface. Still, it's neater to always get the MAC address here. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Remove unnecessary global tun_ns_fd	David Gibson	2023-08-04	1	-7/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tap_ns_tun(), which runs in an ephemeral thread puts the fd it opens into the global variable tun_ns_fd to communicate it back to the main thread in tap_sock_tun_init(). However, the only thing tap_sock_tun_init() does with it is copies it to c->fd_tap and everything else uses it from there. tap_ns_tun() already has access to the context structure, so we might as well store the value directly in there rather than having a global as an intermediate. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: More detailed error reporting in tap_ns_tun()	David Gibson	2023-08-04	1	-9/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are several possible failure points in tap_ns_tun(), but if anything goes wrong, we just set tun_ns_fd to -1 resulting in the same error message. Add more detailed error reporting to the various failure points. At the same time, we know this is only called from tap_sock_tun_init() which will terminate pasta if we fail, so we can simplify things a little because we don't need to close() the fd on the failure paths. Link: https://bugs.passt.top/show_bug.cgi?id=69 Link: https://github.com/containers/podman/issues/19428 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Make ns_enter() a void function and report setns() errors	David Gibson	2023-08-04	5	-13/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ns_enter() returns an integer... but it's always zero. If we actually fail the function doesn't return. Therefore it makes more sense for this to be a function returning void, and we can remove the cases where we pointlessly checked its return value. In addition ns_enter() is usually called from an ephemeral thread created by NS_CALL(). That means that the exit(EXIT_FAILURE) there usually won't be reported (since NS_CALL() doesn't wait() for the thread). So, use die() instead to print out some information in the unlikely event that our setns() here does fail. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use static assertion to verify that union epoll_ref is the right size	David Gibson	2023-08-04	1	-0/+4
\| \| \| \| \| \| \| \| \| \|	union epoll_ref is used to subdivide the 64-bit data field in struct epoll_event. Thus it must fit within that field or we're likely to get very subtle and nasty bugs. C11 introduces the notion of static assertions which we can use to verify this is the case at compile time. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use C11 anonymous members to make poll refs less verbose to use	David Gibson	2023-08-04	10	-78/+73
\| \| \| \| \| \| \| \| \| \| \| \|	union epoll_ref has a deeply nested set of structs and unions to let us subdivide it into the various different fields we want. This means that referencing elements can involve an awkward long string of intermediate fields. Using C11 anonymous structs and unions lets us do this less clumsily. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Allow C11 code, not just C99 code	David Gibson	2023-08-04	1	-2/+2
\| \| \| \| \| \| \| \|	C11 has some features that will allow us to make some things a bit cleaner. Alter the Makefile to tell the compiler to allow us to use C11 code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Revert "MAKE: Fix parallel builds; .o files; .gitignore; new makedocs"	Stefano Brivio	2023-07-10	3	-45/+26
\| \| \| \| \| \| \| \|	This reverts commit cc2a6bec3cf2ff6ed0c043ada93d352466614373: I applied that patch by mistake. Fixes: cc2a6bec3cf2 ("MAKE: Fix parallel builds; .o files; .gitignore; new makedocs") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	MAKE: Fix parallel builds; .o files; .gitignore; new makedocs	KuhnChris	2023-07-07	3	-26/+45
\|
*	tap: Explicitly drop IPv4 fragments, and give a warning	David Gibson	2023-07-07	1	-0/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We don't handle defragmentation of IP packets coming from the tap side, and we're unlikely to any time soon (with our large MTU, it's not useful for practical use cases). Currently, however, we simply ignore the fragmentation flags and treat fragments as though they were whole IP packets. This isn't ideal and can lead to rather cryptic behaviour if we do receive IP fragments. Change the code to explicitly drop fragmented packets, and print a rate limited warning if we do encounter them. Link: https://bugs.passt.top/show_bug.cgi?id=62 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Correct length checking of interface names in conf_ports()	David Gibson	2023-06-28	1	-3/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When interface names are specified in forwarding specs, we need to check the length of the given interface name against the limit of IFNAMSIZ - 1 (15) characters. However, we managed to have 3 separate off-by-one errors here meaning we only accepted interface names up to 12 characters. 1. At the point of the check 'ifname' was still on the '%' character, not the first character of the name, meaning we overestimated the length by one 2. At the point of the check 'spec' had been advanced one character past the '/' which terminates the interface name, meaning we overestimated the length by another one 3. We checked if the (miscalculated) length was >= IFNAMSIZ - 1, that is >= 15, whereas lengths equal to 15 should be accepted. Correct all 3 errors. Link: https://bugs.passt.top/show_bug.cgi?id=61 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Fix size checking of -I interface name	David Gibson	2023-06-28	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Network interface names must fit in a buffer of IFNAMSIZ bytes, including the terminating \0. IFNAMSIZ is 16 on Linux, so interface names can be up to (and including) 15 characters long. We validate this for the -I option, but we have an off by one error. We pass (IFNAMSIZ - 1) as the buffer size to snprintf(), but that buffer size already includes the terminating \0, so this actually truncates the value to 14 characters. The return value returned from snprintf() however, is the number of characters that would have been printed excluding the terminating \0, so by comparing it >= IFNAMSIZ - 1 we are giving an error on names >= 15 characters rather than strictly > 15 characters. Link: https://bugs.passt.top/show_bug.cgi?id=61 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Use correct interface index in NL_SET mode2023_06_27.289301b	David Gibson	2023-06-27	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_addr() and nl_route() take an 'op' selector which affects a number of parameters to the netlink call. Unfortunately when we introduced this option a bug was introduced so that we always use the interface index for the host side, rather than the one for the pasta namespace. Really, the entire interface to nl_addr() and nl_route() is pretty bad: it's tightly coupled with the use cases of its callers. This is a minimal fix which doesn't address that, but also doesn't make it significantly worse. Link: https://bugs.passt.top/show_bug.cgi?id=59 Fixes: 2fe046185634 ("netlink: Add functionality to copy routes from outer namespace") Fixes: e89da3cf03b2 ("netlink: Add functionality to copy addresses from outer namespace") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: include errno in error message2023_06_25.32660ce	Paul Holzinger	2023-06-25	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \|	When the open() or setns() calls fails pasta exits early and prints an error. However it did not include the errno so it was impossible to know why the syscall failed. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Split print to fit 80 columns in pasta_open_ns()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: keep CAP_SYS_PTRACE when required	Paul Holzinger	2023-06-25	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When pasta is started from an existing userns and tries to join the netns from another process it fails to open /proc/$pid/ns/net due the missing CAP_SYS_PTRACE capability in the --netns-only case. A simple reproducer for this. First create a userns: $ unshare -r Then create a new netns inside it and try to join that netns with pasta. $ unshare -n sleep inf & $ pasta --config-net --netns /proc/$!/ns/net Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Accept -a and -g without --config-net in pasta mode	Stefano Brivio	2023-06-25	1	-6/+7
\| \| \| \| \| \| \| \| \| \| \| \| \|	While --no-copy-addrs and --no-copy-routes only make sense with --config-net, and they are implied on -g and -a, respectively, that doesn't mean we should refuse -a or -g without --config-net: they are still relevant for a number of things (including DHCP/DHCPv6/NDP configuration). Reported-by: Gianluca Stivan <me@yawnt.com> Fixes: cc9d16758be6 ("conf, pasta: With --config-net, copy all addresses by default") Fixes: da54641f140e ("conf, pasta: With --config-net, copy all routes by default") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Make -a/--address really imply --no-copy-addrs	Stefano Brivio	2023-06-25	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I wrote it in commit message and man page, but not in conf()... Note that -g/--gateway correctly implies --no-copy-routes already. This fixes Podman's tests: podman networking with pasta(1) - IPv4 address assignment podman networking with pasta(1) - IPv4 default route assignment where we pass -a and -g to assign an address and a default gateway that's compatible with it, but -a doesn't disable the copy of addresses, so we ignore -a, and the default gateway is incompatible with the addresses we copy -- hence no routes in the container. Link: https://github.com/containers/podman/pull/18612 Fixes: cc9d16758be6 ("conf, pasta: With --config-net, copy all addresses by default") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	seccomp: Make seccomp.sh re-entrancy safe	David Gibson	2023-06-25	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	seccomp.sh generates seccomp.h piece by piece using >> directives. This means that if two instances of seccomp.h are run concurrently a corrupted version of seccomp.h will be generated. Amongst other problems this can cause spurious failures on clang-tidy. Alter seccomp.sh to build the output in a temporary file and atomic move it to seccomp.h, so concurrent invocations will still result in valud output. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, log: On -h / --help, print usage to stdout, not stderr	Stefano Brivio	2023-06-23	3	-8/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Erik suggests that this makes it easier to grep for options, and with --help we're anyway printing usage information as expected, not as part of an error report. While at it: on -h, we should exit with 0. Reported-by: Erik Sjölund <erik.sjolund@gmail.com> Link: https://bugs.passt.top/show_bug.cgi?id=52 Link: https://bugs.passt.top/show_bug.cgi?id=53 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: With pasta, don't reset on tap errors, handle write failures	Stefano Brivio	2023-06-23	1	-5/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since commit 0515adceaa8f ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application"), it makes no sense to close and reopen the tap device on error: we don't have access to /dev/net/tun after the initial setup phase. If we hit ENOBUFS while writing (as reported: in one case because the kernel actually ran out of memory, with another case under investigation), or ENOSPC, we're supposed to drop whatever data we were trying to send: there's no room for it. Handle EINTR just like we handled EAGAIN/EWOULDBLOCK: there's no particular reason why sending the same data should fail again. Anything else I can think of would be an unrecoverable error: exit with failure then. While at it, drop a useless cast on the write() call: it takes a const void * anyway. Reported-by: Gianluca Stivan <me@yawnt.com> Reported-by: Chris Kuhn <kuhnchris@kuhnchris.eu> Fixes: 0515adceaa8f ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Fix erroneous check of ip6->gw2023_06_03.429e1a7	David Gibson	2023-06-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	a7359f094898 ("conf: Don't exit if sourced default route has no gateway") was supposed to allow passt/pasta to run even if given a template interface which has no default gateway. However a mistake in the patch means it still requires the gateway, but doesn't require a global address for the guest which we really do need. This is one part (but not the only part) of the problem seen in https://bugs.passt.top/show_bug.cgi?id=50. Reported-by: Justin Jereza <justinjereza@gmail.com> Fixes: a7359f094898 ("conf: Don't exit if sourced default route has no gateway") Link: https://bugs.passt.top/show_bug.cgi?id=50 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/nstool: Fix fd leak in accept() loop	David Gibson	2023-05-23	1	-0/+2
\| \| \| \| \| \| \| \| \|	nstool loops on accept(), but failed to close the accepted socket fds before continuing on. So, with repeated commands it would eventually die with an EMFILE. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/nstool: Provide useful error if given a path that's too long	David Gibson	2023-05-23	1	-8/+14
\| \| \| \| \| \| \| \| \| \| \| \| \|	Normal filesystem paths can be very long (PATH_MAX is around 8k), however Unix domain sockets can only use relatively short paths (UNIX_PATH_MAX is 108 on Linux). Currently nstool will simply truncate paths that are too long, leading to difficult to understand failures. Make such failures clearer, with an explicit error message if given a path that's too long. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt.h: Fix description of pasta_ifi in struct ctx	Stefano Brivio	2023-05-23	1	-1/+1
\| \| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, pasta: With --config-net, copy all addresses by default	Stefano Brivio	2023-05-23	4	-4/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the newly-introduced NL_DUP mode for nl_addr() to copy all the addresses associated to the template interface in the outer namespace, unless --no-copy-addrs (also implied by -a) is given. This option is introduced as deprecated right away: it's not expected to be of any use, but it's helpful to keep it around for a while to debug any suspected issue with this change. This is done mostly for consistency with routes. It might partially cover the issue at: https://bugs.passt.top/show_bug.cgi?id=47 Support multiple addresses per address family for some use cases, but not the originally intended one: we'll still use a single outbound address (unless the routing table specifies different preferred source addresses depending on the destination), regardless of the address used in the target namespace. Link: https://bugs.passt.top/show_bug.cgi?id=47 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Add functionality to copy addresses from outer namespace	Stefano Brivio	2023-05-23	4	-24/+58
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Similarly to what we've just done with routes, support NL_DUP for addresses (currently not exposed): nl_addr() can optionally copy mulitple addresses to the target namespace, by fixing up data from the dump with appropriate flags and interface index, and repeating it back to the kernel on the socket opened in the target namespace. Link-local addresses are not copied: the family is set to AF_UNSPEC, which means the kernel will ignore them. Same for addresses from a mismatching address (pre-4.19 kernels without support for NETLINK_GET_STRICT_CHK). Ignore IFA_LABEL attributes by changing their type to IFA_UNSPEC, because in general they will report mismatching names, and we don't really need to use labels as we already know the interface index. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Don't exit if sourced default route has no gateway	Stefano Brivio	2023-05-23	2	-5/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we use a template interface without a gateway on the default route, we can still offer almost complete functionality, except that, of course, we can't map the gateway address to the outer namespace or host, and that we have no obvious server address or identifier for use in DHCP's siaddr and option 54 (Server identifier, mandatory). Continue, if we have a default route but no default gateway, and imply --no-map-gw and --no-dhcp in that case. NDP responder and DHCPv6 should be able to work as usual because we require a link-local address to be present, and we'll fall back to that. Together with the previous commits implementing an actual copy of routes from the outer namespace, this should finally fix the operation of 'pasta --config-net' for cases where we have a default route on the host, but no default gateway, as it's the case for tap-style routes, including typical Wireguard endpoints. Reported-by: me@yawnt.com Link: https://bugs.passt.top/show_bug.cgi?id=49 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Revert "conf: Adjust netmask on mismatch between IPv4 address/netmask and ↵	Stefano Brivio	2023-05-23	1	-24/+1
\| \| \| \| \| \| \| \| \| \| \| \|	gateway" This reverts commit 7656a6f8888237b9e23d63666e921528b6aaf950: now, by default, we copy all the routes associated to the outbound interface into the routing table of the container, so there's no need for this horrible workaround anymore. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, pasta: With --config-net, copy all routes by default	Stefano Brivio	2023-05-23	4	-3/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the newly-introduced NL_DUP mode for nl_route() to copy all the routes associated to the template interface in the outer namespace, unless --no-copy-routes (also implied by -g) is given. This option is introduced as deprecated right away: it's not expected to be of any use, but it's helpful to keep it around for a while to debug any suspected issue with this change. Otherwise, we can't use default gateways which are not, address-wise, on the same subnet as the container, as reported by Callum. Reported-by: Callum Parsey <callum@neoninteger.au> Link: https://github.com/containers/podman/issues/18539 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: --config-net option is for pasta mode only	Stefano Brivio	2023-05-23	1	-1/+7
\| \| \| \| \| \|	Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Add functionality to copy routes from outer namespace	Stefano Brivio	2023-05-23	4	-22/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of just fetching the default gateway and configuring a single equivalent route in the target namespace, on 'pasta --config-net', it might be desirable in some cases to copy the whole set of routes corresponding to a given output interface. For instance, in: https://github.com/containers/podman/issues/18539 IPv4 Default Route Does Not Propagate to Pasta Containers on Hetzner VPSes configuring the default gateway won't work without a gateway-less route (specifying the output interface only), because the default gateway is, somewhat dubiously, not on the same subnet as the container. This is a similar case to the one covered by commit 7656a6f88882 ("conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway"), and I'm not exactly proud of that workaround. We also have: https://bugs.passt.top/show_bug.cgi?id=49 pasta does not work with tap-style interface for which, eventually, we should be able to configure a gateway-less route in the target namespace. Introduce different operation modes for nl_route(), including a new NL_DUP one, not exposed yet, which simply parrots back to the kernel the route dump for a given interface from the outer namespace, fixing up flags and interface indices on the way, and requesting to add the same routes in the target namespace, on the interface we manage. For n routes we want to duplicate, send n identical netlink requests including the full dump: routes might depend on each other and the kernel processes RTM_NEWROUTE messages sequentially, not atomically, and repeating the full dump naturally resolves dependencies without the need to actually calculate them. I'm not kidding, it actually works pretty well. Link: https://github.com/containers/podman/issues/18539 Link: https://bugs.passt.top/show_bug.cgi?id=49 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	pasta: Improve error handling on failure to join network namespace	Stefano Brivio	2023-05-23	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pasta_wait_for_ns(), open() failing with ENOENT is expected: we're busy-looping until the network namespace appears. But any other failure is not something we're going to recover from: return right away if we don't get either success or ENOENT. Now that pasta_wait_for_ns() can actually fail, handle that in pasta_start_ns() by reporting the issue and exiting. Looping on EPERM, when pasta doesn't actually have the permissions to join a given namespace, isn't exactly a productive thing to do. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Fix comment about response buffer size for nl_req()	Stefano Brivio	2023-05-23	1	-1/+1
\| \| \| \| \| \|	Fixes: fde8004ab0b4 ("netlink: Use 8 KiB * netlink message header size as response buffer") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	isolation: Initially Keep CAP_SETFCAP if running as UID 0 in non-init	Stefano Brivio	2023-05-23	1	-3/+14
\| \| \| \| \| \| \| \| \| \| \|	If pasta spawns a child process while running as UID 0, which is only allowed from a non-init namespace, we need to keep CAP_SETFCAP before pasta_start_ns() is called: otherwise, starting from Linux 5.12, we won't be able to update /proc/self/uid_map with the intended mapping (from 0 to 0). See user_namespaces(7). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	pasta: Detach mount namespace, (re)mount procfs before spawning command	Stefano Brivio	2023-05-23	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we want /proc contents to be consistent after pasta spawns a child process in a new PID namespace (only for operation without a pre-existing namespace), we need to mount /proc after the clone(2) call with CLONE_NEWPID, and we enable the child to do that by passing, in the same call, the CLONE_NEWNS flag, as described by pid_namespaces(7). This is not really a remount: in fact, passing MS_REMOUNT to mount(2) would make the call fail. We're in another mount namespace now, so it's a fresh mount that has the effect of hiding the existing one. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util, conf: Add and use ns_is_init() helper	Stefano Brivio	2023-05-23	3	-15/+28
\| \| \| \| \| \| \| \| \| \|	We'll need this in isolate_initial(). While at it, don't rely on BUFSIZ: the earlier issue we had with musl reminded me it's not a magic "everything will fit" value. Size the read buffer to what we actually need from uid_map, and check for the final newline too, because uid_map is organised in lines. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Don't update ip6.addr_seen to ::	David Gibson	2023-05-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we receive packets from the tap side, we update the addr_seen fields to reflect the last known address of the guest or ns. For ip4.addr_seen we, sensibly, only update if the address we've just seen isn't 0 (0.0.0.0). This case can occur during early DHCP transactions. We have no equivalent case for IPv6. We're less likely to hit this, because DHCPv6 uses link-local addresses, however we can see an source address of :: with certain multicast operations. This can bite us if we try to make an incoming connection very early after starting pasta with --config-net: we may have only seen some of those multicast packets, updated addr_seen to :: and not had any "real" packets to update it to a global address. I've seen this with some of the avocado test conversions. In any case, it can never make sense to update addr_seen to ::, so explicitly exclude that case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>