passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	netlink: Use IFA_F_NODAD also while duplicating addresses from the hostHEAD 2024_04_26.d03c4e2 master	Stefano Brivio	6 days	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...not just for a single set address (legacy operation with --no-copy-addrs). I forgot to add this to nl_addr_dup(). Note that we can have two version of flags: the 8-bit ifa_flags in ifaddrmsg, and the newer 32-bit version as IFA_FLAGS attribute, which is given priority if present. Make sure IFA_F_NODAD is set in both. Without this, a Podman user reports, something on the lines of: pasta --config-net -- ping -c1 -6 passt.top would fail as the kernel would start Duplicate Address Detection once we configure the address, which can't really work (and doesn't make sense) in this case. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: For IPv4, IFA_LOCAL is the interface address, not IFA_ADDRESS	Stefano Brivio	6 days	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	See the comment to the unnamed enum in linux/if_addr.h, which currently states: /* * Important comment: * IFA_ADDRESS is prefix address, rather than local interface address. * It makes no difference for normally configured broadcast interfaces, * but for point-to-point IFA_ADDRESS is DESTINATION address, * local address is supplied in IFA_LOCAL attribute. * * [...] */ if we fetch IFA_ADDRESS, and we have a point-to-point link with a peer address configured, we'll source the peer address as "our" address, and refuse to resolve it in arp(). This was reported with pasta and a tun upstream interface configured by OpenVPN in "p2p" topology: the target namespace will have similar addresses and routes as the host, which is fine, and will try to resolve the point-to-point peer address (because it's the default gateway). Given that we configure it as our address (only internally, not visibly in the namespace), we'll fail to resolve that and traffic doesn't go anywhere. Note that this is not the case for IPv6: there, IFA_ADDRESS is the actual, local address of the interface, and IFA_LOCAL is not necessarily present, so the comment in linux/if_addr.h doesn't apply either. Link: https://github.com/containers/podman/issues/22320 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Ignore routes to link-local addresses for selecting interface	David Gibson	2024-04-05	1	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since f919dc7a4b1c ("conf, netlink: Don't require a default route to start"), and since 639fdf06ede ("netlink: Fix selection of template interface") less buggily, we haven't required a default route on the host in order to operate. Instead, if we lack a default route we'll pick an interface with any route, as long as there's only one such interface. If there's more than one, we don't have a good criterion to pick, so we give up with an informational message. Paul Holzinger pointed out that this code considers it ambiguous even if all but one of the interfaces has only routes to link-local addresses (fe80::/10). A route to link-local addresses isn't really useful from pasta's point of view, so ignore them instead. This removes a misleading message in many cases, and a spurious failure in some cases. Suggested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Add helper to return name of address family	David Gibson	2024-04-05	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \|	We have a few places where we want to include the name of the internet protocol version (IPv4 or IPv6) in a message, which we handle with an open-coded ?: expression. This seems like something that might be more widely useful, so make a trivial helper to return the correct string based on the address family. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Adjust interface index inside copied nexthop objects too	Stefano Brivio	2024-04-05	1	-4/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As pasta duplicates host routes into the target namespaces, interface indices might not match, so we go through RTA_OIF attributes and fix them up to match the identifier in the namespace. But RTA_OIF is not the ony attribute specifying interfaces for routes: multipath routes use RTA_MULTIPATH attributes with nexthop objects, which contain in turn interface indices. Fix them up as well. If we don't, and we have at least two host interfaces, and the host interface we use as template isn't the first one (hence the mismatching indices), we'll fail to insert multipath routes with nexthop objects, and ultimately refuse to start as the kernel unexpectedly gives us ENODEV. Link: https://github.com/containers/podman/issues/22192 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Fix selection of template interface	David Gibson	2024-03-20	1	-24/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since f919dc7a4b1c ("conf, netlink: Don't require a default route to start"), if there is only one host interface with routes, we will pick that as the template interface, even if there are no default routes for an IP version. Unfortunately this selection had a serious flaw: in some cases it would 'return' in the middle of an nl_foreach() loop, meaning we wouldn't consume all the netlink responses for our query. This could cause later netlink operations to fail as we read leftover responses from the aborted query. Rewrite the interface detection to avoid this problem. While we're there: * Perform detection of both default and non-default routes in a single pass, avoiding an ugly goto * Give more detail on error and working but unusual paths about the situation (no suitable interface, multiple possible candidates, etc.). Fixes: f919dc7a4b1c ("conf, netlink: Don't require a default route to start") Link: https://bugs.passt.top/show_bug.cgi?id=83 Link: https://github.com/containers/podman/issues/22052 Link: https://bugzilla.redhat.com/show_bug.cgi?id=2270257 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Use info(), not warn() for somewhat expected cases where one IP version has no default routes, or no routes at all] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Fix handling of NLMSG_DONE in nl_route_dup()2024_03_19.d35bcbe	David Gibson	2024-03-19	1	-9/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A recent kernel change 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families") changed netlink behaviour so that the NLMSG_DONE terminating a bunch of responses can go in the same datagram as those responses, rather than in a separate one. Our netlink code is supposed to handle that behaviour, and indeed does so for most cases, using the nl_foreach() macro. However, there was a subtle error in nl_route_dup() which doesn't work with this change. f00b1534 ("netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE") attempted to fix this, but has its own subtle error. The problem arises because nl_route_dup(), unlike other cases doesn't just make a single pass through all the responses to a netlink request. It needs to get all the routes, then make multiple passes through them. We don't really have anywhere to buffer multiple datagrams, so we only support the case where all the routes fit in a single datagram - but we need to fail gracefully when that's not the case. After receiving the first datagram of responses (with nl_next()) we have a first loop scanning them. It needs to exit when either we run out of messages in the datagram (!NLMSG_OK()) or when we get a message indicating the last response (nl_status() <= 0). What we do after the loop depends on which exit case we had. If we saw the last response, we're done, but otherwise we need to receive more datagrams to discard the rest of the responses. We attempt to check for that second case by re-checking NLMSG_OK(nh, status). However in the got-last-response case, we've altered status from the number of remaining bytes to the error code (usually 0). That means NLMSG_OK() now returns false even if it didn't during the loop check. To fix this we need separate variables for the number of bytes left and the final status code. We also checked status after the loop, but this was redundant: we can only exit the loop with NLMSG_OK() == true if status <= 0. Reported-by: Martin Pitt <mpitt@redhat.com> Fixes: f00b153414b1 ("netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE") Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Link: https://github.com/containers/podman/issues/22052 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, netlink: Don't require a default route to start	Stefano Brivio	2024-03-18	1	-3/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There might be isolated testing environments where default routes and global connectivity are not needed, a single interface has all non-loopback addresses and routes, and still passt and pasta are expected to work. In this case, it's pretty obvious what our upstream interface should be, so go ahead and select the only interface with at least one route, disabling DHCP and implying --no-map-gw as the documentation already states. If there are multiple interfaces with routes, though, refuse to start, because at that point it's really not clear what we should do. Reported-by: Martin Pitt <mpitt@redhat.com> Link: https://github.com/containers/podman/issues/21896 Signed-off-by: Stefano brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE	Stefano Brivio	2024-03-18	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Martin reports that, with Fedora Linux kernel version kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families"), pasta doesn't exit once the network namespace is gone. Actually, pasta is completely non-functional, at least with default options, because nl_route_dup(), which duplicates routes from the parent namespace into the target namespace at start-up, is stuck on a second receive operation for RTM_GETROUTE. However, with that commit, the kernel is now able to fit the whole response, including the NLMSG_DONE message, into a single datagram, so no further messages will be received. It turns out that commit 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") accidentally relied on the fact that we would always get at least two datagrams as a response to RTM_GETROUTE. That is, the test to check if we expect another datagram, is based on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, but we'll also expect another datagram if NLMSG_OK on the last message is false. But NLMSG_OK with a zero length is always false. The problem is that we don't distinguish if status is zero because we got a NLMSG_DONE message, or because we processed all the available datagram bytes. Introduce an explicit check on NLMSG_DONE. We should probably refactor this slightly, for example by introducing a special return code from nl_status(), but this is probably the least invasive fix for the issue at hand. Reported-by: Martin Pitt <mpitt@redhat.com> Link: https://github.com/containers/podman/issues/22052 Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Use const rtnh pointer	David Gibson	2024-02-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	6c7623d07 ("netlink: Add support to fetch default gateway from multipath routes") inadvertently introduced a new cppcheck warning for a variable which could be a const pointer but isn't. This occurs with cppcheck-2.13.0-1.fc39.x86_64 in Fedora 39 at least. Fixes: 6c7623d07bbd ("netlink: Add support to fetch default gateway from multipath routes") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Add support to fetch default gateway from multipath routes	Stefano Brivio	2024-02-09	1	-3/+48
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the default route for a given IP version is a multipath one, instead of refusing to start because there's no RTA_GATEWAY attribute in the set returned by the kernel, we can just pick one of the paths. To make this somewhat less arbitrary, pick the path with the highest weight, if weights differ. Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/20927 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Fetch most specific (longest prefix) address in nl_addr_get()2023_12_30.f091893	Stefano Brivio	2023-12-30	1	-5/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This happened in most cases implicitly before commit eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions"): while going through results from netlink, we would only copy an address into the provided return buffer if no address had been picked yet. Because of the insertion logic in the kernel (ipv6_link_dev_addr()), the first returned address would also be the one added last, and, in case of a Linux guest using a DHCPv6 client as well as SLAAC, that would be the address assigned via DHCPv6, because SLAAC happens before the DHCPv6 exchange. The effect of, instead, picking the last returned address (first assigned) is visible when passt or pasta runs nested, given that, by default, they advertise a prefix for SLAAC usage, plus an address via DHCPv6. The first level (L1 guest) would get a /64 address by means of SLAAC, and a /128 address via DHCPv6, the latter matching the address on the host. The second level (L2 guest) would also get two addresses: a /64 via SLAAC (same prefix as the host), and a /128 via DHCPv6, matching the the L1 SLAAC-assigned address, not the one obtained via DHCPv6. That is, none of the L2 addresses would match the address on the host. The whole point of having a DHCPv6 server is to avoid (implicit) NAT when possible, though. Fix this in a more explicit way than the behaviour we initially had: pick the first address among the set of most specific ones, by comparing prefix lengths. Do this for IPv4 and for link-local addresses, too, to match in any case the implementation of the default source address selection. Reported-by: Yalan Zhang <yalzhang@redhat.com> Fixes: eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use 'z' length modifier for size_t/ssize_t conversions	Stefano Brivio	2023-12-02	1	-1/+1
\| \| \| \| \| \| \| \|	Types size_t and ssize_t are not necessarily long, it depends on the architecture. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Sequence numbers are actually 32 bits wide	Stefano Brivio	2023-11-07	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Harmless, as we use sequence numbers monotonically anyway, but now clang-tidy reports: /home/sbrivio/passt/netlink.c:155:7: error: format specifies type 'unsigned short' but the argument has type '__u32' (aka 'unsigned int') [clang-diagnostic-format,-warnings-as-errors] nh->nlmsg_seq, seq); ^ /home/sbrivio/passt/log.h:26:7: note: expanded from macro 'die' err(__VA_ARGS__); \ ^~~~~~~~~~~ /home/sbrivio/passt/log.h:19:34: note: expanded from macro 'err' ^~~~~~~~~~~ Suppressed 222820 warnings (222816 in non-user code, 4 NOLINT). Use -header-filter=.* to display errors from all non-system headers. Use -system-headers to display errors from system headers as well. 1 warning treated as error make: *** [Makefile:255: clang-tidy] Error 1 Fixes: 9d4ab98d538f ("netlink: Add nl_do() helper for simple operations with error checking") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Make many pointers const	David Gibson	2023-10-04	1	-3/+3
\| \| \| \| \| \| \| \| \|	Newer versions of cppcheck (as of 2.12.0, at least) added a warning for pointers which could be declared to point at const data, but aren't. Based on that, make many pointers throughout the codebase const. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Strip RTA_PREFSRC when copying routes to the namespace2023_08_23.a7e4bfb	David Gibson	2023-08-23	1	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Host routes can include a preferred source address (RTA_PREFSRC), which must be one of the host's addresses. However when using pasta with -a the namespace might be given a different address, not on the host. This seems to occur pretty routinely depending on the network configuration systems in place on the host. With --config-net we will try to copy host routes to the namespace. If one of those includes an RTA_PREFSRC, but the namespace doesn't have the host address, this will fail with -EINVAL, causing pasta to fail. Fix this by stripping off RTA_PREFSRC attributes from routes as we copy them to the namespace. This is by no means infallible, bit it should at least handle common cases for the time being. Link: https://bugs.passt.top/show_bug.cgi?id=71 Link: https://github.com/containers/podman/pull/19699#issuecomment-1688769287 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Set IFA_ADDRESS, not just IFA_LOCAL, while adding IPv4 addresses	Stefano Brivio	2023-08-23	1	-0/+1
\| \| \| \| \| \| \| \| \|	Otherwise, we actually configure the address, but it's not usable because no local route is added by the kernel. Link: https://github.com/containers/podman/pull/19699 Fixes: cfe7509e5c16 ("netlink: Use struct in_addr for IPv4 addresses, not bare uint32_t") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Don't propagate host address expiry to the container	David Gibson	2023-08-16	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we copy addresses from the host to the container in nl_addr_dup(), we copy all the address's attributes, including IFA_CACHEINFO, which controls the address's lifetime. If the host address is managed by, for example, DHCP, it will typically have a finite lifetime. When we copy that lifetime to the pasta container, that lifetime will remain, meaning the kernel will eventually remove the address, typically some hours later. The container, however, won't have the DHCP client or whatever was managing and maintaining the address in the host, so it will just lose connectivity. Long term, we may want to monitor host address changes and reflect them to the guest. But for now, we just want to take a snapshot of the host's address and set those in the container permanently. We can accomplish that by stripping off the IFA_CACHEINFO attribute as we copy addresses. Link: https://github.com/containers/podman/issues/19405 Link: https://bugs.passt.top/show_bug.cgi?id=70 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Correctly calculate attribute length for address messages	David Gibson	2023-08-16	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	In nl_addr_get() and nl_addr_dup() we step the attributes attached to each RTM_NEWADDR message with a loop initialised with IFA_RTA() and RTM_PAYLOAD() macros. RTM_PAYLOAD(), however is for RTM_NEWROUTE messages (struct rtmsg), not RTM_NEWADDR messages (struct ifaddrmsg). Consequently it miscalculates the size and means we can skip some attributes. Switch to IFA_PAYLOAD() which we should be using here. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Remove redundant check on nlmsg_type	David Gibson	2023-08-16	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \|	In the loop within nl_addr_dup() we check and skip for any messages that aren't of type RTM_NEWADDR. This is a leftover that was missed in the recent big netlink cleanup. In fact we already check for the message type in the nl_foreach_oftype() macro, so the explicit test is redudant. Remove it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Propagate errors for "dup" operations	David Gibson	2023-08-04	1	-12/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We now detect errors on netlink "set" operations while configuring the pasta namespace with --config-net. However in many cases rather than a simple "set" we use a more complex "dup" function to copy configuration from the host to the namespace. We're not yet properly detecting and reporting netlink errors for that case. Change the "dup" operations to propagate netlink errors to their caller, pasta_ns_conf() and report them there. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Propagate errors for "dump" operations	David Gibson	2023-08-04	1	-4/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if we receive any netlink errors while discovering network configuration from the host, we'll just ignore it and carry on. This might lead to cryptic error messages later on, or even silent misconfiguration. We now have the mechanisms to detect errors from get/dump netlink operations. Propgate these errors up to the callers and report them usefully. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Always process all responses to a netlink request	David Gibson	2023-08-04	1	-24/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A single netlink request can result in multiple response datagrams. We process multiple response datagrams in some circumstances, but there are cases where we exit early and will leave remaining datagrams in the queue. These will be flushed in nl_send() before we send another request. This is confusing, and not what we need to reliably check for errors from netlink operations. So, instead, make sure we always process all the response datagrams whenever we send a request (excepting fatal errors). In most cases this is just a matter of avoiding early exits from nl_foreach loops. nl_route_dup() is a bit trickier, because we need to retain all the routes we're going to try to copy in a single buffer. Here we instead use a secondary buffer to flush any remaining datagrams, and report an error if there are any additional routes in those datagrams . Link: https://bugs.passt.top/show_bug.cgi?id=67 Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Propagate errors for "set" operations	David Gibson	2023-08-04	1	-9/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if anything goes wrong while we're configuring the namespace network with --config-net, we'll just ignore it and carry on. This might lead to a silently unconfigured or misconfigured namespace environment. For simple "set" operations based on nl_do() we can now detect failures reported via netlink. Propagate those errors up to pasta_ns_conf() and report them usefully. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Add nl_foreach_oftype to filter response message types	David Gibson	2023-08-04	1	-15/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In most cases where processing response messages, we expect only one type of message (excepting NLMSG_DONE or NLMSG_ERROR), and so we need a test and continue to skip anything else. Add a helper macro to do this. This also fixes a bug in nl_get_ext_if() where we didn't have such a test and if we got a message other than RTM_NEWROUTE we would have parsed its contents as nonsense. Also add a warning message if we get such an unexpected message type, which could be useful for debugging if we ever hit it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_req() to allow processing multiple response datagrams	David Gibson	2023-08-04	1	-68/+113
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently nl_req() sends the request, and receives a single response datagram which we then process. However, a single request can result in multiple response datagrams. That happens nearly all the time for DUMP requests, where the 'DONE' message usually comes in a second datagram after the NEW{LINK\|ADDR\|ROUTE} messages. It can also happen if there are just too many objects to dump in a single datagram. Allow our netlink code to process multiple response datagrams by splitting nl_req() into three different helpers: nl_send() just sends a request, without getting a response. nl_status() checks a single message to see if it indicates the end of the reponses for our request. nl_next() moves onto the next response message, whether it's in a datagram we already received or we need to recv() a new one. We also add a 'for'-style macro to use these to step through every response message to a request across multiple datagrams. While we're at it, be more thourough with checking that our sequence numbers are in sync. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Clearer reasoning about the netlink response buffer size	David Gibson	2023-08-04	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we set NLBUFSIZ large enough for 8192 netlink headers (128kiB in total), and reference netlink(7). However netlink(7) says nothing about reponse buffer sizes, and the documents which do reference 8192 bytes not 8192 headers. Update NLBUFSIZ to 64kiB with a more detailed rationale. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Add nl_do() helper for simple operations with error checking	David Gibson	2023-08-04	1	-12/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	So far we never checked for errors reported on netlink operations via NLMSG_ERROR messages. This has led to several subtle and tricky to debug situations which would have been obvious if we knew that certain netlink operations had failed. Introduce a nl_do() helper that performs netlink "do" operations (that is making a single change without retreiving complex information) with much more thorough error checking. As well as returning an error code if we get an NLMSG_ERROR message, we also check for unexpected behaviour in several places. That way if we've made a mistake in our assumptions about how netlink works it should result in a clear error rather than some subtle misbehaviour. We update those calls to nl_req() that can use the new wrapper to do so. We will extend those to better handle errors in future. We don't touch non-"do" operations for now, those are a bit trickier. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Fill in netlink header fields from nl_req()	David Gibson	2023-08-04	1	-84/+42
\| \| \| \| \| \| \| \| \| \|	Currently netlink functions need to fill in a full netlink header, as well as a payload then call nl_req() to submit that to the kernel. It makes things a bit terser if we just give the relevant header fields as parameters to nl_req() and have it complete the header. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Treat send() or recv() errors as fatal	David Gibson	2023-08-04	1	-19/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Errors on send() or recv() calls on a netlink socket don't indicate errors with the netlink operations we're attempting, but rather that something's gone wrong with the mechanics of netlink itself. We don't really expect this to ever happen, and if it does, it's not clear what we could to to recover. So, treat errors from these calls as fatal, rather than returning the error up the stack. This makes handling failures in the callers of nl_req() simpler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Start sequence number from 1 instead of 0	David Gibson	2023-08-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Netlink messages have a sequence number that's used to match requests to responses. It mostly doesn't matter what it is as long as it monotonically increases, so we just use a global counter which we advance with each request. However, we start this counter at 0, so our very first request has sequence number 0, which is usually reserved for asynchronous messages from the kernel which aren't in response to a specific request. Since we don't (for now) use such async messages, this doesn't really matter, but it's not good practce. So start the sequence at 1 instead. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Make nl_*_dup() use a separate datagram for each request	David Gibson	2023-08-04	1	-23/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_req() is designed to handle a single netlink request message: it only receives a single reply datagram for the request, and only waits for a single NLMSG_DONE or NLMSG_ERROR message at the beginning to clear out things from previous requests. However, in both nl_addr_dup() and nl_route_dup() we can send multiple request messages as a single datagram, with a single nl_req() call. This can easily mean that the replies nl_req() collects get out of sync with requests. We only get away with this because after we call these functions we don't make any netlink calls where we need to parse the replies. This is fragile, so alter nl_*_dup() to make an nl_req() call for each address it is adding in the target namespace. For nl_route_dup() this fixes an additional minor problem: because routes can have dependencies, some of the route add requests might fail on the first attempt, so we need to repeat the requests a number of times. When we did that, we weren't updating the sequence number on each new attempt. This works, but not updating the sequence number for each new request isn't ideal. Now that we're making the requests one at a time, it's easier to make sure we update the sequence number each time. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Explicitly pass netlink sockets to operations	David Gibson	2023-08-04	1	-39/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	All the netlink operations currently implicitly use one of the two global netlink sockets, sometimes depending on an 'ns' parameter. Change them all to explicitly take the socket to use (or two sockets to use in the case of the *_dup() functions). As well as making these functions strictly more general, it makes the callers easier to follow because we're passing a socket variable with a name rather than an unexplained '0' or '1' for the ns parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Use struct in_addr for IPv4 addresses, not bare uint32_t	David Gibson	2023-08-04	1	-6/+6
\| \| \| \| \| \| \| \|	This improves consistency with IPv6 and makes it harder to misuse these as some other sort of value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_route() into separate operation functions	David Gibson	2023-08-04	1	-89/+148
\| \| \| \| \| \| \| \| \| \|	nl_route() can perform 3 quite different operations based on the 'op' parameter. Split this into separate functions for each one. This requires more lines of code, but makes the internal logic of each operation much easier to follow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_addr() into separate operation functions	David Gibson	2023-08-04	1	-93/+139
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_addr() can perform three quite different operations based on the 'op' parameter, each of which uses a different subset of the parameters. Split them up into a function for each operation. This does use more lines of code, but the overlap wasn't that great, and the separated logic is much easier to follow. It's also clearer in the callers what we expect the netlink operations to do, and what information it uses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting fixes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split up functionality of nl_link()	David Gibson	2023-08-04	1	-59/+82
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_link() performs a number of functions: it can bring links up, set MAC address and MTU and also retrieve the existing MAC. This makes for a small number of lines of code, but high conceptual complexity: it's quite hard to follow what's going on both in nl_link() itself and it's also not very obvious which function its callers are intending to use. Clarify this, by splitting nl_link() into nl_link_up(), nl_link_set_mac(), and nl_link_get_mac(). The first brings up a link, optionally setting the MTU, the others get or set the MAC address. This fixes an arguable bug in pasta_ns_conf(): it looks as though that was intended to retrieve the guest MAC whether or not c->pasta_conf_ns is set. However, it only actually does so in the !c->pasta_conf_ns case: the fact that we set up==1 means we would only ever set, never get, the MAC in the nl_link() call in the other path. We get away with this because the MAC will quickly be discovered once we receive packets on the tap interface. Still, it's neater to always get the MAC address here. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Use correct interface index in NL_SET mode2023_06_27.289301b	David Gibson	2023-06-27	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nl_addr() and nl_route() take an 'op' selector which affects a number of parameters to the netlink call. Unfortunately when we introduced this option a bug was introduced so that we always use the interface index for the host side, rather than the one for the pasta namespace. Really, the entire interface to nl_addr() and nl_route() is pretty bad: it's tightly coupled with the use cases of its callers. This is a minimal fix which doesn't address that, but also doesn't make it significantly worse. Link: https://bugs.passt.top/show_bug.cgi?id=59 Fixes: 2fe046185634 ("netlink: Add functionality to copy routes from outer namespace") Fixes: e89da3cf03b2 ("netlink: Add functionality to copy addresses from outer namespace") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Add functionality to copy addresses from outer namespace	Stefano Brivio	2023-05-23	1	-15/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Similarly to what we've just done with routes, support NL_DUP for addresses (currently not exposed): nl_addr() can optionally copy mulitple addresses to the target namespace, by fixing up data from the dump with appropriate flags and interface index, and repeating it back to the kernel on the socket opened in the target namespace. Link-local addresses are not copied: the family is set to AF_UNSPEC, which means the kernel will ignore them. Same for addresses from a mismatching address (pre-4.19 kernels without support for NETLINK_GET_STRICT_CHK). Ignore IFA_LABEL attributes by changing their type to IFA_UNSPEC, because in general they will report mismatching names, and we don't really need to use labels as we already know the interface index. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Add functionality to copy routes from outer namespace	Stefano Brivio	2023-05-23	1	-17/+54
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of just fetching the default gateway and configuring a single equivalent route in the target namespace, on 'pasta --config-net', it might be desirable in some cases to copy the whole set of routes corresponding to a given output interface. For instance, in: https://github.com/containers/podman/issues/18539 IPv4 Default Route Does Not Propagate to Pasta Containers on Hetzner VPSes configuring the default gateway won't work without a gateway-less route (specifying the output interface only), because the default gateway is, somewhat dubiously, not on the same subnet as the container. This is a similar case to the one covered by commit 7656a6f88882 ("conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway"), and I'm not exactly proud of that workaround. We also have: https://bugs.passt.top/show_bug.cgi?id=49 pasta does not work with tap-style interface for which, eventually, we should be able to configure a gateway-less route in the target namespace. Introduce different operation modes for nl_route(), including a new NL_DUP one, not exposed yet, which simply parrots back to the kernel the route dump for a given interface from the outer namespace, fixing up flags and interface indices on the way, and requesting to add the same routes in the target namespace, on the interface we manage. For n routes we want to duplicate, send n identical netlink requests including the full dump: routes might depend on each other and the kernel processes RTM_NEWROUTE messages sequentially, not atomically, and repeating the full dump naturally resolves dependencies without the need to actually calculate them. I'm not kidding, it actually works pretty well. Link: https://github.com/containers/podman/issues/18539 Link: https://bugs.passt.top/show_bug.cgi?id=49 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Fix comment about response buffer size for nl_req()	Stefano Brivio	2023-05-23	1	-1/+1
\| \| \| \| \| \|	Fixes: fde8004ab0b4 ("netlink: Use 8 KiB * netlink message header size as response buffer") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	passt: Relicense to GPL 2.0, or any later version	Stefano Brivio	2023-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Work around weird false positives with cppcheck-2.9.1	David Gibson	2023-03-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 89e38f55 "treewide: Fix header includes to build with musl" added extra #includes to work with musl. Unfortunately with the cppcheck version I'm using (cppcheck-2.9-1.fc37.x86_64 in Fedora 37) this causes weird false positives: specifically cppcheck seems to hit a #error in <bits/unistd.h> complaining about including it directly instead of via <unistd.h> (which is not something we're doing). I have no idea why that would be happening; but I'm guessing it has to be a bug in the cpp implementation in that cppcheck version. In any case, it's possible to work around this by moving the include of <unistd.h> before the include of <signal.h>. So, do that. Fixes: 89e38f55405d ("treewide: Fix header includes to build with musl") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Fix header includes to build with musl	Chris Kuhn	2023-03-09	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Roughly inspired from a patch by Chris Kuhn: fix up includes so that we can build against musl: glibc is more lenient as headers generally include a larger amount of other headers. Compared to the original patch, I only included what was needed directly in C files, instead of adding blanket includes in local header files. It's a bit more involved, but more consistent with the current (not ideal) situation. Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Use 8 KiB * netlink message header size as response buffer	Stefano Brivio	2023-03-09	1	-6/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...instead of BUFSIZ. On musl, BUFSIZ is 1024, so we'll typically truncate the response to the request we send in nl_link(). It's usually 8192 or more with glibc. There doesn't seem to be any macro defining the rtnetlink maximum message size, and iproute2 just hardcodes 1024 * 1024 for the receive buffer, but the example in netlink(7) makes somewhat sense, looking at the kernel implementation. It's not very clean, but we're very unlikely to hit that limit, and if we do, we'll find out painlessly, because NLA_OK() will tell us right away. Reported-by: Chris Kuhn <kuhnchris+passt@kuhnchris.eu> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	convert all remaining err() followed by exit() to die()	Laine Stump	2023-02-16	1	-2/+1
\| \| \| \| \| \| \| \|	This actually leaves us with 0 uses of err(), but someone could want to use it in the future, so we may as well leave it around. Signed-off-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: do not leak netlink sock into child	Paul Holzinger	2023-02-12	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When spawning a child command with pasta command... pasta should not leak fds that it opened. Only the fds that were already open should be given to the child. Run `pasta --config-net -- ls -l /proc/self/fd` from a terminal where only stdin/out/err are open. The fd 3 was opend by ls to read the /proc/self/fd dir. But fd 5 is the netlink socket that was opend in pasta. To prevent such a leak we will open the socket with SOCK_CLOEXEC. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()	Stefano Brivio	2022-10-15	1	-19/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Even if CAP_NET_BIND_SERVICE is granted, we'll lose the capability in the target user namespace as we isolate the process, which means we're unable to bind to low ports at that point. Bind inbound ports, and only those, before isolate_user(). Keep the handling of outbound ports (for pasta mode only) after the setup of the namespace, because that's where we'll bind them. To this end, initialise the netlink socket for the init namespace before isolate_user() as well, as we actually need to know the addresses of the upstream interface before binding ports, in case they're not explicitly passed by the user. As we now call nl_sock_init() twice, checking its return code from conf() twice looks a bit heavy: make it exit(), instead, as we can't do much if we don't have netlink sockets. While at it: - move the v4_only && v6_only options check just after the first option processing loop, as this is more strictly related to option parsing proper - update the man page, explaining that CAP_NET_BIND_SERVICE is not the preferred way to bind ports, because passt and pasta can be abused to allow other processes to make effective usage of it. Add a note about the recommended sysctl instead - simplify nl_sock_init_do() now that it's called once for each case Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Disable duplicate address detection for configured IPv6 address	Stefano Brivio	2022-10-15	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With default options, when we pass --config-net, the IPv6 address is actually going to be recycled from the init namespace, so it is in fact duplicated, but duplicate address detection has no way to find out. With a different configured address, that's not the case, but anyway duplicate address detection will be unable to see this. In both cases, we're wasting time for nothing. Pass the IFA_F_NODAD flag as we configure globally scoped IPv6 addresses via netlink. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Move logging functions to a new file, log.c	Stefano Brivio	2022-10-14	1	-0/+1
\| \| \| \| \| \| \| \|	Logging to file is going to add some further complexity that we don't want to squeeze into util.c. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>