passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	test: Reconfigure IPv6 address after changing MTU	David Gibson	2024-08-21	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the TCP throughput tests, we adjust the guest's MTU in order to test various packet sizes. Some of those are below 1280 which causes IPv6 to be deconfigured on the guest interface. When we increase it above 1280 again, IPv6 is re-enabled and we get an address in the right prefix with NDP, but we don't get exactly the expected address back - that's only communicated with --config-net or DHCPv6. With changes to how we handle NAT this can cause some of the IPv6 tests to fail, because they don't use the address that passt/pasta expects, and the guest doesn't initiate any traffic which allows us to learn what the new address is. Work around this by re-invoking dhclient -6 between adjusting the MTU and running IPv6 test cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, fwd: Split notion of gateway/router from guest-visible host address	David Gibson	2024-08-21	5	-42/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default gateway. We use this for two quite distinct things: advertising the gateway that the guest should use (via DHCP, NDP and/or --config-net) and for a limited form of NAT. So that the guest can access services on the host, we map the gateway address within the guest to the loopback address on the host. Using the gateway address for this isn't necessarily the best choice for this purpose, certainly not for all circumstances. So, start off by splitting the notion of these into two different values: @guest_gw which is the gateway address the guest should use and @nat_host_loopback, which is the guest visible address to remap to the host's loopback. Usually nat_host_loopback will have the same value as guest_gw. However when --no-map-gw is specified we leave them unspecified instead. This means when we use nat_host_loopback, we don't need to separately check c->no_map_gw to see if it's relevant. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Don't take "our" MAC address from the host	David Gibson	2024-08-21	3	-35/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When sending frames to the guest over the tap link, we need a source MAC address. Currently we take that from the MAC address of the main interface on the host, but that doesn't actually make much sense: * We can't preserve the real MAC address of packets from anywhere external so there's no transparency case here * In fact, it's confusingly different from how we handle IP addresses: whereas we give the guest the same IP as the host, we're making the host's MAC the one MAC that the guest can't use for itself. * We already need a fallback case if the host doesn't have an Ethernet like MAC (e.g. if it's connected via a point to point interface, such as a wireguard VPN). Change to just just use an arbitrary fixed MAC address - I've picked 9a:55:9a:55:9a:55. It's simpler and has the small advantage of making the fact that passt/pasta is in use typically obvious from guest side packet dumps. This can still, of course, be overridden with the -M option. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Split notion of "our tap address" from gateway for IPv4	David Gibson	2024-08-21	3	-7/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ip4.gw conflates 3 conceptually different things, which (for now) have the same value: 1. The router/gateway address as seen by the guest 2. An address to NAT to the host with --no-map-gw isn't specified 3. An address to use as source when nothing else makes sense Case 3 occurs in two situations: a) for our DHCP responses - since they come from passt internally there's no naturally meaningful address for them to come from b) for forwarded connections coming from an address that isn't guest accessible (localhost or the guest's own address). (b) occurs even with --no-map-gw, and the expected behaviour of forwarding local connections requires it. For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the same value as ip6.gw). For future flexibility we may want to make this "address of last resort" different from the gateway address, so split them logically for IPv4 as well. Specifically, add a new ip4.our_tap_addr field for the address with this role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always get a link-local address, we might not be able to get a (non 0.0.0.0) address here (e.g. if the host is disconnected or only has a point to point link with no gateway address). In that case we have to disable forwarding of inbound connections with guest-inaccessible source addresses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Helpers to clarify what host addresses aren't guest accessible	David Gibson	2024-08-21	1	-11/+87
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We usually avoid NAT, but in a few cases we need to apply address translations. For inbound connections that happens for addresses which make sense to the host but are either inaccessible, or mean a different location from the guest's point of view. Add some helper functions to determine such addresses, and use them in fwd_nat_from_host(). In doing so clarify some of the reasons for the logic. We'll also have further use for these helpers in future. While we're there fix one unneccessary inconsistency between IPv4 and IPv6. We always translated the guest's observed address, but for IPv4 we didn't translate the guest's assigned address, whereas for IPv6 we did. Change this to translate both in all cases for consistency. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Initialise our_tap_ll to ip6.gw when suitable	David Gibson	2024-08-21	4	-12/+6
\| \| \| \| \| \| \| \| \|	In every place we use our_tap_ll, we only use it as a fallback if the IPv6 gateway address is not link-local. We can avoid that conditional at use time by doing it at initialisation of our_tap_ll instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Clarify which addresses in ip[46]_ctx are meaningful where	David Gibson	2024-08-21	1	-4/+10
\| \| \| \| \| \| \| \| \|	Some are guest visible addresses and may not be valid on the host, others are host visible addresses and may not be valid on the guest. Rearrange and comment the ip[46]_ctx definitions to make it clearer which is which. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Change misleading 'addr_ll' name	David Gibson	2024-08-21	5	-8/+9
\| \| \| \| \| \| \| \| \|	c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the guest, but the former is an address for our use on the tap link. Rename it accordingly, to 'our_tap_ll'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Correct sock_l4() binding for link local addresses	David Gibson	2024-08-21	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When binding an IPv6 socket in sock_l4() we need to supply a scope id if the address is link-local. We check for this by comparing the given address to c->ip6.addr_ll. This is correct only by accident: while c->ip6.addr_ll is typically set to the host interface's link local address, the actual purpose of it is to provide a link local address for passt's private use on the tap interface. Instead set the scope id for any link-local address we're binding to. We're going to need something and this is what makes sense for sockets on the host. It doesn't make sense for PIF_SPLICE sockets, but those should always have loopback, not link-local addresses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Remove incorrect initialisation of addr_ll_seen	David Gibson	2024-08-21	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Despite the names, addr_ll_seen does not relate to addr_ll the same way addr_seen relates to addr. addr_ll_seen is an observed address from the guest, whereas addr_ll is our link-local address for use on the tap link when we can't use an external endpoint address. It's used both for passt provided services (DHCPv6, NDP) and in some cases for connections from addresses the guest can't access. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Treat --dns addresses as guest visible addresses	David Gibson	2024-08-21	2	-50/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Although it's not 100% explicit in the man page, addresses given to the --dns option are intended to be addresses as seen by the guest. This differs from addresses taken from the host's /etc/resolv.conf, which must be translated to guest accessible versions in some cases. Our implementation is currently inconsistent on this: when using --dns-forward, you must usually also give --dns with the matching address, which is meaningful only in the guest's address view. However if you give --dns with a loopback addres, it will be translated like a host view address. Move the remapping logic for DNS addresses out of add_dns4() and add_dns6() into add_dns_resolv() so that it is only applied for host nameserver addresses, not for nameservers given explicitly with --dns. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Correct setting of dns_match address in add_dns6()	David Gibson	2024-08-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	add_dns6() (but not add_dns4()) has a bug setting dns_match: it sets it to the given address, rather than the gateway address. This is doubly wrong: - We've just established the given address is a host loopback address the guest can't access - We've just set ip6.dns[] to tell the guest to use the gateway address, so it won't use the dns_match address we're setting Correct this to use the gateway address, like IPv4. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Move adding of a nameserver from resolv.conf into subfunction	David Gibson	2024-08-21	1	-7/+26
\| \| \| \| \| \| \| \| \|	get_dns() is already quite deeply nested, and future changes I have in mind will add more complexity. Prepare for this by splitting out the adding of a single nameserver to the configuration into its own function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Move DNS array bounds checks into add_dns[46]	David Gibson	2024-08-21	1	-8/+10
\| \| \| \| \| \| \| \| \| \| \| \| \|	Every time we call add_dns[46] we need to first check if there's space in the c->ip[46].dns array for the new entry. We might as well make that check in add_dns[46]() itself. In fact it looks like the calls in get_dns() had an off by one error, not allowing the last entry of the array to be filled. So, that bug is also fixed by the change. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: More accurately count entries added in get_dns()	David Gibson	2024-08-21	1	-8/+3
\| \| \| \| \| \| \| \| \| \|	get_dns() counts the number of guest DNS servers it adds, and gives an error if it couldn't add any. However, this count ignores the fact that add_dns[46]() may in some cases not add an entry. Use the array indices we're already tracking to get an accurate count. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Use array indices rather than pointers for DNS array slots	David Gibson	2024-08-21	1	-32/+41
\| \| \| \| \| \| \| \| \| \| \| \|	Currently add_dns[46]() take a somewhat awkward double pointer to the entry in the c->ip[46].dns array to update. It turns out to be easier to work with indices into that array instead. This diff does add some lines, but it's comments, and will allow some future code reductions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use struct assignment instead of memcpy() for IP addresses	David Gibson	2024-08-21	2	-10/+12
\| \| \| \| \| \| \| \| \| \| \|	We rely on C11 already, so we can use clearer and more type-checkable struct assignment instead of mempcy() for copying IP addresses around. This exposes some "pointer could be const" warnings from cppcheck, so address those too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Rename MAC address fields for clarity	David Gibson	2024-08-21	8	-26/+28
\| \| \| \| \| \| \| \| \| \| \|	c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Helper for formatting MAC addresses	David Gibson	2024-08-21	4	-7/+27
\| \| \| \| \| \| \| \| \|	There are a couple of places where we somewhat messily open code formatting an Ethernet like MAC address for display. Add an eth_ntop() helper for this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use "our address" instead of "forwarding address"	David Gibson	2024-08-21	7	-105/+106
\| \| \| \| \| \| \| \| \| \| \| \|	The term "forwarding address" to indicate the local-to-passt address was well-intentioned, but ends up being kinda confusing. As discussed on a recent call, let's try "our" instead. (While we're there correct an error in flow_initiate_af()s comments where we referred to parameters by the wrong name). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Fix typo in function comment for nl_addr_set()	Stefano Brivio	2024-08-18	1	-1/+1
\| \| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	pasta: Disable neighbour solicitations on device up to prevent DAD	Stefano Brivio	2024-08-18	1	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As soon as we the kernel notifier for IPv6 address configuration (addrconf_notify()) sees that we bring the target interface up (NETDEV_UP), it will schedule duplicate address detection, so, by itself, setting the nodad flag later is useless, because that won't stop a detection that's already in progress. However, if we disable neighbour solicitations with IFF_NOARP (which is a misnomer for IPv6 interfaces, but there's no possibility of mixing things up), the notifier will not trigger DAD, because it can't be done, of course, without neighbour solicitations. Set IFF_NOARP as we bring up the device, and drop it after we had a chance to set the nodad attribute on the link. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink, pasta: Fetch link-local address from namespace interface once it's up	Stefano Brivio	2024-08-18	3	-0/+55
\| \| \| \| \| \| \| \| \| \|	As soon as we bring up the interface, the Linux kernel will set up a link-local address for it, so we can fetch it and start using right away, if we need a link-local address to communicate to the container before we see any traffic coming from it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink, pasta: Disable DAD for link-local addresses on namespace interface	Stefano Brivio	2024-08-18	3	-0/+64
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It makes no sense for a container or a guest to try and perform duplicate address detection for their link-local address, as we'll anyway not relay neighbour solicitations with an unspecified source address. While they perform duplicate address detection, the link-local address is not usable, which prevents us from bringing up especially containers and communicate with them right away via IPv6. This is not enough to prevent DAD and reach the container right away: we'll need a couple more patches. As we send NLM_F_REPLACE requests right away, while we still have to read out other addresses on the same socket, we can't use nl_do(): keep track of the last sequence we sent (last address we changed), and deal with the answers to those NLM_F_REPLACE requests in a separate loop, later. Link: https://github.com/containers/podman/pull/23561#discussion_r1711639663 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink, pasta: Turn nl_link_up() into a generic function to set link flags	Stefano Brivio	2024-08-15	3	-7/+11
\| \| \| \| \| \| \|	In the next patches, we'll reuse it to set flags other than IFF_UP. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink, pasta: Split MTU setting functionality out of nl_link_up()	Stefano Brivio	2024-08-15	3	-13/+32
\| \| \| \| \| \| \| \| \|	As we'll use nl_link_up() for more than just bringing up devices, it will become awkward to carry empty MTU values around whenever we call it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Fix typo in function comment for nl_addr_get()	Stefano Brivio	2024-08-15	1	-1/+1
\| \| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Speed up by cutting on eye candy and performance test duration	Stefano Brivio	2024-08-15	8	-54/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have a number of delays when we switch to new layouts that were added to make the tests visually easier to follow, together with blinking status bars. Shorten the delays and avoid blinking the status bar if $FAST is set to 1 (no demo mode). Shorten delays in busy loops to 10ms, instead of 100ms, and skip the one-second fixed delay when we wait for the status of a command. Cut the duration of throughput and latency tests to one second, down from ten. Somewhat surprisingly, the results we get are rather consistent, and not significantly different from what we'd get with 10 seconds. This, together with Podman's commit 20f3e8909e3a ("test/system: pasta_test_do add explicit port check"), cuts the time needed on my setup for full test run from approximately 37 minutes to...: $ time ./run [exited] PASS: 165, FAIL: 0 Log at /home/sbrivio/passt/test/test_logs/test.log real 15m34.253s user 0m0.011s sys 0m0.011s Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>
*	flow: Don't crash if guest attempts to connect to port 02024_08_14.61c0b0d	David Gibson	2024-08-14	1	-8/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Using a zero port on TCP or UDP is dubious, and we can't really deal with forwarding such a flow within the constraints of the socket API. Hence we ASSERT()ed that we had non-zero ports in flow_hash(). The intention was to make sure that the protocol code sanitizes such ports before completing a flow entry. Unfortunately, flow_hash() is also called on new packets to see if they have an existing flow, so the unsanitized guest packet can crash passt with the assert. Correct this by moving the assert from flow_hash() to flow_sidx_hash() which is only used on entries already in the table, not on unsanitized data. Reported-by: Matt Hamilton <matt@thmail.io> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Don't ignore -t and -u options after -D	David Gibson	2024-08-14	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	f6d5a5239264 moved handling of -D into a later loop. However as a side effect it moved this from a switch block to an if block. I left a couple of 'break' statements that don't make sense in the new context. They should be 'continue' so that we go onto the next option, rather than leaving the loop entirely. Fixes: f6d5a5239264 ("conf: Delay handling -D option until after addresses are configured") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ndp.c: Turn NDP responder into more declarative implementation	AbdAlRahman Gad	2024-08-13	3	-79/+246
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Add structs for NA, RA, NS, MTU, prefix info, option header, link-layer address, RDNSS, DNSSL and link-layer for RA message. - Turn NA message from purely imperative, going byte by byte, to declarative by filling it's struct. - Turn part of RA message into declarative. - Move packet_add() to be before the call of ndp() in tap6_handler() if the protocol of the packet is ICMPv6. - Add a pool of packets as an additional parameter to ndp(). - Check the size of NS packet with packet_get() before sending an NA packet. - Add documentation for the structs. - Add an enum for NDP option types. Link: https://bugs.passt.top/show_bug.cgi?id=21 Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com> [sbrivio: Minor coding style fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Delay handling -D option until after addresses are configured	David Gibson	2024-08-12	1	-39/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	add_dns[46]() rely on the gateway address and c->no_map_gw being already initialised, in order to properly handle DNS servers which need NAT to be accessed from the guest. Usually these are called from get_dns() which is well after the addresses are configured, so that's fine. However, they can also be called earlier if an explicit -D command line option is given. In this case no_map_gw and/or c->ip[46].gw may not get be initialised properly, leading to this doing the wrong thing. Luckily we already have a second pass of option parsing for things which need addresses to already be configured. Move handling of -D to there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Correct inaccurate comments on ip[46]_ctx::addr	David Gibson	2024-08-12	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	These fields are described as being an address for an external, routable interface. That's not necessarily the case when using -a. But, more importantly, saying where the value comes from is not as useful as what it's used for. The real purpose of this field is as the address which we assign to the guest via DHCP or --config-net. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Don't prefix message with timestamp on --debug if it's a continuation	Stefano Brivio	2024-08-12	3	-18/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we prefix the second part of messages printed through logmsg_perror() by the timestamp, on debug, we'll have two timestamps and a weird separator in the result, such as this beauty: 0.0013: Failed to clone process with detached namespaces0.0013: : Operation not permitted Add a parameter to logmsg() and vlogmsg() which indicates a message continuation. If that's set, don't print the timestamp in vlogmsg(). Link: https://github.com/moby/moby/issues/48257#issuecomment-2282875092 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Stop parsing options at first non-option argument	Stefano Brivio	2024-08-08	2	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Given that pasta supports specifying a command to be executed on the command line, even without the usual -- separator as long as there's no ambiguity, we shouldn't eat up options that are not meant for us. Paul reports, for instance, that with: pasta --config-net ip -6 route -6 is taken by pasta to mean --ipv6-only, and we execute 'ip route'. That's because getopt_long(), by default, shuffles the argument list to shift non-option arguments at the end. Avoid that by adding '+' at the beginning of 'optstring'. Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	passt, util: Close any open file that the parent might have leaked	Stefano Brivio	2024-08-08	6	-7/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a parent accidentally or due to implementation reasons leaks any open file, we don't want to have access to them, except for the file passed via --fd, if any. This is the case for Podman when Podman's parent leaks files into Podman: it's not practical for Podman to close unrelated files before starting pasta, as reported by Paul. Use close_range(2) to close all open files except for standard streams and the one from --fd. Given that parts of conf() depend on other files to be already opened, such as the epoll file descriptor, we can't easily defer this to a more convenient point, where --fd was already parsed. Introduce a minimal, duplicate version of --fd parsing to keep this simple. As we need to check that the passed --fd option doesn't exceed INT_MAX, because we'll parse it with strtol() but file descriptor indices are signed ints (regardless of the arguments close_range() take), extend the existing check in the actual --fd parsing in conf(), also rejecting file descriptors numbers that match standard streams, while at it. Suggested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>
*	nstool: Propagate SIGTERM to processes executed in the namespace	David Gibson	2024-08-07	1	-2/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Particularly in shell it's sometimes natural to save the pid from a process run and later kill it. If doing this with nstool exec, however, it will kill nstool itself, not the program it is running, which isn't usually what you want or expect. Address this by having nstool propagate SIGTERM to its child process. It may make sense to propagate some other signals, but some introduce extra complications, so we'll worry about them when and if it seems useful. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	nstool: Fix some trivial typos	David Gibson	2024-08-07	1	-2/+2
\| \| \| \| \|	Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Avoid duplicate calls to logtime()	David Gibson	2024-08-07	1	-9/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We use logtime() to get a timestamp for the log in two places: - in vlogmsg(), which is used only for debug_print messages - in logfile_write() which is only used messages to the log file These cases are mutually exclusive, so we don't ever print the same message with different timestamps, but that's not particularly obvious to see. It's possible future tweaks to logging logic could mean we log to two different places with different timestamps, which would be confusing. Refactor to have a single logtime() call in vlogmsg() and use it for all the places we need it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Handle errors from clock_gettime()	David Gibson	2024-08-07	1	-15/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	clock_gettime() can, theoretically, fail, although it probably won't until 2038 on old 32-bit systems. Still, it's possible someone could run with a wildly out of sync clock, or new errors could be added, or it could fail due to a bug in libc or the kernel. We don't handle this well. In the debug_print case in vlogmsg we'll just ignore the failure, and print a timestamp based on uninitialised garbage. In logfile_write() we exit early and won't log anything at all, which seems like a good way to make an already weird situation undebuggable. Add some helpers to instead handle this by using "<error>" in place of a timestamp if something goes wrong with clock_gettime(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Correct formatting of timestamps	David Gibson	2024-08-07	1	-12/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	logtime_fmt_and_arg() is a rather odd macro, producing both a format string and an argument, which can only be used in quite specific printf() like formulations. It also has a significant bug: it tries to display 4 digits after the decimal point (so down to tenths of milliseconds) using %04i. But the field width in printf() is always a minimum not maximum field width, so this will not truncate the given value, but will redisplay the entire tenth-of-milliseconds difference again after the decimal point. Replace the macro with an snprintf() like function which will format the timestamp, and use an explicit % to correct the display. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Make logtime_fmt() static] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Some corrections for timespec_diff_us	David Gibson	2024-08-07	2	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The comment for timespec_diff_us() claims it will wrap after 2^64µs. This is incorrect for two reasons: * It returns a long long, which is probably 64-bits, but might not be * It returns a signed value, so even if it is 64 bits it will wrap after 2^63µs Correct the comment and use an explicitly 64-bit type to avoid that imprecision. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, pasta: Make -g and -a skip route/addresses copy for matching IP ↵	Stefano Brivio	2024-08-07	4	-22/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	version only Paul reports that setting IPv4 address and gateway manually, using --address and --gateway, causes pasta to fail inserting IPv6 routes in a setup where multiple, inter-dependent IPv6 routes are present on the host. That's because, currently, any -g option implies --no-copy-routes altogether, and any -a implies --no-copy-addrs. Limit this implication to the matching IP version, instead, by having two copies of no_copy_routes and no_copy_addrs in the context structure, separately for IPv4 and IPv6. While at it, change them to 'bool': we had them as 'int' because getopt_long() used to set them directly, but it hasn't been the case for a while already. Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	log, passt: Keep printing to stderr when passt is running in foreground2024_08_06.ee36266	Stefano Brivio	2024-08-06	3	-9/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are two cases where we want to stop printing to stderr: if it's closed, and if pasta spawned a shell (and --debug wasn't given). But if passt is running in foreground, we currently stop to report any message, even error messages, once we're ready, as reported by Laurent, because we set the log_runtime flag, which we use to indicate we're ready, regardless of whether we're running in foreground or not. Turn that flag (back) to log_stderr, and set it only when we really want to stop printing to stderr. Reported-by: Laurent Vivier <lvivier@redhat.com> Fixes: afd9cdc9bb48 ("log, passt: Always print to stderr before initialisation is complete") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Fix side in OUT_WAIT flag setting	Stefano Brivio	2024-08-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the "from" (input) side for a given transfer is 0, and we can't complete the write right away, what we need to be waiting for is for output readiness on side 1, not 0, and the other way around as well. This causes random transfer failures for local TCP connections, depending if we ever need to wait for output readiness. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/issues/23517 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com>
*	util: Use unsigned (size_t) value for iov length	David Gibson	2024-08-06	2	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	The "correct" type for the length of an IOV is unclear: writev() and readv() use an int, but sendmsg() and recvmsg() use a size_t. Using the unsigned size_t has some advantages, though, and it makes more sense for the case of write_remainder. Using size_t throughout here means we don't have a signed vs. unsigned comparison, and we don't have to deal with the case of iov_skip_bytes() returning a value which becomes negative when assigned to an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp_flow: move all udp_flow functions to udp_flow.c	Laurent Vivier	2024-08-05	4	-261/+284
\| \| \| \| \| \| \| \| \| \| \|	No code change. They need to be exported to be available by the vhost-user version of passt. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp_flow: Remove udp_meta_t from the parameters of udp_flow_from_sock()	Laurent Vivier	2024-08-05	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \|	To be used with the vhost-user version of udp.c, we need to export the udp_flow functions. To avoid to export udp_meta_t too that is specific to the socket version of udp.c, don't pass udp_meta_t to it, but the only needed field, s_in. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Make logfile_write() private	David Gibson	2024-08-05	2	-171/+171
\| \| \| \| \| \| \| \| \| \|	logfile_write() is not used outside log.c, nor should it be. It should only be used externall via the general logging functions. Make it static in log.c. To avoid forward declarations this requires moving a bunch of functions earlier in the file. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Save errno on signal handler entry, restore on return when needed	Stefano Brivio	2024-08-05	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ed reported this: # Error: pasta failed with exit code 1: # Couldn't drop cap 3 from bounding set # : No child processes in a Podman CI run with tests being run in parallel. The error message itself, by the way, is fixed by commit 1cd773081f12 ("log: Drop newlines in the middle of the perror()-like messages"), but how can we possibly get ECHILD as failure code for prctl()? Well, we don't, but if we exit early enough, pasta_child_handler() might run before we're even done with isolation steps, and it calls waitid(), which sets errno. We need to restore it before returning from the signal handler (if we return after calling functions that might set it), as signal-safety(7) also implies: Fetching and setting the value of errno is async-signal-safe provided that the signal handler saves errno on entry and restores its value before returning. Eventually, we'll probably need to switch to signalfd(2) the day we want to implement multithreading, but this will do for the moment. Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/23478 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>