passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	util: Rename write_pidfile() to pidfile_write()	Stefano Brivio	2024-05-23	3	-5/+5
\| \| \| \| \| \| \| \|	As I'm adding pidfile_open() in the next patch. The function comment didn't match, by the way. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	tap: Split tap_sock_unix_init() into opening and listening parts	Stefano Brivio	2024-05-23	1	-12/+27
\| \| \| \| \| \| \| \| \|	We'll need to open and bind the socket a while before listening to it, so split that into two different functions. No functional changes intended. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	passt, tap: Don't use -1 as uninitialised value for fd_tap_listen	Stefano Brivio	2024-05-23	2	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a remnant from the time we kept access to the original filesystem and we could reinitialise the listening AF_UNIX socket. Since commit 0515adceaa8f ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application"), however, we can't re-bind the listening socket once we're up and running. Drop the -1 initalisation and the corresponding check. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Move all-ones initialisation of mac_guest to tap_sock_init()	Stefano Brivio	2024-05-23	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It has nothing to do with tap_sock_unix_init(). It used to be there as that function could be called multiple times per passt instance, but it's not the case anymore. This also takes care of the fact that, with --fd, we wouldn't set the initial MAC address, so we would need to wait for the guest to send us an ARP packet before we could exchange data. Fixes: 6b4e68383c66 ("passt, tap: Add --fd option") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Richard W.M. Jones <rjones@redhat.com>
*	conf: Don't lecture user about starting us as root	Stefano Brivio	2024-05-23	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	libguestfs tools have a good reason to run as root: if the guest image is owned by root, it would be counterproductive to encourage users to invoke them as non-root, as it would require changing permissions or ownership of the image file. And if they run as root, we'll start as root, too. Warn users we'll switch to 'nobody', but don't tell them what to do. Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	netlink, test: Ignore deprecated addresses	David Gibson	2024-05-22	6	-7/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we retrieve or copy host addresses we can include deprecated addresses, which is not what we want. Adjust our logic to exclude them. Similarly our tests can retrieve deprecated addresses, so exclude them there too. I hit this in practice because my router sometimes temporarily advertises an fd00:: prefix before the real delegated IPv6 prefix. The deprecated address can hang around for some time messing up my tests. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove interim 'tapside' field from connection	David Gibson	2024-05-22	2	-8/+6
\| \| \| \| \| \| \| \| \| \|	We recently introduced this field to keep track of which side of a TCP flow is the guest/tap facing one. Now that we generically record which pif each side of each flow is connected to, we can easily derive that, and no longer need to keep track of it explicitly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Record the pifs for each side of each flow	David Gibson	2024-05-22	7	-16/+109
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we have no generic information flows apart from the type and state, everything else is specific to the flow type. Start introducing generic flow information by recording the pifs which the flow connects. To keep track of what information is valid, introduce new flow states: INI for when the initiating side information is complete, and TGT for when both sides information is complete, but we haven't chosen the flow type yet. For now, these states don't do an awful lot, but they'll become more important as we add more generic information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make side 0 always be the initiating side	David Gibson	2024-05-22	7	-27/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Each flow in the flow table has two sides, 0 and 1, representing the two interfaces between which passt/pasta will forward data for that flow. Which side is which is currently up to the protocol specific code: TCP uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except for spliced connections where it uses 0 for the initiating side and 1 for the target side. ICMP also uses 0 for the host/"sock" side and 1 for the guest/"tap" side, but in its case the latter is always also the initiating side. Make this generically consistent by always using side 0 for the initiating side and 1 for the target side. This doesn't simplify a lot for now, and arguably makes TCP slightly more complex, since we add an extra field to the connection structure to record which is the guest facing side. This is an interim change, which we'll be able to remove later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>q Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Clarify and enforce flow state transitions	David Gibson	2024-05-22	6	-69/+183
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Flows move over several different states in their lifetime. The rules for these are documented in comments, but they're pretty complex and a number of the transitions are implicit, which makes this pretty fragile and error prone. Change the code to explicitly track the states in a field. Make all transitions explicit and logged. To the extent that it's practical in C, enforce what can and can't be done in various states with ASSERT()s. While we're at it, tweak the docs to clarify the restrictions on each state a bit. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Better helpers for using inany and specific family addrs together	David Gibson	2024-05-22	3	-37/+106
\| \| \| \| \| \| \| \| \|	This adds some extra inany helpers for comparing an inany address to addresses of a specific family (including special addresses), and building an inany from an IPv4 address (either statically or at runtime). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Properly type callbacks to protocol specific handlers	David Gibson	2024-05-22	6	-25/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The flow dispatches deferred and timer handling for flows centrally, but needs to call into protocol specific code for the handling of individual flows. Currently this passes a general union flow *. It makes more sense to pass the specific relevant flow type structure. That brings the check on the flow type adjacent to casting to the union variant which it tags. Arguably, this is a slight abstraction violation since it involves the generic flow code using protocol specific types. It's already calling into protocol specific functions, so I don't think this really makes any difference. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util, tcp: Add helper to display socket addresses	David Gibson	2024-05-22	3	-14/+79
\| \| \| \| \| \| \| \| \| \| \|	When reporting errors, we sometimes want to show a relevant socket address. Doing so by extracting the various relevant fields can be pretty awkward, so introduce a sockaddr_ntop() helper to make it simpler. For now we just have one user in tcp.c, but I have further upcoming patches which can make use of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: Fix passt abstraction	Maxime Bélair	2024-05-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit b686afa2 introduced the invalid apparmor rule `mount options=(rw, runbindable) /,` since runbindable mount rules cannot have a source. Therefore running aa-logprof/aa-genprof will trigger errors (see https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685) $ sudo aa-logprof ERROR: Operation {'runbindable'} cannot have a source. Source = AARE('/') This patch fixes it to the intended behavior. Link: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685 Fixes: b686afa23e85 ("apparmor: Explicitly pass options we use while remounting root filesystem") Signed-off-by: Maxime Bélair <maxime.belair@canonical.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: allow netns paths on /tmp	Paul Holzinger	2024-05-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For some unknown reason "owner" makes it impossible to open bind mounted netns references as apparmor denies it. In the kernel denied log entry we see ouid=0 but it is not clear why that is as the actual file is owned by the real (rootless) user id. In abstractions/pasta there is already `@{run}/user/@{uid}/**` without owner set for the same reason as this path contains the netns path by default when running under Podman. Fixes: 72884484b00d ("apparmor: allow read access on /tmp for pasta") Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	clang-tidy: Suppress macro to enum conversion warnings	David Gibson	2024-05-13	1	-1/+8
\| \| \| \| \| \| \| \| \| \|	clang-tidy 18.1.1 in Fedora 40 complains about every #define of an integral value, suggesting it be converted to an enum. Although that's certainly possible, it's of dubious value and results in some awkward arrangements on out codebase. Suppress it globally. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Fix clang-tidy warning about using an undefined enum value	David Gibson	2024-05-13	2	-2/+3
\| \| \| \| \| \| \| \| \| \|	In conf() we temporarily set the forwarding mode variables to 0 - an invalid value, so that we can check later if they've been set by the intervening logic. clang-tidy 18.1.1 in Fedora 40 now complains about this. Satisfy it by giving an name in the enum to the 0 value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt.c: explicitly include libgen.h for basename	lemmi	2024-05-13	1	-0/+1
\| \| \| \| \| \| \| \|	fixes implicit declaration warning on musl Signed-off-by: lemmi <lemmi@nerd2nerd.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Don't duplicate routes referring to unrelated host interfaces	Stefano Brivio	2024-05-11	1	-6/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We take care of this in nl_addr_dup(): if the interface index associated to an address doesn't match the selected host interface (ifa->ifa_index != ifi_src), we don't copy that address. But for routes, we just unconditionally update the interface index to match the index in the target namespace, even if the source interface didn't match. This might happen in two cases: with a pre-4.20 kernel without support for NETLINK_GET_STRICT_CHK, which won't filter routes based on the interface we pass in the request, as reported by runsisi, and any kernel with support for multipath routes where any of the nexthops refers to an unrelated host interface. In both cases, check the index of the source interface, and avoid copying unrelated routes. Reported-by: runsisi <runsisi@hust.edu.cn> Link: https://bugs.passt.top/show_bug.cgi?id=86 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: runsisi <runsisi@hust.edu.cn> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	apparmor: allow read access on /tmp for pasta2024_05_10.7288448	Paul Holzinger	2024-05-10	1	-2/+3
\| \| \| \| \| \| \| \| \| \|	The podman CI on debian runs tests based on /tmp but pasta is failing there because it is unable to open the netns path as the open for read access is denied. Link: https://github.com/containers/podman/issues/22625 Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Set OUT_WAIT_ flag whenever pipe isn't emptied	Stefano Brivio	2024-05-10	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tcp_splice_sock_handler(), if we get EAGAIN on the second splice(), from pipe to receiving socket, that doesn't necessarily mean that the pipe is empty: the receiver buffer might be full instead. Hence, we can't use the 'never_read' flag to decide that there's nothing to wait for: even if we didn't read anything from the sending side in a given iteration, we might still have data to send in the pipe. Use read/written counters, instead. This fixes an issue where large bulk transfers would occasionally hang. From a corresponding strace: 0.000061 epoll_wait(4, [{events=EPOLLOUT, data={u32=29442, u64=12884931330}}], 8, 1000) = 1 0.005003 epoll_ctl(4, EPOLL_CTL_MOD, 211, {events=EPOLLIN\|EPOLLRDHUP, data={u32=54018, u64=8589988610}}) = 0 0.000089 epoll_ctl(4, EPOLL_CTL_MOD, 115, {events=EPOLLIN\|EPOLLRDHUP, data={u32=29442, u64=12884931330}}) = 0 0.000081 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 0.000073 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = 1048576 0.000087 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 0.000045 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = 520415 0.000060 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 0.000044 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 0.000044 epoll_wait(4, [], 8, 1000) = 0 we're reading from socket 211 into to the pipe end numbered 151, which connects to pipe end 150, and from there we're writing into socket 115. We initially drop EPOLLOUT from the set of monitored flags for socket 115, because it already signaled it's ready for output. Then we read nothing from socket 211 (the sender had nothing to send), and we keep emptying the pipe into socket 115 (first 1048576 bytes, then 520415 bytes). This call of tcp_splice_sock_handler() ends with EAGAIN on the writing side, and we just exit this function without setting the OUT_WAIT_1 flag (and, in turn, EPOLLOUT for socket 115). However, it turns out, the pipe wasn't actually emptied, and while socket 211 had nothing more to send, we should have waited on socket 115 to be ready for output again. As a further step, we could consider not clearing EPOLLOUT at all, unless the read/written counters match, but I'm first trying to fix this ugly issue with a minimal patch. Link: https://github.com/containers/podman/issues/22575 Link: https://github.com/containers/podman/issues/22593 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Single buffer for IPv4, IPv6 headers and metadata	David Gibson	2024-05-02	1	-77/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we have separate arrays for IPv4 and IPv6 which contain the headers for guest-bound packets, and also the originating socket address. We can combine these into a single array of "metadata" structures with space for both pre-cooked IPv4 and IPv6 headers, as well as shared space for the tap specific header and socket address (using sockaddr_inany). Because we're using IOVs to separately address the pieces of each frame, these structures don't need to be packed to keep the headers contiguous so we can more naturally arrange for the alignment we want. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use the same buffer for the L2 header for all frames	David Gibson	2024-05-02	1	-21/+15
\| \| \| \| \| \| \| \| \| \|	Currently each tap-bound frame buffer has room for its own ethernet header. However the ethernet header is always the same for such frames, so now that we're indirectly referencing the ethernet header via iov, we can use a single buffer for all of them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Share payload buffers between IPv4 and IPv6	David Gibson	2024-05-02	1	-59/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently the IPv4 and IPv6 paths unnecessarily use different buffers for the UDP payload. Now that we're handling the various pieces of the UDP packets with an iov, we can split the payload part of the buffers off into its own array shared between IPv4 and IPv6. As well as saving a little memory, this allows the payload buffers to be neatly page aligned. With the buffers merged, udp[46]_l2_iov_sock contain exactly the same thing as each other and can also be merged. Likewise udp[46]_iov_splice can be merged together. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Explicitly set checksum in guest-bound UDP headers	David Gibson	2024-05-02	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For IPv4, UDP checksums are optional and can just be set to 0. udp_update_hdr4() ignores the checksum field entirely. Since these are set to 0 during startup, this works as intended for now. However, we'd like to share payload and UDP header buffers betweem IPv4 and IPv6, which does calculate UDP checksums. Therefore, for robustness, we should explicitly set the checksum field to 0 for guest-bound UDP packets. In the tap_udp4_send() slow path, however, we do allow IPv4 UDP checksums to be calculated as a compile time option. For consistency, use the same thing in the udp_update_hdr4() path, which will typically initialize to 0, but calculate a real checksum if configured to do so. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Combine initialisation of IPv4 and IPv6 iovs	David Gibson	2024-05-02	1	-61/+53
\| \| \| \| \| \| \| \| \| \|	We're going to introduce more sharing between the IPv4 and IPv6 buffer structures. Prepare for this by combinng the initialisation functions. While we're at it remove the misleading "sock" from the name since these initialise both tap side and sock side structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split tap-bound UDP packets into multiple buffers using io vector	David Gibson	2024-05-02	2	-63/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When sending to the tap device, currently we assemble the headers and payload into a single contiguous buffer. Those are described by a single struct iovec, then a batch of frames is sent to the device with tap_send_frames(). In order to better integrate the IPv4 and IPv6 paths, we want the IP header in a different buffer that might not be contiguous with the payload. To prepare for that, split the UDP packet into an iovec of buffers. We use the same split that Laurent recently introduced for TCP for convenience. This removes the last use of tap_hdr_len_(), tap_frame_base() and tap_frame_len(), so remove those too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Allow sftp via vsock-ssh in tests	David Gibson	2024-05-02	1	-2/+4
\| \| \| \| \| \| \| \| \| \|	During some debugging recently, I wanted to extact a file from a test guest and found it was tricky, since the ssh-over-vsock setup we had didn't allow sftp/scp. We can fix this by adding a line to the guest side sshd config from mbuto. While we're there correct an inaccurate comment. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Update tap specific header too in tcp_fill_headers[46]()	David Gibson	2024-05-02	1	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_fill_headers[46]() fill most of the headers, but the tap specific header (the frame length for qemu sockets) is filled in afterwards. Filling this as well: * Removes a little redundancy between the tcp_send_flag() and tcp_data_to_tap() path * Makes calculation of the correct length a little easier * Removes the now misleadingly named 'vnet_len' variable in tcp_send_flag() Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	iov: Helper macro to construct iovs covering existing variables or fields	David Gibson	2024-05-02	4	-21/+16
\| \| \| \| \| \| \| \| \| \| \|	Laurent's recent changes mean we use IO vectors much more heavily in the TCP code. In many of those cases, and few others around the code base, individual iovs of these vectors are constructed to exactly cover existing variables or fields. We can make initializing such iovs shorter and clearer with a macro for the purpose. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap, tcp: (Re-)abstract TAP specific header handling	David Gibson	2024-05-02	2	-25/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recent changes to the TCP code (reworking of the buffer handling) have meant that it now (again) deals explicitly with the MODE_PASST specific vnet_len field, instead of using the (partial) abstractions provided by the tap layer. The abstractions we had don't work for the new TCP structure, so make some new ones that do: tap_hdr_iov() which constructs an iovec suitable for containing (just) the TAP specific header and tap_hdr_update() which updates it as necessary per-packet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Simplify packet length calculation when preparing headers	David Gibson	2024-05-02	1	-16/+10
\| \| \| \| \| \| \| \| \|	tcp_fill_headers[46]() compute the L3 packet length from the L4 packet length, then their caller tcp_l2_buf_fill_headers() converts it back to the L4 packet length. We can just use the L4 length throughout. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>eewwee Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise variable names for various packet lengths	David Gibson	2024-05-02	12	-193/+194
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At various points we need to track the lengths of a packet including or excluding various different sets of headers. We don't always use the same variable names for doing so. Worse in some places we use the same name for different things: e.g. tcp_fill_headers[46]() use ip_len for the length including the IP headers, but then tcp_send_flag() which calls it uses it to mean the IP payload length only. To improve clarity, standardise on these names: dlen: L4 protocol payload length ("data length") l4len: plen + length of L4 protocol header l3len: l4len + length of IPv4/IPv6 header l2len: l3len + length of L2 (ethernet) header Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Make csum_ip4_header() take a host endian length	David Gibson	2024-05-02	4	-6/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	csum_ip4_header() takes the packet length as a network endian value. In general it's very error-prone to pass non-native-endian values as a raw integer. It's particularly bad here because this differs from other checksum functions (e.g. proto_ipv4_header_psum()) which take host native lengths. It turns out all the callers have easy access to the native endian value, so switch it to use host order like everything else. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Remove misleading and redundant endianness notes	David Gibson	2024-05-02	3	-16/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In general, it's much less error-prone to have the endianness of values implied by the type, rather than just noting it in comments. We can't always easily avoid it, because C, but we can do so when possible. struct in_addr and in6_addr are always encoded network endian, so noting it explicitly isn't useful. Remove them. In some cases we also have endianness notes on uint8_t parameters, which doesn't make sense: for a single byte endianness is irrelevant. Remove those too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Remove unused structs tap_msg, tap_l4_msg	David Gibson	2024-05-02	1	-20/+0
\| \| \| \| \| \| \| \| \|	Use of these structures was removed in bb708111833e ("treewide: Packet abstraction with mandatory boundary checks"). Remove the stale declarations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Split tap specific and L2 (ethernet) headers	David Gibson	2024-05-02	2	-21/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In some places (well, actually only UDP now) we use struct tap_hdr to represent both tap backend specific and L2 ethernet headers. Handling these together seemed like a good idea at the time, but Laurent's changes in the TCP code working towards vhost-user support suggest that treating them separately is more useful, more often. Alter struct tap_hdr to represent only the TAP backend specific headers. Updated related helpers and the UDP code to match. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Use proto_ipv6_header_psum() for ICMPv6 as well	David Gibson	2024-05-02	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \|	7df624e79 ("checksum: introduce functions to compute the header part checksum for TCP/UDP") introduced a helper to compute the partial checksum for the IPv6 pseudo-header used in L4 protocol checksums. It used it in csum_udp6() for UDP packets, but not in csum_icmp6() for the identical calculation in csum_icmp6() for ICMPv6 packets. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Fix iterations over nexthop objects	Stefano Brivio	2024-05-02	1	-3/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Somewhat confusingly, RTNH_NEXT(), as defined by <linux/rtnetlink.h>, doesn't take an attribute length parameter like RTA_NEXT() does, and I just modelled loops over nexthops after RTA loops, forgetting to decrease the remaining length we pass to RTNH_OK(). In practice, this didn't cause issue in any of the combinations I checked, at least without the next patch. We seem to be the only user of RTNH_OK(): even iproute2 has an open-coded version of it in print_rta_multipath() (ip/iproute.c). Introduce RTNH_NEXT_AND_DEC(), similar to RTA_NEXT(), and use it. Fixes: 6c7623d07bbd ("netlink: Add support to fetch default gateway from multipath routes") Fixes: f4e38b5cd232 ("netlink: Adjust interface index inside copied nexthop objects too") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Use IFA_F_NODAD also while duplicating addresses from the host2024_04_26.d03c4e2	Stefano Brivio	2024-04-26	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...not just for a single set address (legacy operation with --no-copy-addrs). I forgot to add this to nl_addr_dup(). Note that we can have two version of flags: the 8-bit ifa_flags in ifaddrmsg, and the newer 32-bit version as IFA_FLAGS attribute, which is given priority if present. Make sure IFA_F_NODAD is set in both. Without this, a Podman user reports, something on the lines of: pasta --config-net -- ping -c1 -6 passt.top would fail as the kernel would start Duplicate Address Detection once we configure the address, which can't really work (and doesn't make sense) in this case. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: For IPv4, IFA_LOCAL is the interface address, not IFA_ADDRESS	Stefano Brivio	2024-04-26	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	See the comment to the unnamed enum in linux/if_addr.h, which currently states: /* * Important comment: * IFA_ADDRESS is prefix address, rather than local interface address. * It makes no difference for normally configured broadcast interfaces, * but for point-to-point IFA_ADDRESS is DESTINATION address, * local address is supplied in IFA_LOCAL attribute. * * [...] */ if we fetch IFA_ADDRESS, and we have a point-to-point link with a peer address configured, we'll source the peer address as "our" address, and refuse to resolve it in arp(). This was reported with pasta and a tun upstream interface configured by OpenVPN in "p2p" topology: the target namespace will have similar addresses and routes as the host, which is fine, and will try to resolve the point-to-point peer address (because it's the default gateway). Given that we configure it as our address (only internally, not visibly in the namespace), we'll fail to resolve that and traffic doesn't go anywhere. Note that this is not the case for IPv6: there, IFA_ADDRESS is the actual, local address of the interface, and IFA_LOCAL is not necessarily present, so the comment in linux/if_addr.h doesn't apply either. Link: https://github.com/containers/podman/issues/22320 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Make log truncation test more robust	David Gibson	2024-04-25	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	test/pasta_options/log_to_file checks that pasta truncates its log file when started. It does that by starting pasta with a log file once, then starting it again and checking that after the second round, the log file has only one line: the startup banner from the second invocation. However, this test will break if the second invocation logs any additional messages at startup. This can easily happen on a host with multiple network interfaces due to the "Multiple default route" informational messages added in 639fdf06e ("netlink: Fix selection of template interface"). I believe it could also happen on a host without IPv6 connectivity due to the "Couldn't pick external interface" messages, though I haven't confirmed this. Make the log file test more robust, by not testing for a single line, but instead explicitly testing for the PID of the second pasta invocation in the banner line. Link: https://bugs.passt.top/show_bug.cgi?id=88 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Slight simplification to pasta log tests	David Gibson	2024-04-25	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	test/pasta_options/log_to_file contains a couple of rudimentary tests where we start pasta with an interactive shell, then immediately exit it. We can achieve the same thing by using /bin/true as the command to pasta. This also means that waiting for pasta to start, waiting for the executed command to complete and for pasta to clean up are all handled by simply waiting for pasta to complete in the foreground, so there's no need for an additional sleep. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Correctly look up outbound socket with port remappings	David Gibson	2024-04-25	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit bb9bf0bb ("tcp, udp: Don't precompute port remappings in epoll references") changed the epoll reference for UDP sockets to include the bound port as seen by the socket itself, rather than the bound port it would be translated to on the guest side. As a side effect, it also means that udp_tap_map[] is indexed by the bound port on the host side, rather than on the guest side. This is consistent and a good idea, however we forgot to account for it when finding the correct outgoing socket for packets originating in the guest. This means that if forwarding UDP inbound with a port number change, reply packets would be misdirected. Fix this by applying the reverse mapping before looking up the socket in udp_tap_handler(). While we're at it, use 'port' directly instead of 'uref.port' in udp_sock_init(). Those now always have the same value - failing to realise that is the same error as above. Reported-by: Laurent Jacquot <jk@lutty.net> Link: https://bugs.passt.top/show_bug.cgi?id=87 Fixes: bb9bf0bb8f57 ("tcp, udp: Don't precompute port remappings in epoll references") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Replace TCP buffer structure by an iovec array	Laurent Vivier	2024-04-19	2	-268/+264
\| \| \| \| \| \| \| \| \| \| \| \| \|	To be able to provide pointers to TCP headers and IP headers without worrying about alignment in the structure, split the structure into several arrays and point to each part of the frame using an iovec array. Using iovec also allows us to simply ignore the first entry when the vnet length header is not needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Don't fail if the template interface doesn't have a MAC address	Stefano Brivio	2024-04-19	2	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...simply resort to using locally-administered address (LAA) as host-side source, instead. Pick 02:00:00:00:00:00, to make it clear that we don't actually care about that address, and also to match the 00 (Administratively Assigned Identifier) quadrant of SLAP (RFC 8948). Otherwise, pasta refuses to start if the template is a tun or Wireguard interface. Link: https://bugs.passt.top/show_bug.cgi?id=49 Link: https://github.com/containers/podman/issues/22320 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: We're interested in the MAC address, not in the MAC itself	Stefano Brivio	2024-04-19	1	-2/+2
\| \| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	pasta, util: Align stack area for clones to maximum natural alignment	Stefano Brivio	2024-04-19	2	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Given that we use this stack pointer as a location to store arbitrary data types from the cloned process, we need to guarantee that its alignment matches any of those possible data types. runsisi reports that pasta gets a SIGBUS in pasta_open_ns() on aarch64, where the alignment requirement for stack pointers is a 16 bytes (same as the size of a long double), and similar requirements actually apply to most architectures we run on. Reported-by: runsisi <runsisi@hust.edu.cn> Link: https://bugs.passt.top/show_bug.cgi?id=85 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Compilers' name for armv6l and armv7l is "arm"	Stefano Brivio	2024-04-11	5	-9/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When I switched from 'uname -m' to 'gcc -dumpmachine' to fetch the architecture name for, among others, seccomp.sh, I didn't realise that "armv6l" and "armv7l" are just Linux kernel names -- compilers just call that "arm". Fix the "syscalls" annotation we use to define seccomp profiles accordingly, otherwise pasta will be terminated on sigreturn() on armv6l and armv7l. Fixes: 213c397492bd ("passt, pasta: Run-time selection of AVX2 build") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Verify that podman tests are using the pasta binary we expect2024_04_05.954589b	David Gibson	2024-04-05	1	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \|	Paul Holzinger pointed out that when we invoke the podman tests inside the passt testsuite, the way we point podman at the newly built pasta binary is kind of indirect. It's therefore prudent to check that podman is actually using the binary we expect it to - in particular that it is using the binary built in this tree, not some system installed pasta binary. Suggested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>