passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	apparmor: Allow read access to /proc/sys/net/ipv4/ip_local_port_range2024_09_06.6b38f07	Stefano Brivio	13 days	1	-0/+2
\| \| \| \| \| \| \|	...for both passt and pasta: use passt's abstraction for this. Fixes: eedc81b6ef55 ("fwd, conf: Probe host's ephemeral ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Allow read access to /proc/sys/net/ipv4/ip_local_port_range	Stefano Brivio	13 days	2	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since commit eedc81b6ef55 ("fwd, conf: Probe host's ephemeral ports"), we might need to read from /proc/sys/net/ipv4/ip_local_port_range in both passt and pasta. While pasta was already allowed to open and write /proc/sys/net entries, read access was missing in SELinux's type enforcement: add that. In passt, instead, this is the first time we need to access an entry there: add everything we need. Fixes: eedc81b6ef55 ("fwd, conf: Probe host's ephemeral ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Don't risk truncating frames on full buffer in tap_pasta_input()	David Gibson	14 days	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder. Adjust the code to make sure we always have room for a maximum size frame. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Restructure in tap_pasta_input()	David Gibson	14 days	1	-26/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue. The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop. Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Improve handling of EINTR in tap_passt_input()	David Gibson	14 days	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewer EPOLLIN event immediately in that case, since we're using level triggered signalling. So, handle EINTR separately by immediately retrying until we succeed or get a different type of error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Split out handling of EPOLLIN events	David Gibson	14 days	1	-14/+36
\| \| \| \| \| \| \| \| \| \|	Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Fix order of operands and carry of one second in timespec_diff_us()	Stefano Brivio	14 days	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the nanoseconds of the minuend timestamp are less than the nanoseconds of the subtrahend timestamp, we need to carry one second in the subtraction. I subtracted this second from the minuend, but didn't actually carry it in the subtraction of nanoseconds, and logged timestamps would jump back whenever we switched to the first branch of timespec_diff_us() from the second one. Most likely, the reason why I didn't carry the second is that I instinctively thought that swapping the operands would have the same effect. But it doesn't, in general: that only happens with arithmetic in modulo powers of 2. Undo the swap as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings	David Gibson	14 days	2	-7/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cppcheck-2.15.0 has apparently broadened when it throws a warning about redundant initialization to include some cases where we have an initializer for some fields, but then set other fields in the function body. This is arguably a false positive: although we are technically overwriting the zero-initialization the compiler supplies for fields not explicitly initialized, this sort of construct makes sense when there are some fields we know at the top of the function where the initializer is, but others that require more complex calculation. That said, in the two places this shows up, it's pretty easy to work around. The results are arguably slightly clearer than what we had, since they move the parts of the initialization closer together. So do that rather than having ugly suppressions or dealing with the tedious process of reporting a cppcheck false positive. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Use EPOLLET for any state of not established connections	Stefano Brivio	14 days	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, for not established connections, we monitor sockets with edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state (outbound connection being established) but not in the TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK to the container/guest). While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted another possibility for a short EPOLLRDHUP storm (10 seconds), which doesn't seem to happen in actual use cases, but I could reproduce it: start a connection from a container, while dropping (using netfilter) ACK segments coming out of the container itself. On the server side, outside the container, accept the connection and shutdown the writing side of it immediately. At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't have any reasonable way to handle it other than waiting for the tap side to complete the three-way handshake. So we'll just keep getting this EPOLLRDHUP until the SYN_TIMEOUT kicks in. Always enable EPOLLET when EPOLLRDHUP is the only epoll event we subscribe to: in this case, getting multiple EPOLLRDHUP reports is totally useless. In the only remaining non-established state, SOCK_ACCEPTED, for inbound connections, we're anyway discarding EPOLLRDHUP events until we established the conection, because we don't know what to do with them until we get an answer from the tap side, so it's safe to enable EPOLLET also in that case. Link: https://bugs.passt.top/show_bug.cgi?id=94 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Handle more error conditions in udp_sock_errs()	David Gibson	14 days	1	-1/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_errs() reads out everything in the socket error queue. However we've seen some cases[0] where an EPOLLERR event is active, but there isn't anything in the queue. One possibility is that the error is reported instead by the SO_ERROR sockopt. Check for that case and report it as best we can. If we still get an EPOLLERR without visible error, we have no way to clear the error state, so treat it as an unrecoverable error. [0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010 Link: https://bugs.passt.top/show_bug.cgi?id=95 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Treat errors getting errors as unrecoverable	David Gibson	14 days	1	-10/+17
\| \| \| \| \| \| \| \| \| \|	We can get network errors, usually transient, reported via the socket error queue. However, at least theoretically, we could get errors trying to read the queue itself. Since we have no idea how to clear an error condition in that case, treat it as unrecoverable. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split socket error handling out from udp_sock_recv()	David Gibson	14 days	1	-6/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently udp_sock_recv() both attempts to clear socket errors and read a batch of datagrams for forwarding. That made sense initially, since both listening and reply sockets need to do this. However, we have certain error cases which will add additional complexity to the error processing. Furthermore, if we ever wanted to more thoroughly handle errors received here - e.g. by synthesising ICMP messages on the tap device - it will likely require different handling for the listening and reply socket cases. So, split handling of error events into its own udp_sock_errs() function. While we're there, allow it to report "unrecoverable errors". We don't have any of these so far, but some cases we're working on might require it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Helpers to log details of a flow	David Gibson	14 days	2	-17/+38
\| \| \| \| \| \| \| \| \| \|	The details of a flow - endpoints, interfaces etc. - can be pretty important for debugging. We log this on flow state transitions, but it can also be useful to log this when we report specific conditions. Add some helper functions and macros to make it easy to do that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Allow UDP flows to be prematurely closed	David Gibson	14 days	3	-2/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unlike TCP, UDP has no in-band signalling for the end of a flow. So the only way we remove flows is on a timer if they have no activity for 180s. However, we've started to investigate some error conditions in which we want to prematurely abort / abandon a UDP flow. We can call udp_flow_close(), which will make the flow inert (sockets closed, no epoll events, can't be looked up in hash). However it will still wait 3 minutes to clear away the stale entry. Clean this up by adding an explicit 'closed' flag which will cause a flow to be more promptly cleaned up. We also publish udp_flow_close() so it can be called from other places to abort UDP flows(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Fix incorrect hash probe in flowside_lookup()	David Gibson	14 days	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Our flow hash table uses linear probing in which we step backwards through clusters of adjacent hash entries when we have near collisions. Usually that's implemented by flow_hash_probe(). However, due to some details we need a second implementation in flowside_lookup(). An embarrassing oversight in rebasing from earlier versions has mean that version is incorrect, trying to step forward through clusters rather than backward. In situations with the right sorts of has near-collisions this can lead to us not associating an ACK from the tap device with the right flow, leaving it in a not-quite-established state. If the remote peer does a shutdown() at the right time, this can lead to a storm of EPOLLRDHUP events causing high CPU load. Fixes: acca4235c46f ("flow, tcp: Generalise TCP hash table to general flow hash table") Link: https://bugs.passt.top/show_bug.cgi?id=94 Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Don't prefix log file messages with time and severity if they're ↵	Stefano Brivio	14 days	1	-5/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	continuations In fecb1b65b1ac ("log: Don't prefix message with timestamp on --debug if it's a continuation"), I fixed this for --debug on standard error, but not for log files: if messages are continuations, they shouldn't be prefixed by timestamp and severity. Otherwise, we'll print stuff like this: 0.0028: ERROR: Receive error on guest connection, reset0.0028: ERROR: : Bad file descriptor Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Makefile: Enable _FORTIFY_SOURCE iff needed	Michal Privoznik	2024-08-29	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On some systems source fortification is enabled whenever code optimization is enabled (e.g. with -O2). Since code fortification is explicitly enabled too (with possibly different value than the system wants, there are three levels [1]), distros are required to patch our Makefile, e.g. [2]. Detect whether fortification is not already enabled and enable it explicitly only if really needed. 1: https://www.gnu.org/software/libc/manual/html_node/Source-Fortification.html 2: https://github.com/gentoo/gentoo/commit/edfeb8763ac56112c59248c62c9cda13e5d01c97 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd, conf: Probe host's ephemeral ports	David Gibson	2024-08-29	3	-2/+61
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we forward "all" ports (-t all or -u all), or use an exclude-only range, we don't actually forward all ports - that wouln't leave local ports to use for outgoing connections. Rather we forward all non-ephemeral ports - those that won't be used for outgoing connections or datagrams. Currently we assume the range of ephemeral ports is that recommended by RFC 6335, 49152-65535. However, that's not the range used by default on Linux, 32768-60999 but configurable with the net.ipv4.ip_local_port_range sysctl. We can't really know what range the guest will consider ephemeral, but if it differs too much from the host it's likely to cause problems we can't avoid anyway. So, using the host's ephemeral range is a better guess than using the RFC 6335 range. Therefore, add logic to probe the host's ephemeral range, falling back to the RFC 6335 range if that fails. This has the bonus advantage of reducing the number of ports bound by -t all -u all on most Linux machines thereby reducing kernel memory usage. Specifically this reduces kernel memory usage with -t all -u all from ~380MiB to ~289MiB. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, fwd: Don't attempt to forward port 0	David Gibson	2024-08-29	1	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When using -t all, -u all or exclude-only ranges, we'll attempt to forward all non-ephemeral port numbers, including port 0. However, this won't work as intended: bind() treats a zero port not as literal port 0, but as "pick a port for me". Because of the special meaning of port 0, we mostly outright exclude it in our handling. Do the same for setting up forwards, not attempting to forward for port 0. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, fwd: Make ephemeral port logic more flexible	David Gibson	2024-08-29	4	-7/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"Ephemeral" ports are those which the kernel may allocate as local port numbers for outgoing connections or datagrams. Because of that, they're generally not good choices for listening servers to bind to. Thefore when using -t all, -u all or exclude-only ranges, we map only non-ephemeral ports. Our logic for this is a bit rigid though: we assume the ephemeral ports are always a fixed range at the top of the port number space. We also assume PORT_EPHEMERAL_MIN is a multiple of 8, or we won't set the forward bitmap correctly. Make the logic in conf.c more flexible, using a helper moved into fwd.[ch], although we don't change which ports we consider ephemeral (yet). The new handling is undoubtedly more computationally expensive, but since it's a once-off operation at start off, I don't think it really matters. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	seccomp.sh: Try to account for terminal width while formatting list of ↵	Stefano Brivio	2024-08-27	1	-1/+4
\| \| \| \| \| \| \| \| \| \|	system calls Avoid excess lines on wide terminals, but make sure we don't fail if we can't fetch the number of columns for any reason, as it's not a fundamental feature and we don't want to break anything with it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use dual stack sockets for port forwarding when possible	David Gibson	2024-08-27	1	-0/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as well as native IPv6 connections. By doing this we halve the number of listening sockets we need (assuming passt/pasta is listening on the same ports for IPv4 and IPv6). When forwarding many ports (e.g. -u all) this can significantly reduce the amount of kernel memory that passt consumes. We've used such dual stack sockets for TCP since 8e914238b "tcp: Use dual stack sockets for port forwarding when possible". Add similar support for UDP "listening" sockets. Since UDP sockets don't use as much kernel memory as TCP sockets this isn't as big a saving, but it's still significant. When forwarding all TCP and UDP ports for both IPv4 & IPv6 (-t all -u all), this reduces kernel memory usage from ~522 MiB to ~380MiB (kernel version 6.10.6 on Fedora 40, x86_64). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove unnnecessary local from udp_sock_init()	David Gibson	2024-08-27	1	-15/+15
\| \| \| \| \| \| \| \|	The 's' variable is always redundant with either 'r4' or 'r6', so remove it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Merge udp[46]_mh_recv arrays	David Gibson	2024-08-27	2	-39/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We've already gotten rid of most of the IPv4/IPv6 specific data structures in udp.c by merging them with each other. One significant one remains: udp[46]_mh_recv. This was a bit awkward to remove because of a subtle interaction. We initialise the msg_namelen fields to represent the total size we have for a socket address, but when we receive into the arrays those are modified to the actual length of the sockaddr we received. That meant that naively merging the arrays meant that if we received IPv4 datagrams, then IPv6 datagrams, the addresses for the latter would be truncated. In this patch address that by resetting the received msg_namelen as soon as we've found a flow for the datagram. Finding the flow is the only thing that might use the actual sockaddr length, although we in fact don't need it for the time being. This also removes the last use of the 'v6' field from udp_listen_epoll_ref, so remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Look for possible sshd-session paths (if it's there at all) in mbuto's ↵	Stefano Brivio	2024-08-27	1	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	profile Some distributions already have OpenSSH 9.8, which introduces split sshd/sshd-session binaries, and there we need to copy the binary from the host, which can be /usr/libexec/openssh/sshd-session (Fedora Rawhide), /usr/lib/ssh/sshd-session (Arch Linux), /usr/lib/openssh/sshd-session (Debian), and possibly other paths. Add at least those three, and, if we don't find sshd-session, assume we don't need it: it could very well be an older version of OpenSSH, as reported by David for Fedora 40, or perhaps another daemon (would Dropbear even work? I'm not sure). Reported-by: David Gibson <david@gibson.dropbear.id.au> Fixes: d6817b3930be ("test/passt.mbuto: Install sshd-session OpenSSH's split process") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>
*	README: pasta is indeed a supported back-end for rootless Docker2024_08_21.1d6142f	Stefano Brivio	2024-08-21	1	-1/+3
\| \| \| \| \| \| \|	...https://github.com/moby/moby/issues/48257 just reminded me. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Don't stop on unrelated values when looking for --fd in close_open_files()	Stefano Brivio	2024-08-21	2	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Seen with krun: we get a file descriptor via --fd, but we close it and happily use the same number for TCP files. The issue is that if we also get other options before --fd, with arguments, getopt_long() stops parsing them because it sees them as non-option values. Use the - modifier at the beginning of optstring (before :, which is needed to avoid printing errors) instead of +, which means we'll continue parsing after finding unrelated option values, but getopt_long() won't reorder them anyway: they'll be passed with option value '1', which we can ignore. By the way, we also need to add : after F in the optstring, so that we're able to parse the option when given as short name as well. Now that we change the parsing mode between close_open_files() and conf(), we need to reset optind to 0, not to 1, whenever we call getopt_long() again in conf(), so that the internal initialisation of getopt_long() evaluating GNU extensions is re-triggered. Link: https://github.com/slp/krun/issues/17#issuecomment-2294943828 Fixes: baccfb95ce0e ("conf: Stop parsing options at first non-option argument") Fixes: 09603cab28f9 ("passt, util: Close any open file that the parent might have leaked") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Update list of dependencies in README.md	Stefano Brivio	2024-08-21	1	-4/+5
\| \| \| \| \| \| \|	Mostly packages we now need to run Podman-based tests. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, udp: Allow timerfd_gettime64() and recvmmsg_time64() on arm (armhf)	Stefano Brivio	2024-08-21	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	These system calls are needed after the conversion of time_t to 64-bit types on 32-bit architectures. Tested by running some transfer tests with passt and pasta on Debian Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l. Suggested-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Provide own version of close_range(), and no-op fallback	Stefano Brivio	2024-08-21	1	-0/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	musl, as of 1.2.5, and glibc < 2.34 don't ship a (trivial) close_range() implementation. This will probably be added to musl soon, by the way: https://www.openwall.com/lists/musl/2024/08/01/9 Add a weakly-aliased implementation, if it's supported by the kernel. If it's not supported (< 5.9), use a no-op fallback. Looping over 2^31 file descriptors calling close() on them is probably not a good idea. Reported-by: lemmi <lemmi@nerd2nerd.org> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp_flow: Add missing unistd.h include for close()	Stefano Brivio	2024-08-21	1	-0/+1
\| \| \| \| \| \| \| \|	For some reason, this is reported only with musl, and older glibc versions (2.31, at least). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Duplicate existing recvfrom() valgrind suppression for recv()	Stefano Brivio	2024-08-21	1	-0/+9
\| \| \| \| \| \| \| \| \|	Some architectures, including i686, actually have a recv() system call, not just a recvfrom(), and we need to cover the recv() with MSG_TRUNC into a NULL buffer for them as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/passt.mbuto: Install sshd-session OpenSSH's split process	Stefano Brivio	2024-08-21	1	-1/+2
\| \| \| \| \| \| \| \| \| \|	OpenSSH now ships a per-session binary, sshd-session, with sshd acting as mere listener. It's typically not found in $PATH, so specify the whole path at which it's commonly installed in $PROGS. Link: https://www.openssh.com/releasenotes.html#9.8p1 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/passt.mbuto: Run sshd from vsock proxy with absolute path	Stefano Brivio	2024-08-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	...OpenSSH >= 9.8 otherwise complains that: sshd requires execution with an absolute path Link: https://bugs.gentoo.org/936041 Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078429 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/lib/setup: Transform i686 kernel architecture name into QEMU name (i386)	Stefano Brivio	2024-08-21	1	-4/+6
\| \| \| \| \| \| \| \|	It's qemu-system-i386, but uname -m reports i686. I didn't test i486 and i586. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Allow additional system calls for i386/i686	Stefano Brivio	2024-08-21	8	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I haven't tested i386 for a long time (after playing with some openSUSE i586 image a couple of years ago). It turns out that a number of system calls we actually need were denied by the seccomp filter, and not even basic functionality works. Add some system calls that glibc started using with the 64-bit time ("t64") transition, see also: https://wiki.debian.org/ReleaseGoals/64bit-time that is: clock_gettime64, timerfd_gettime64, fcntl64, and recvmmsg_time64. Add further system calls that are needed regardless of time_t width, that is, mmap2 (valgrind profile only), _llseek and sigreturn (common outside x86_64), and socketcall (same as s390x). I validated this against an almost full run of the test suite, with just a few selected tests skipped. Fixes needed to run most tests on i386/i686, and other assorted fixes for tests, are included in upcoming patches. Reported-by: Uroš Knupleš <uros@knuples.net> Analysed-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	fwd, conf: Allow NAT of the guest's assigned address	David Gibson	2024-08-21	4	-17/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The guest is usually assigned one of the host's IP addresses. That means it can't access the host itself via its usual address. The --map-host-loopback option (enabled by default with the gateway address) allows the guest to contact the host. However, connections forwarded this way appear on the host to have originated from the loopback interface, which isn't always desirable. Add a new --map-guest-addr option, which acts similarly but forwarded connections will go to the host's external address, instead of loopback. If '-a' is used, so the guest's address is not the same as the host's, this will instead forward to whatever host-visible site is shadowed by the guest's assigned address. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Distinguish translatable from untranslatable addresses on inbound	David Gibson	2024-08-21	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \|	fwd_nat_from_host() needs to adjust the source address for new flows coming from an address which is not accessible to the guest. Currently we always use our_tap_addr or our_tap_ll. However in cases where the address is accessible to the guest via translation (i.e. via --map-host-loopback) then it makes more sense to use that translation, rather than the fallback mapping of our_tap_*. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Allow address remapped to host to be configured	David Gibson	2024-08-21	11	-95/+237
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because the host and guest share the same IP address with passt/pasta, it's not possible for the guest to directly address the host. Therefore we allow packets from the guest going to a special "NAT to host" address to be redirected to the host, appearing there as though they have both source and destination address of loopback. Currently that special address is always the address of the default gateway (or none). That can be a problem if we want that gateway to be addressable by the guest. Therefore, allow the special "NAT to host" address to be overridden on the command line with a new --map-host-loopback option. In order to exercise and test it, update the passt_in_ns and perf tests to use this option and give different mapping addresses for the two layers of the environment. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Reconfigure IPv6 address after changing MTU	David Gibson	2024-08-21	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the TCP throughput tests, we adjust the guest's MTU in order to test various packet sizes. Some of those are below 1280 which causes IPv6 to be deconfigured on the guest interface. When we increase it above 1280 again, IPv6 is re-enabled and we get an address in the right prefix with NDP, but we don't get exactly the expected address back - that's only communicated with --config-net or DHCPv6. With changes to how we handle NAT this can cause some of the IPv6 tests to fail, because they don't use the address that passt/pasta expects, and the guest doesn't initiate any traffic which allows us to learn what the new address is. Work around this by re-invoking dhclient -6 between adjusting the MTU and running IPv6 test cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, fwd: Split notion of gateway/router from guest-visible host address	David Gibson	2024-08-21	5	-42/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default gateway. We use this for two quite distinct things: advertising the gateway that the guest should use (via DHCP, NDP and/or --config-net) and for a limited form of NAT. So that the guest can access services on the host, we map the gateway address within the guest to the loopback address on the host. Using the gateway address for this isn't necessarily the best choice for this purpose, certainly not for all circumstances. So, start off by splitting the notion of these into two different values: @guest_gw which is the gateway address the guest should use and @nat_host_loopback, which is the guest visible address to remap to the host's loopback. Usually nat_host_loopback will have the same value as guest_gw. However when --no-map-gw is specified we leave them unspecified instead. This means when we use nat_host_loopback, we don't need to separately check c->no_map_gw to see if it's relevant. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Don't take "our" MAC address from the host	David Gibson	2024-08-21	3	-35/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When sending frames to the guest over the tap link, we need a source MAC address. Currently we take that from the MAC address of the main interface on the host, but that doesn't actually make much sense: * We can't preserve the real MAC address of packets from anywhere external so there's no transparency case here * In fact, it's confusingly different from how we handle IP addresses: whereas we give the guest the same IP as the host, we're making the host's MAC the one MAC that the guest can't use for itself. * We already need a fallback case if the host doesn't have an Ethernet like MAC (e.g. if it's connected via a point to point interface, such as a wireguard VPN). Change to just just use an arbitrary fixed MAC address - I've picked 9a:55:9a:55:9a:55. It's simpler and has the small advantage of making the fact that passt/pasta is in use typically obvious from guest side packet dumps. This can still, of course, be overridden with the -M option. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Split notion of "our tap address" from gateway for IPv4	David Gibson	2024-08-21	3	-7/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ip4.gw conflates 3 conceptually different things, which (for now) have the same value: 1. The router/gateway address as seen by the guest 2. An address to NAT to the host with --no-map-gw isn't specified 3. An address to use as source when nothing else makes sense Case 3 occurs in two situations: a) for our DHCP responses - since they come from passt internally there's no naturally meaningful address for them to come from b) for forwarded connections coming from an address that isn't guest accessible (localhost or the guest's own address). (b) occurs even with --no-map-gw, and the expected behaviour of forwarding local connections requires it. For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the same value as ip6.gw). For future flexibility we may want to make this "address of last resort" different from the gateway address, so split them logically for IPv4 as well. Specifically, add a new ip4.our_tap_addr field for the address with this role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always get a link-local address, we might not be able to get a (non 0.0.0.0) address here (e.g. if the host is disconnected or only has a point to point link with no gateway address). In that case we have to disable forwarding of inbound connections with guest-inaccessible source addresses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Helpers to clarify what host addresses aren't guest accessible	David Gibson	2024-08-21	1	-11/+87
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We usually avoid NAT, but in a few cases we need to apply address translations. For inbound connections that happens for addresses which make sense to the host but are either inaccessible, or mean a different location from the guest's point of view. Add some helper functions to determine such addresses, and use them in fwd_nat_from_host(). In doing so clarify some of the reasons for the logic. We'll also have further use for these helpers in future. While we're there fix one unneccessary inconsistency between IPv4 and IPv6. We always translated the guest's observed address, but for IPv4 we didn't translate the guest's assigned address, whereas for IPv6 we did. Change this to translate both in all cases for consistency. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Initialise our_tap_ll to ip6.gw when suitable	David Gibson	2024-08-21	4	-12/+6
\| \| \| \| \| \| \| \| \|	In every place we use our_tap_ll, we only use it as a fallback if the IPv6 gateway address is not link-local. We can avoid that conditional at use time by doing it at initialisation of our_tap_ll instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Clarify which addresses in ip[46]_ctx are meaningful where	David Gibson	2024-08-21	1	-4/+10
\| \| \| \| \| \| \| \| \|	Some are guest visible addresses and may not be valid on the host, others are host visible addresses and may not be valid on the guest. Rearrange and comment the ip[46]_ctx definitions to make it clearer which is which. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Change misleading 'addr_ll' name	David Gibson	2024-08-21	5	-8/+9
\| \| \| \| \| \| \| \| \|	c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the guest, but the former is an address for our use on the tap link. Rename it accordingly, to 'our_tap_ll'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Correct sock_l4() binding for link local addresses	David Gibson	2024-08-21	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When binding an IPv6 socket in sock_l4() we need to supply a scope id if the address is link-local. We check for this by comparing the given address to c->ip6.addr_ll. This is correct only by accident: while c->ip6.addr_ll is typically set to the host interface's link local address, the actual purpose of it is to provide a link local address for passt's private use on the tap interface. Instead set the scope id for any link-local address we're binding to. We're going to need something and this is what makes sense for sockets on the host. It doesn't make sense for PIF_SPLICE sockets, but those should always have loopback, not link-local addresses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Remove incorrect initialisation of addr_ll_seen	David Gibson	2024-08-21	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Despite the names, addr_ll_seen does not relate to addr_ll the same way addr_seen relates to addr. addr_ll_seen is an observed address from the guest, whereas addr_ll is our link-local address for use on the tap link when we can't use an external endpoint address. It's used both for passt provided services (DHCPv6, NDP) and in some cases for connections from addresses the guest can't access. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Treat --dns addresses as guest visible addresses	David Gibson	2024-08-21	2	-50/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Although it's not 100% explicit in the man page, addresses given to the --dns option are intended to be addresses as seen by the guest. This differs from addresses taken from the host's /etc/resolv.conf, which must be translated to guest accessible versions in some cases. Our implementation is currently inconsistent on this: when using --dns-forward, you must usually also give --dns with the matching address, which is meaningful only in the guest's address view. However if you give --dns with a loopback addres, it will be translated like a host view address. Move the remapping logic for DNS addresses out of add_dns4() and add_dns6() into add_dns_resolv() so that it is only applied for host nameserver addresses, not for nameservers given explicitly with --dns. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>