passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	udp: Don't attempt to get dual-stack sockets in nonsensical cases	David Gibson	2024-09-25	1	-12/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To save some kernel memory we try to use "dual stack" sockets (that listen to both IPv4 and IPv6 traffic) when possible. However udp_sock_init() attempts to do this in some cases that can't work. Specifically we can only do this when listening on any address. That's never true for the ns (splicing) case, because we always listen on loopback. For the !ns case and AF_UNSPEC case, addr should always be NULL, but add an assert to verify. This is harmless: if addr is non-NULL, sock_l4() will just fail and we'll fall back to the other path. But, it's messy and makes some upcoming changes harder, so avoid attempting this in cases we know can't work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Allow checksum to be disabled	Laurent Vivier	2024-09-18	3	-25/+38
\| \| \| \| \| \| \| \| \|	In some cases we don't need to set the TCP checksum. Add a parameter to tcp_fill_headers4() and tcp_fill_headers6() to disable it. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Allow checksum to be disabled	Laurent Vivier	2024-09-18	1	-18/+40
\| \| \| \| \| \| \| \|	In some cases we don't need to set the UDP checksum. Add a parameter to udp_update_hdr4() and udp_update_hdr6() to disable it. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Remove possible quadratic behaviour from write_remainder()	David Gibson	2024-09-18	1	-10/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	write_remainder() steps through the buffers in an IO vector writing out everything past a certain byte offset. However, on each iteration it rescans the buffer from the beginning to find out where we're up to. With an unfortunate set of write sizes this could lead to quadratic behaviour. In an even less likely set of circumstances (total vector length > maximum size_t) the 'skip' variable could overflow. This is one factor in a longstanding Coverity error we've seen (although I still can't figure out the remainder of its complaint). Rework write_remainder() to always work out our new position in the vector relative to our old/current position, rather than starting from the beginning each time. As a bonus this seems to fix the Coverity error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
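The idea of tracking the position relative to where we already are, rather than rescanning from the start, can be sketched as follows. This is only an illustration: iov_advance_sketch() and its parameters are hypothetical names, not the code from this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical sketch: advance a position (*i = buffer index, *off = offset
 * within that buffer) by n bytes through iov[0..cnt), continuing from the
 * current position instead of recounting from the first buffer.
 */
static void iov_advance_sketch(const struct iovec *iov, size_t cnt,
			       size_t *i, size_t *off, size_t n)
{
	assert(iov && i && off && *i <= cnt);

	while (n && *i < cnt) {
		size_t left = iov[*i].iov_len - *off;

		if (n < left) {
			*off += n;	/* still inside the same buffer */
			return;
		}

		n -= left;		/* consumed the rest of this buffer */
		(*i)++;
		*off = 0;
	}
}
```

Each call does work proportional to the number of buffers actually crossed, so a long sequence of short writes no longer costs a full rescan every time.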
*	util: Add helper to write() all of a buffer	David Gibson	2024-09-18	3	-2/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	write(2) might not write all the data it is given. Add a write_all_buf() helper to keep calling it until all the given data is written, or we get an error. Currently we use write_remainder() to do this operation in pcap_frame(). That's a little awkward since it requires constructing an iovec, and future changes we want to make to write_remainder() will be easier in terms of this single buffer helper. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
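A helper of this kind typically looks like the sketch below. The name write_all_buf_sketch() and the choice to retry on EINTR are assumptions made for illustration, not necessarily what the commit does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: keep calling write() until all of buf is written.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int write_all_buf_sketch(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	assert(buf || !len);

	while (len) {
		ssize_t rc = write(fd, p, len);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: try again */
			return -1;
		}

		p += rc;			/* short write: advance */
		len -= (size_t)rc;
	}

	return 0;
}
```

A caller with a single contiguous buffer can use this directly, without building an iovec first.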
*	tcp: Make tcp_update_seqack_wnd()'s force_seq parameter explicitly boolean	David Gibson	2024-09-18	3	-5/+5
\| \| \| \| \| \| \| \|	This parameter is already treated as a boolean internally. Make it a 'bool' type for clarity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Simplify ifdef logic in tcp_update_seqack_wnd()	David Gibson	2024-09-18	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \|	This function has a block conditional on !snd_wnd_cap shortly before an snd_wnd_cap is statically false). Therefore, simplify this down to a single conditional with an else branch. While we're there, fix some improperly indented closing braces. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Clean up tcpi_snd_wnd probing	David Gibson	2024-09-18	5	-44/+82
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When available, we want to retrieve our socket peer's advertised window and forward that to the guest. That information has been available from the kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035. Currently our probing for this is a bit odd. The HAS_SND_WND define determines if our headers include the tcpi_snd_wnd field, but that doesn't necessarily mean the running kernel supports it. Currently we start by assuming it's _not_ available, but mark it as available if we ever see a non-zero value in the field. This is a bit hit and miss in two ways: * Zero is a perfectly possible window the peer could report, so we can get false negatives * We're reading TCP_INFO into a local variable, which might not be zero initialised, so if the kernel _doesn't_ write it it could have non-zero garbage, giving us false positives. We can use a more direct way of probing for this: getsockopt() reports the length of the information retrieved. So, check whether that's long enough to include the field. This lets us probe the availability of the field once and for all during initialisation. That in turn allows ctx to become a const pointer to tcp_prepare_flags() which cascades through many other functions. We also move the flag for the probe result from the ctx structure to a global, to match peek_offset_cap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
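The length-based probe can be illustrated with a small sketch. To stay independent of which kernel headers are installed, it uses a hypothetical stand-in structure (struct info_sketch) rather than the real struct tcp_info. All names here are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical stand-in for a kernel info structure that grew over time */
struct info_sketch {
	uint32_t old_field;	/* written by every kernel version */
	uint32_t newer_field;	/* only written by newer kernels */
};

/* True if a getsockopt() call that reported 'len' bytes filled in the whole
 * field located at [off, off + size) in the structure.
 */
static bool field_covered_sketch(socklen_t len, size_t off, size_t size)
{
	assert(size);

	return (size_t)len >= off + size;
}

/* Convenience wrapper: is 'field' of struct info_sketch covered by 'len'? */
#define INFO_HAS(len, field)						\
	field_covered_sketch((len),					\
			     offsetof(struct info_sketch, field),	\
			     sizeof(((struct info_sketch *)0)->field))
```

With a real socket, 'len' would be the value getsockopt() stores back through its length argument. Unlike testing for a non-zero value, this gives the same answer whatever the field happens to contain.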
*	tcp: Make some extra functions private	David Gibson	2024-09-18	1	-2/+2
\| \| \| \| \| \| \| \|	tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c, and have no prototype in a header. Make them static. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Avoid overlapping memcpy() in DUP_ACK handling	David Gibson	2024-09-12	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \|	When handling the DUP_ACK flag, we copy all the buffers making up the ack frame. However, all our frames share the same buffer for the Ethernet header (tcp4_eth_src or tcp6_eth_src), so copying the TCP_IOV_ETH will result in a (perfectly) overlapping memcpy(). This seems to have been harmless so far, but overlapping ranges to memcpy() is undefined behaviour, so we really should avoid it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove redundant initialisation of iov[TCP_IOV_ETH].iov_base	David Gibson	2024-09-12	1	-1/+0
\| \| \| \| \| \| \| \|	This initialisation for IPv4 flags buffers is redundant with the very next line which sets both iov_base and iov_len. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: Allow read access to /proc/sys/net/ipv4/ip_local_port_range (tag: 2024_09_06.6b38f07)	Stefano Brivio	2024-09-06	1	-0/+2
\| \| \| \| \| \| \|	...for both passt and pasta: use passt's abstraction for this. Fixes: eedc81b6ef55 ("fwd, conf: Probe host's ephemeral ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Allow read access to /proc/sys/net/ipv4/ip_local_port_range	Stefano Brivio	2024-09-06	2	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since commit eedc81b6ef55 ("fwd, conf: Probe host's ephemeral ports"), we might need to read from /proc/sys/net/ipv4/ip_local_port_range in both passt and pasta. While pasta was already allowed to open and write /proc/sys/net entries, read access was missing in SELinux's type enforcement: add that. In passt, instead, this is the first time we need to access an entry there: add everything we need. Fixes: eedc81b6ef55 ("fwd, conf: Probe host's ephemeral ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Don't risk truncating frames on full buffer in tap_pasta_input()	David Gibson	2024-09-06	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case: when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder. Adjust the code to make sure we always have room for a maximum size frame. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
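The adjusted loop condition can be sketched like this. read_frames_sketch() and its parameters are hypothetical, and the real function does more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: read into buf[] only while a whole frame of max_frame
 * bytes is guaranteed to fit, so read() is never handed a short buffer.
 * Returns the number of bytes stored.
 */
static size_t read_frames_sketch(int fd, char *buf, size_t buf_size,
				 size_t max_frame)
{
	size_t n = 0;

	assert(buf && max_frame);

	while (n + max_frame <= buf_size) {
		ssize_t len = read(fd, buf + n, max_frame);

		if (len <= 0)
			break;	/* error, EOF, or nothing left to read */

		n += (size_t)len;
	}

	return n;
}
```

The cost is that up to max_frame - 1 bytes at the end of the buffer can go unused, which seems a fair trade for never truncating a frame.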
*	tap: Restructure in tap_pasta_input()	David Gibson	2024-09-06	1	-26/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue. The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop. Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Improve handling of EINTR in tap_passt_input()	David Gibson	2024-09-06	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewed EPOLLIN event immediately in that case, since we're using level triggered signalling. So, handle EINTR separately by immediately retrying until we succeed or get a different type of error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
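The usual shape of an immediate retry on EINTR is shown below. recv_retry_sketch() is a hypothetical helper, not the function changed by this commit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical sketch: call recv() again right away if it was interrupted by
 * a signal. Any other result, including EAGAIN or EWOULDBLOCK, is returned to
 * the caller, which can then wait for the next EPOLLIN event.
 */
static ssize_t recv_retry_sketch(int fd, void *buf, size_t len, int flags)
{
	ssize_t n;

	assert(buf);

	do
		n = recv(fd, buf, len, flags);
	while (n < 0 && errno == EINTR);

	return n;
}
```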
*	tap: Split out handling of EPOLLIN events	David Gibson	2024-09-06	1	-14/+36
\| \| \| \| \| \| \| \| \| \|	Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Fix order of operands and carry of one second in timespec_diff_us()	Stefano Brivio	2024-09-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the nanoseconds of the minuend timestamp are less than the nanoseconds of the subtrahend timestamp, we need to carry one second in the subtraction. I subtracted this second from the minuend, but didn't actually carry it in the subtraction of nanoseconds, and logged timestamps would jump back whenever we switched to the first branch of timespec_diff_us() from the second one. Most likely, the reason why I didn't carry the second is that I instinctively thought that swapping the operands would have the same effect. But it doesn't, in general: that only happens with arithmetic in modulo powers of 2. Undo the swap as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
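A subtraction that moves one second into the nanoseconds term, as described above, can be sketched as follows. This is an illustration written from the description, assuming the first timestamp is not earlier than the second. It is not the patched function itself.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical sketch: a - b in microseconds, assuming a >= b */
static int64_t timespec_diff_us_sketch(const struct timespec *a,
				       const struct timespec *b)
{
	assert(a && b);

	if (a->tv_nsec < b->tv_nsec) {
		/* Take one second away from the seconds difference and add
		 * it, as 10^9 ns, to the nanoseconds of the minuend.
		 */
		return (a->tv_nsec + 1000000000LL - b->tv_nsec) / 1000 +
		       ((int64_t)a->tv_sec - b->tv_sec - 1) * 1000000;
	}

	return (a->tv_nsec - b->tv_nsec) / 1000 +
	       ((int64_t)a->tv_sec - b->tv_sec) * 1000000;
}
```

For example, with a = 2.1 s and b = 1.9 s the first branch gives (100000000 + 1000000000 - 900000000) / 1000 = 200000 µs, with no contribution from the seconds term.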
*	cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings	David Gibson	2024-09-06	2	-7/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cppcheck-2.15.0 has apparently broadened when it throws a warning about redundant initialization to include some cases where we have an initializer for some fields, but then set other fields in the function body. This is arguably a false positive: although we are technically overwriting the zero-initialization the compiler supplies for fields not explicitly initialized, this sort of construct makes sense when there are some fields we know at the top of the function where the initializer is, but others that require more complex calculation. That said, in the two places this shows up, it's pretty easy to work around. The results are arguably slightly clearer than what we had, since they move the parts of the initialization closer together. So do that rather than having ugly suppressions or dealing with the tedious process of reporting a cppcheck false positive. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Use EPOLLET for any state of not established connections	Stefano Brivio	2024-09-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, for not established connections, we monitor sockets with edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state (outbound connection being established) but not in the TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK to the container/guest). While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted another possibility for a short EPOLLRDHUP storm (10 seconds), which doesn't seem to happen in actual use cases, but I could reproduce it: start a connection from a container, while dropping (using netfilter) ACK segments coming out of the container itself. On the server side, outside the container, accept the connection and shutdown the writing side of it immediately. At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't have any reasonable way to handle it other than waiting for the tap side to complete the three-way handshake. So we'll just keep getting this EPOLLRDHUP until the SYN_TIMEOUT kicks in. Always enable EPOLLET when EPOLLRDHUP is the only epoll event we subscribe to: in this case, getting multiple EPOLLRDHUP reports is totally useless. In the only remaining non-established state, SOCK_ACCEPTED, for inbound connections, we're anyway discarding EPOLLRDHUP events until we established the connection, because we don't know what to do with them until we get an answer from the tap side, so it's safe to enable EPOLLET also in that case. Link: https://bugs.passt.top/show_bug.cgi?id=94 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
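The rule "use edge-triggered mode whenever EPOLLRDHUP is the only event of interest" can be written as a small helper. conn_events_sketch() is a hypothetical name, and the real code derives the event mask from connection state rather than from a ready-made mask.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical sketch: given the epoll events we want for a socket, add
 * EPOLLET if EPOLLRDHUP is the only one, so a closed peer is reported once
 * instead of on every epoll_wait() call.
 */
static uint32_t conn_events_sketch(uint32_t wanted)
{
	assert(!(wanted & EPOLLET));

	if (wanted == EPOLLRDHUP)
		return wanted | EPOLLET;

	return wanted;
}
```

The result would go into the events field of the struct epoll_event passed to epoll_ctl().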
*	udp: Handle more error conditions in udp_sock_errs()	David Gibson	2024-09-06	1	-1/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_errs() reads out everything in the socket error queue. However we've seen some cases[0] where an EPOLLERR event is active, but there isn't anything in the queue. One possibility is that the error is reported instead by the SO_ERROR sockopt. Check for that case and report it as best we can. If we still get an EPOLLERR without visible error, we have no way to clear the error state, so treat it as an unrecoverable error. [0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010 Link: https://bugs.passt.top/show_bug.cgi?id=95 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
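Reading a pending error through the SO_ERROR socket option looks roughly like this. sock_pending_error_sketch() is a hypothetical helper, and how the result is reported or acted upon is left out.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical sketch: fetch (and thereby clear) the pending error on socket
 * s. Returns a positive errno value if an error was pending, 0 if none, and
 * -1 if getsockopt() itself failed.
 */
static int sock_pending_error_sketch(int s)
{
	int err = 0;
	socklen_t len = sizeof(err);

	assert(s >= 0);

	if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		return -1;

	return err;
}
```

A return value of 0 after an EPOLLERR event, with nothing in the error queue either, corresponds to the "no visible error" case the commit treats as unrecoverable.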
*	udp: Treat errors getting errors as unrecoverable	David Gibson	2024-09-06	1	-10/+17
\| \| \| \| \| \| \| \| \| \|	We can get network errors, usually transient, reported via the socket error queue. However, at least theoretically, we could get errors trying to read the queue itself. Since we have no idea how to clear an error condition in that case, treat it as unrecoverable. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split socket error handling out from udp_sock_recv()	David Gibson	2024-09-06	1	-6/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently udp_sock_recv() both attempts to clear socket errors and read a batch of datagrams for forwarding. That made sense initially, since both listening and reply sockets need to do this. However, we have certain error cases which will add additional complexity to the error processing. Furthermore, if we ever wanted to more thoroughly handle errors received here - e.g. by synthesising ICMP messages on the tap device - it will likely require different handling for the listening and reply socket cases. So, split handling of error events into its own udp_sock_errs() function. While we're there, allow it to report "unrecoverable errors". We don't have any of these so far, but some cases we're working on might require it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Helpers to log details of a flow	David Gibson	2024-09-06	2	-17/+38
\| \| \| \| \| \| \| \| \| \|	The details of a flow - endpoints, interfaces etc. - can be pretty important for debugging. We log this on flow state transitions, but it can also be useful to log this when we report specific conditions. Add some helper functions and macros to make it easy to do that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Allow UDP flows to be prematurely closed	David Gibson	2024-09-06	3	-2/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unlike TCP, UDP has no in-band signalling for the end of a flow. So the only way we remove flows is on a timer if they have no activity for 180s. However, we've started to investigate some error conditions in which we want to prematurely abort / abandon a UDP flow. We can call udp_flow_close(), which will make the flow inert (sockets closed, no epoll events, can't be looked up in hash). However it will still wait 3 minutes to clear away the stale entry. Clean this up by adding an explicit 'closed' flag which will cause a flow to be more promptly cleaned up. We also publish udp_flow_close() so it can be called from other places to abort UDP flows. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Fix incorrect hash probe in flowside_lookup()	David Gibson	2024-09-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Our flow hash table uses linear probing in which we step backwards through clusters of adjacent hash entries when we have near collisions. Usually that's implemented by flow_hash_probe(). However, due to some details we need a second implementation in flowside_lookup(). An embarrassing oversight in rebasing from earlier versions has meant that version is incorrect, trying to step forward through clusters rather than backward. In situations with the right sorts of hash near-collisions this can lead to us not associating an ACK from the tap device with the right flow, leaving it in a not-quite-established state. If the remote peer does a shutdown() at the right time, this can lead to a storm of EPOLLRDHUP events causing high CPU load. Fixes: acca4235c46f ("flow, tcp: Generalise TCP hash table to general flow hash table") Link: https://bugs.passt.top/show_bug.cgi?id=94 Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
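The sketch below shows why insertion and lookup must probe in the same direction. It is a toy table with hypothetical names (TABLE_SIZE, struct slot, table_insert(), table_lookup()) and a trivial hash, not passt's flow table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 8	/* hypothetical, tiny on purpose */

struct slot {
	bool used;
	uint32_t key;
};

/* Step one bucket backwards, wrapping around at the start of the table */
static size_t probe_prev(size_t b)
{
	return b ? b - 1 : TABLE_SIZE - 1;
}

static size_t home_bucket(uint32_t key)
{
	return key % TABLE_SIZE;
}

/* Insert by walking backwards from the home bucket to the first free slot */
static bool table_insert(struct slot *t, uint32_t key)
{
	size_t b = home_bucket(key), n;

	assert(t);

	for (n = 0; n < TABLE_SIZE; n++, b = probe_prev(b)) {
		if (!t[b].used) {
			t[b].used = true;
			t[b].key = key;
			return true;
		}
	}

	return false;	/* table full */
}

/* Look up by walking in the same (backward) direction as insertion */
static bool table_lookup(const struct slot *t, uint32_t key)
{
	size_t b = home_bucket(key), n;

	assert(t);

	for (n = 0; n < TABLE_SIZE && t[b].used; n++, b = probe_prev(b)) {
		if (t[b].key == key)
			return true;
	}

	return false;
}
```

If table_lookup() stepped forward instead, a key that collided on insertion and landed in an earlier bucket would be missed as soon as the next bucket was empty or held something else.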
*	log: Don't prefix log file messages with time and severity if they're continuations	Stefano Brivio	2024-09-06	1	-5/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In fecb1b65b1ac ("log: Don't prefix message with timestamp on --debug if it's a continuation"), I fixed this for --debug on standard error, but not for log files: if messages are continuations, they shouldn't be prefixed by timestamp and severity. Otherwise, we'll print stuff like this: 0.0028: ERROR: Receive error on guest connection, reset0.0028: ERROR: : Bad file descriptor Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Makefile: Enable _FORTIFY_SOURCE iff needed	Michal Privoznik	2024-08-29	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On some systems source fortification is enabled whenever code optimization is enabled (e.g. with -O2). Since code fortification is explicitly enabled too (with possibly different value than the system wants, there are three levels [1]), distros are required to patch our Makefile, e.g. [2]. Detect whether fortification is not already enabled and enable it explicitly only if really needed. 1: https://www.gnu.org/software/libc/manual/html_node/Source-Fortification.html 2: https://github.com/gentoo/gentoo/commit/edfeb8763ac56112c59248c62c9cda13e5d01c97 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd, conf: Probe host's ephemeral ports	David Gibson	2024-08-29	3	-2/+61
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we forward "all" ports (-t all or -u all), or use an exclude-only range, we don't actually forward all ports - that wouldn't leave local ports to use for outgoing connections. Rather we forward all non-ephemeral ports - those that won't be used for outgoing connections or datagrams. Currently we assume the range of ephemeral ports is that recommended by RFC 6335, 49152-65535. However, that's not the range used by default on Linux, which is 32768-60999 and configurable with the net.ipv4.ip_local_port_range sysctl. We can't really know what range the guest will consider ephemeral, but if it differs too much from the host it's likely to cause problems we can't avoid anyway. So, using the host's ephemeral range is a better guess than using the RFC 6335 range. Therefore, add logic to probe the host's ephemeral range, falling back to the RFC 6335 range if that fails. This has the bonus advantage of reducing the number of ports bound by -t all -u all on most Linux machines thereby reducing kernel memory usage. Specifically this reduces kernel memory usage with -t all -u all from ~380MiB to ~289MiB. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
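Probing with a fallback can be sketched as below. The file path and the two ranges come from the message above. The structure and function names (struct port_range, parse_local_port_range(), probe_ephemeral_sketch()) and the validation rules are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define RFC6335_EPHEMERAL_MIN	49152
#define RFC6335_EPHEMERAL_MAX	65535

struct port_range {
	unsigned min, max;
};

/* Hypothetical sketch: parse "min max" as found in ip_local_port_range,
 * returning the RFC 6335 range if the string can't be used.
 */
static struct port_range parse_local_port_range(const char *s)
{
	struct port_range r = { RFC6335_EPHEMERAL_MIN, RFC6335_EPHEMERAL_MAX };
	unsigned long min, max;

	if (s && sscanf(s, "%lu %lu", &min, &max) == 2 &&
	    min <= max && max <= 65535) {
		r.min = (unsigned)min;
		r.max = (unsigned)max;
	}

	return r;
}

/* Read the host's setting, falling back if the file can't be read */
static struct port_range probe_ephemeral_sketch(void)
{
	char buf[64] = "";
	FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");

	if (f) {
		if (!fgets(buf, sizeof(buf), f))
			buf[0] = '\0';
		fclose(f);
	}

	return parse_local_port_range(buf);
}
```

On a Linux host with default settings the file contains "32768 60999", so ports 61000-65535 become forwardable while 32768-49151 no longer are.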
*	conf, fwd: Don't attempt to forward port 0	David Gibson	2024-08-29	1	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When using -t all, -u all or exclude-only ranges, we'll attempt to forward all non-ephemeral port numbers, including port 0. However, this won't work as intended: bind() treats a zero port not as literal port 0, but as "pick a port for me". Because of the special meaning of port 0, we mostly outright exclude it in our handling. Do the same for setting up forwards, not attempting to forward for port 0. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, fwd: Make ephemeral port logic more flexible	David Gibson	2024-08-29	4	-7/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"Ephemeral" ports are those which the kernel may allocate as local port numbers for outgoing connections or datagrams. Because of that, they're generally not good choices for listening servers to bind to. Therefore when using -t all, -u all or exclude-only ranges, we map only non-ephemeral ports. Our logic for this is a bit rigid though: we assume the ephemeral ports are always a fixed range at the top of the port number space. We also assume PORT_EPHEMERAL_MIN is a multiple of 8, or we won't set the forward bitmap correctly. Make the logic in conf.c more flexible, using a helper moved into fwd.[ch], although we don't change which ports we consider ephemeral (yet). The new handling is undoubtedly more computationally expensive, but since it's a once-off operation at startup, I don't think it really matters. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
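A per-port check removes both assumptions, that the ephemeral range sits at the top of the port space and that its start is a multiple of 8. It also makes it easy to leave out port 0, as the later "Don't attempt to forward port 0" change does. The sketch below uses hypothetical names (port_is_ephemeral_sketch(), map_all_sketch(), map_isset_sketch()) and takes the range as parameters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_PORTS	65536

/* Hypothetical helper: is 'port' inside the ephemeral range? */
static bool port_is_ephemeral_sketch(unsigned port, unsigned eph_min,
				     unsigned eph_max)
{
	return port >= eph_min && port <= eph_max;
}

/* Hypothetical sketch: set one bit per forwardable port in map[], which must
 * hold NUM_PORTS / 8 bytes. Port 0 and ephemeral ports are left unset.
 */
static void map_all_sketch(uint8_t *map, unsigned eph_min, unsigned eph_max)
{
	unsigned port;

	assert(map && eph_min <= eph_max);

	memset(map, 0, NUM_PORTS / 8);

	for (port = 1; port < NUM_PORTS; port++) {	/* 1: skip port 0 */
		if (port_is_ephemeral_sketch(port, eph_min, eph_max))
			continue;

		map[port / 8] |= (uint8_t)(1u << (port % 8));
	}
}

static bool map_isset_sketch(const uint8_t *map, unsigned port)
{
	return map[port / 8] & (1u << (port % 8));
}
```

Looping over all 65536 ports once is slower than a single memset() over a byte-aligned range, but it only happens at startup.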
*	seccomp.sh: Try to account for terminal width while formatting list of system calls	Stefano Brivio	2024-08-27	1	-1/+4
\| \| \| \| \| \| \| \| \| \|	Avoid excess lines on wide terminals, but make sure we don't fail if we can't fetch the number of columns for any reason, as it's not a fundamental feature and we don't want to break anything with it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use dual stack sockets for port forwarding when possible	David Gibson	2024-08-27	1	-0/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as well as native IPv6 connections. By doing this we halve the number of listening sockets we need (assuming passt/pasta is listening on the same ports for IPv4 and IPv6). When forwarding many ports (e.g. -u all) this can significantly reduce the amount of kernel memory that passt consumes. We've used such dual stack sockets for TCP since 8e914238b "tcp: Use dual stack sockets for port forwarding when possible". Add similar support for UDP "listening" sockets. Since UDP sockets don't use as much kernel memory as TCP sockets this isn't as big a saving, but it's still significant. When forwarding all TCP and UDP ports for both IPv4 & IPv6 (-t all -u all), this reduces kernel memory usage from ~522 MiB to ~380MiB (kernel version 6.10.6 on Fedora 40, x86_64). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove unnecessary local from udp_sock_init()	David Gibson	2024-08-27	1	-15/+15
\| \| \| \| \| \| \| \|	The 's' variable is always redundant with either 'r4' or 'r6', so remove it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Merge udp[46]_mh_recv arrays	David Gibson	2024-08-27	2	-39/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We've already gotten rid of most of the IPv4/IPv6 specific data structures in udp.c by merging them with each other. One significant one remains: udp[46]_mh_recv. This was a bit awkward to remove because of a subtle interaction. We initialise the msg_namelen fields to represent the total size we have for a socket address, but when we receive into the arrays those are modified to the actual length of the sockaddr we received. That meant that naively merging the arrays meant that if we received IPv4 datagrams, then IPv6 datagrams, the addresses for the latter would be truncated. This patch addresses that by resetting the received msg_namelen as soon as we've found a flow for the datagram. Finding the flow is the only thing that might use the actual sockaddr length, although we in fact don't need it for the time being. This also removes the last use of the 'v6' field from udp_listen_epoll_ref, so remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Look for possible sshd-session paths (if it's there at all) in mbuto's profile	Stefano Brivio	2024-08-27	1	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some distributions already have OpenSSH 9.8, which introduces split sshd/sshd-session binaries, and there we need to copy the binary from the host, which can be /usr/libexec/openssh/sshd-session (Fedora Rawhide), /usr/lib/ssh/sshd-session (Arch Linux), /usr/lib/openssh/sshd-session (Debian), and possibly other paths. Add at least those three, and, if we don't find sshd-session, assume we don't need it: it could very well be an older version of OpenSSH, as reported by David for Fedora 40, or perhaps another daemon (would Dropbear even work? I'm not sure). Reported-by: David Gibson <david@gibson.dropbear.id.au> Fixes: d6817b3930be ("test/passt.mbuto: Install sshd-session OpenSSH's split process") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>
*	README: pasta is indeed a supported back-end for rootless Docker (tag: 2024_08_21.1d6142f)	Stefano Brivio	2024-08-21	1	-1/+3
\| \| \| \| \| \| \|	...https://github.com/moby/moby/issues/48257 just reminded me. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Don't stop on unrelated values when looking for --fd in close_open_files()	Stefano Brivio	2024-08-21	2	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Seen with krun: we get a file descriptor via --fd, but we close it and happily use the same number for TCP files. The issue is that if we also get other options before --fd, with arguments, getopt_long() stops parsing them because it sees them as non-option values. Use the - modifier at the beginning of optstring (before :, which is needed to avoid printing errors) instead of +, which means we'll continue parsing after finding unrelated option values, but getopt_long() won't reorder them anyway: they'll be passed with option value '1', which we can ignore. By the way, we also need to add : after F in the optstring, so that we're able to parse the option when given as short name as well. Now that we change the parsing mode between close_open_files() and conf(), we need to reset optind to 0, not to 1, whenever we call getopt_long() again in conf(), so that the internal initialisation of getopt_long() evaluating GNU extensions is re-triggered. Link: https://github.com/slp/krun/issues/17#issuecomment-2294943828 Fixes: baccfb95ce0e ("conf: Stop parsing options at first non-option argument") Fixes: 09603cab28f9 ("passt, util: Close any open file that the parent might have leaked") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
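The parsing mode described above can be sketched with GNU getopt_long() as follows. find_fd_option_sketch() is a hypothetical function. Only the "-:F:" style of optstring, the value 1 returned for non-option arguments, and the reset of optind to 0 come from the message.

```c
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>

/* Hypothetical sketch: return the value given to --fd or -F in argv, or -1
 * if the option is absent. Other options and their values are skipped.
 */
static int find_fd_option_sketch(int argc, char **argv)
{
	const struct option opts[] = {
		{ "fd",	required_argument,	NULL,	'F' },
		{ NULL,	0,			NULL,	0 },
	};
	int name, fd = -1;

	assert(argc >= 1 && argv);

	optind = 0;	/* make GNU getopt re-read the optstring modifiers */

	/* Leading '-': non-option arguments are returned as option 1, in
	 * order, instead of stopping the scan. The ':' after it suppresses
	 * error messages for options we don't know about here.
	 */
	while ((name = getopt_long(argc, argv, "-:F:", opts, NULL)) != -1) {
		if (name == 'F')
			fd = (int)strtol(optarg, NULL, 10);
		/* 1 (unrelated value) and '?' (unknown option): ignore */
	}

	return fd;
}
```

With "+" instead of "-", the scan would stop at the value following an unrelated option and never reach --fd.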
*	test: Update list of dependencies in README.md	Stefano Brivio	2024-08-21	1	-4/+5
\| \| \| \| \| \| \|	Mostly packages we now need to run Podman-based tests. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, udp: Allow timerfd_gettime64() and recvmmsg_time64() on arm (armhf)	Stefano Brivio	2024-08-21	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	These system calls are needed after the conversion of time_t to 64-bit types on 32-bit architectures. Tested by running some transfer tests with passt and pasta on Debian Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l. Suggested-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Provide own version of close_range(), and no-op fallback	Stefano Brivio	2024-08-21	1	-0/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	musl, as of 1.2.5, and glibc < 2.34 don't ship a (trivial) close_range() implementation. This will probably be added to musl soon, by the way: https://www.openwall.com/lists/musl/2024/08/01/9 Add a weakly-aliased implementation, if it's supported by the kernel. If it's not supported (< 5.9), use a no-op fallback. Looping over 2^31 file descriptors calling close() on them is probably not a good idea. Reported-by: lemmi <lemmi@nerd2nerd.org> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
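One way to provide such a wrapper is sketched below. It uses a separately named function (close_range_sketch(), hypothetical) rather than a weak alias of close_range(), and decides at build time whether the system call number is known. Both are simplifying assumptions.

```c
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical sketch: close file descriptors first..last via the raw system
 * call where the headers define it, otherwise do nothing and report success.
 */
static int close_range_sketch(unsigned first, unsigned last, int flags)
{
	assert(first <= last);

#ifdef SYS_close_range
	return (int)syscall(SYS_close_range, first, last, flags);
#else
	(void)first;
	(void)last;
	(void)flags;

	return 0;	/* no-op fallback */
#endif
}
```

On a kernel without the system call, the first variant returns -1, typically with errno set to ENOSYS, which a caller can treat as "nothing closed".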
*	udp_flow: Add missing unistd.h include for close()	Stefano Brivio	2024-08-21	1	-0/+1
\| \| \| \| \| \| \| \|	For some reason, this is reported only with musl, and older glibc versions (2.31, at least). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Duplicate existing recvfrom() valgrind suppression for recv()	Stefano Brivio	2024-08-21	1	-0/+9
\| \| \| \| \| \| \| \| \|	Some architectures, including i686, actually have a recv() system call, not just a recvfrom(), and we need to cover the recv() with MSG_TRUNC into a NULL buffer for them as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/passt.mbuto: Install sshd-session OpenSSH's split process	Stefano Brivio	2024-08-21	1	-1/+2
\| \| \| \| \| \| \| \| \| \|	OpenSSH now ships a per-session binary, sshd-session, with sshd acting as mere listener. It's typically not found in $PATH, so specify the whole path at which it's commonly installed in $PROGS. Link: https://www.openssh.com/releasenotes.html#9.8p1 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/passt.mbuto: Run sshd from vsock proxy with absolute path	Stefano Brivio	2024-08-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	...OpenSSH >= 9.8 otherwise complains that: sshd requires execution with an absolute path Link: https://bugs.gentoo.org/936041 Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078429 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/lib/setup: Transform i686 kernel architecture name into QEMU name (i386)	Stefano Brivio	2024-08-21	1	-4/+6
\| \| \| \| \| \| \| \|	It's qemu-system-i386, but uname -m reports i686. I didn't test i486 and i586. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Allow additional system calls for i386/i686	Stefano Brivio	2024-08-21	8	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I haven't tested i386 for a long time (after playing with some openSUSE i586 image a couple of years ago). It turns out that a number of system calls we actually need were denied by the seccomp filter, and not even basic functionality works. Add some system calls that glibc started using with the 64-bit time ("t64") transition, see also: https://wiki.debian.org/ReleaseGoals/64bit-time that is: clock_gettime64, timerfd_gettime64, fcntl64, and recvmmsg_time64. Add further system calls that are needed regardless of time_t width, that is, mmap2 (valgrind profile only), _llseek and sigreturn (common outside x86_64), and socketcall (same as s390x). I validated this against an almost full run of the test suite, with just a few selected tests skipped. Fixes needed to run most tests on i386/i686, and other assorted fixes for tests, are included in upcoming patches. Reported-by: Uroš Knupleš <uros@knuples.net> Analysed-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	fwd, conf: Allow NAT of the guest's assigned address	David Gibson	2024-08-21	4	-17/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The guest is usually assigned one of the host's IP addresses. That means it can't access the host itself via its usual address. The --map-host-loopback option (enabled by default with the gateway address) allows the guest to contact the host. However, connections forwarded this way appear on the host to have originated from the loopback interface, which isn't always desirable. Add a new --map-guest-addr option, which acts similarly but forwarded connections will go to the host's external address, instead of loopback. If '-a' is used, so the guest's address is not the same as the host's, this will instead forward to whatever host-visible site is shadowed by the guest's assigned address. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
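The relationship between the two mappings, for outbound IPv4 traffic from the guest, might be pictured like this. It is only a sketch based on the description above: the structure, its fields and nat_outbound_sketch() are hypothetical names, not the ones used in fwd.c, and IPv6 as well as unset mappings are ignored.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical configuration, for illustration only */
struct nat_cfg_sketch {
	struct in_addr map_host_loopback; /* --map-host-loopback address */
	struct in_addr map_guest_addr;    /* --map-guest-addr address */
	struct in_addr guest_addr;        /* address assigned to the guest */
};

/* Return the destination the host side should connect to, given the
 * destination address the guest used.
 */
struct in_addr nat_outbound_sketch(const struct nat_cfg_sketch *c,
				   struct in_addr dst)
{
	if (dst.s_addr == c->map_host_loopback.s_addr) {
		/* Reaches the host via its loopback interface */
		dst.s_addr = htonl(INADDR_LOOPBACK);
	} else if (dst.s_addr == c->map_guest_addr.s_addr) {
		/* Reaches whatever the guest's assigned address refers to
		 * on the host side: normally the host's external address.
		 */
		dst = c->guest_addr;
	}

	return dst;
}
```

Any other destination is returned unchanged, that is, it isn't subject to either mapping.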
*	fwd: Distinguish translatable from untranslatable addresses on inbound	David Gibson	2024-08-21	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \|	fwd_nat_from_host() needs to adjust the source address for new flows coming from an address which is not accessible to the guest. Currently we always use our_tap_addr or our_tap_ll. However in cases where the address is accessible to the guest via translation (i.e. via --map-host-loopback) then it makes more sense to use that translation, rather than the fallback mapping of our_tap_*. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Allow address remapped to host to be configured	David Gibson	2024-08-21	11	-95/+237
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because the host and guest share the same IP address with passt/pasta, it's not possible for the guest to directly address the host. Therefore we allow packets from the guest going to a special "NAT to host" address to be redirected to the host, appearing there as though they have both source and destination address of loopback. Currently that special address is always the address of the default gateway (or none). That can be a problem if we want that gateway to be addressable by the guest. Therefore, allow the special "NAT to host" address to be overridden on the command line with a new --map-host-loopback option. In order to exercise and test it, update the passt_in_ns and perf tests to use this option and give different mapping addresses for the two layers of the environment. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>