passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	pcap: Update pcap_frame() to take an iovec and offset	David Gibson	2024-02-29	1	-17/+12
\| \| \| \| \| \| \| \| \| \| \| \| \|	Update the low-level helper pcap_frame() to take a struct iovec and offset within it, rather than an explicit pointer and length for the frame. This moves the handling of an offset (to skip vnet_len) from pcap_multiple() to pcap_frame(). This doesn't accomplish a great deal immediately, but will make subsequent changes easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	iov: Add helper to find skip over first n bytes of an io vector	David Gibson	2024-02-29	3	-16/+40
\| \| \| \| \| \| \| \| \|	Several of the IOV functions in iov.c, and also tap_send_frames_passt() needs to determine which buffer element a byte offset into an IO vector lies in. Split this out into a helper function iov_skip_bytes(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	iov: add some functions to manage iovec	Laurent Vivier	2024-02-29	3	-4/+207
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce functions to copy to/from a buffer from/to an iovec array, to compute data length in in bytes of an iovec and to copy memory from an iovec to another. iov_from_buf(), iov_to_buf(), iov_size(), iov_copy(). Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-ID: <20240217150725.661467-2-lvivier@redhat.com> [dwg: Small changes to suppress cppcheck warnings] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove unnecessary test for unspecified addr_out	David Gibson	2024-02-29	1	-4/+2
\| \| \| \| \| \| \| \| \| \|	If the configured output address is unspecified, we don't set the bind address to it when creating a new socket in udp_tap_handler(). That sounds sensible, but what we're leaving the bind address as is, exactly, the unspecified address, so this test makes no difference. Remove it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix incorrect usage of IPv6 state in IPv4 path	David Gibson	2024-02-29	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	When forwarding IPv4 packets in udp_tap_handler(), we incorrectly use an IPv6 address test on our IPv4 address (which could cause an out of bounds access), and possibly set our bind interface to the IPv6 interface based on it. Adjust to correctly look at the IPv4 address and IPv4 interface. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Small streamline to udp_update_hdr4()	David Gibson	2024-02-29	1	-8/+9
\| \| \| \| \| \| \| \| \| \|	Streamline the logic here slightly, by introducing a 'src' temporary for brevity. We also transform the logic for setting/clearing PORT_LOOPBACK. This makes udp_update_hdr4() more closely match the corresponding logic from udp_update_udp6(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Set pif in epoll reference for ephemeral host sockets	David Gibson	2024-02-29	1	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	The udp_epoll_ref contains a field for the pif to which the socket belongs. We fill this in for permanent sockets created with udp_sock_init() and for spliced sockets, however, we omit it for ephemeral sockets created for tap originated flows. This is a bug, although we currently get away with it, because we don't consult that field for such flows. Correctly fill it in. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't attempt to translate a 0.0.0.0 source address	David Gibson	2024-02-29	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If an incoming packet has a source address of 0.0.0.0 we translate that to the gateway address. This doesn't really make sense, because we have no way to do a reverse translation for reply packets. Certain UDP protocols do use an unspecified source address in some circumstances (e.g. DHCP). These generally either require no reply, a multicast reply, or provide a suitable reply address by other means. In none of those cases does translating it in passt/pasta make sense. The best we can really do here is just leave it as is. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: If no interface with a default route was found, say it	Stefano Brivio	2024-02-28	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	...instead of implying that by stating that there's no routable interface for a given IP version. There might be interfaces with non-default routes. Suggested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>
*	Makefile: check for cppcheck's --check-level option in cppcheck target	Stefano Brivio	2024-02-28	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Don't run cppcheck to find out if the --check-level=exhaustive option is available, unless we're actually going to run cppcheck later. To avoid this, move this check under the cppcheck target, and implement it in shell script instead of using Makefile directives, because we can't easily implement conditionals in recipes. Reported-by: Rahil Bhimjiani <me@rahil.website> Link: https://bugs.gentoo.org/920795 Fixes: 8640d62af719 ("cppcheck: Use "exhaustive" level checking when available") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: set the log level much earlier	Paul Holzinger	2024-02-27	2	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \|	--quiet is supposed to silence the "No routable interface" message but it does not work because the log level was set long after conf_ip4/6() was called which means it uses the default level which logs everything. To address this move the log level logic directly after the option parsing in conf(). Signed-off-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: make --quiet set the log level to warning	Paul Holzinger	2024-02-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Based on the man page and help output --quiet hides informational messages. This means that warnings should still be logged. This was discussed in[1]. [1] https://archives.passt.top/passt-dev/20240216114304.7234a83f@elisabeth/T/#m42652824644973674e84baf9e0bf1d0e88104450 Signed-off-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't store errnos in socket pool	David Gibson	2024-02-27	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If tcp_sock_refill_pool() gets an error opening new sockets, it stores the negative errno of that error in the socket pool. This isn't especially useful: * It's inconsistent with the initial state of the pool (all -1) * It's inconsistent with the state of an entry that was valid and was then consumed (also -1) * By the time we did anything with this error code, it's now far removed from the situation in which the error occurred, making it difficult to report usefully We now have error reporting closer to when failures happen on the refill paths, so just leave a pool slot we can't fill as -1. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Helpers for getting sockets from the pools	David Gibson	2024-02-27	3	-29/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We maintain pools of ready-to-connect sockets in both the original and (for pasta) guest namespace to reduce latency when starting new TCP connections. If we exhaust those pools we have to take a higher latency path to get a new socket. Currently we open-code that fallback in the places we need it. To improve clarity encapsulate that into helper functions. While we're at it, give those helpers clearer error reporting. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Issue warnings if unable to refill socket pool	David Gibson	2024-02-27	3	-11/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if tcp_sock_refill_pool() is unable to fill all the slots in the pool, it will silently exit. This might lead to a later attempt to get fds from the pool to fail at which point it will be harder to tell what originally went wrong. Instead add warnings if we're unable to refill any of the socket pools when requested. We have tcp_sock_refill_pool() return an error and report it in the callers, because those callers have more context allowing for a more useful message. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Stop on first error when refilling socket pools	David Gibson	2024-02-27	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if we get an error opening a new socket while refilling a socket pool, we carry on to the next slot and try again. This isn't very useful, since by far the most likely cause of an error is some sort of resource exhaustion. Trying again will probably just hit the same error, and maybe even make things worse. So, instead stop on the first error while refilling the pool, making do with however many sockets we managed to open before the error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't stop refilling socket pool if we find a filled entry	David Gibson	2024-02-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp_sock_refill_pool() stops as soon as it finds an entry in the pool with a valid fd. This appears to makes sense: we always use fds from the front of the pool, so if we find a filled one, the rest of the pool should be filled as well. However, that's not quite correct. If a previous refill hit errors trying to open new sockets, it could leave gaps between blocks of valid fds. We're going to add some changes that could make that more likely. So, for robustness, instead skip over the filled entry but still try to refill the rest of the array. We expect simply iterating over the pool to be of small cost compared to even a single system call, so this shouldn't have much impact. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use sa_family_t for address family variables	David Gibson	2024-02-27	11	-19/+20
\| \| \| \| \| \| \| \| \| \|	Sometimes we use sa_family_t for variables and parameters containing a socket address family, other times we use a plain int. Since sa_family_t is what's actually used in struct sockaddr and friends, standardise on that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix 16-bit overflow in udp_invert_portmap()2024_02_20.1e6f92b	David Gibson	2024-02-20	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The code in udp_invert_portmap() is written based on an incorrect understanding of C's (arcane) integer promotion rules. We calculate '(in_port_t)i + delta' expecting the result to be of type in_port_t (16 bits). However "small integer types" (those narrower than 'int') are always promoted to int for expressions, meaning this calculation can overrun the rdelta[] array. Fix this, and use a new intermediate for the index, to make it very clear what it's type is. We also change i to unsigned, to avoid any possible confusion from mixing signed and unsigned types. Link: https://bugs.passt.top/show_bug.cgi?id=80 Reported-by: Laurent Jacquot <jk@lutty.net> Suggested-by: Laurent Jacquot <jk@lutty.net> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Assertion in udp_invert_portmap() can be calculated at compile time	David Gibson	2024-02-20	1	-1/+2
\| \| \| \| \| \| \| \|	All the values in this ASSERT() are known at compile time, so this can be converted to a static_assert(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Don't try to watch namespaces in procfs with inotify, use timer instead2024_02_19.ff22a78	Stefano Brivio	2024-02-19	2	-7/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We watch network namespace entries to detect when we should quit (unless --no-netns-quit is passed), and these might stored in a tmpfs typically mounted at /run/user/UID or /var/run/user/UID, or found in procfs at /proc/PID/ns/. Currently, we try to use inotify for any possible location of those entries, but inotify, of course, doesn't work on pseudo-filesystems (see inotify(7)). The man page reflects this: the description of --no-netns-quit implies that we won't quit anyway if the namespace is not "bound to the filesystem". Well, we won't quit, but, since commit 9e0dbc894813 ("More deterministic detection of whether argument is a PID, PATH or NAME"), we try. And, indeed, this is harmless, as the caveat from that commit message states. Now, it turns out that Buildah, a tool to create container images, sharing its codebase with Podman, passes a procfs entry to pasta, and expects pasta to exit once the network namespace is not needed anymore, that is, once the original container process, also spawned by Buildah, terminates. Get this to work by using the timer fallback mechanism if the namespace name is passed as a path belonging to a pseudo-filesystem. This is expected to be procfs, but I covered sysfs and devpts pseudo-filesystems as well, because nothing actually prevents creating this kind of directory structure and links there. Note that fstatfs(), according to some versions of man pages, was apparently "deprecated" by the LSB. My reasoning for using it is essentially this: https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u ...that is, there was no such thing as an LSB deprecation, and anyway there's no other way to get the filesystem type. Also note that, while it might sound more obvious to detect the filesystem type using fstatfs() on the file descriptor itself (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no matter what path was given to pasta. If we use the parent directory, we'll typically have either tmpfs or procfs reported. If the target namespace is given as a PID, or as a PID-based procfs entry, we don't risk races if this PID is recycled: our handle on /proc/PID/ns will always refer to the original namespace associated with that PID, and we don't re-open this entry from procfs to check it. There's, however, a remaining race possibility if the parent process is not the one associated to the network namespace we operate on: in that case, the parent might pass a procfs entry associated to a PID that was recycled by the time we parse it. This can't happen if the namespace PID matches the one of the parent, because we detach from the controlling terminal after parsing the namespace reference. To avoid this type of race, if desired, we could add the option for the parent to pass a PID file descriptor, that the parent obtained via pidfd_open(). This is beyond the scope of this change. Update the man page to reflect that, even if the target network namespace is passed as a procfs path or a PID, we'll now quit when the procfs entry is gone. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Allow pasta to remount procfs2024_02_16.08344da	Stefano Brivio	2024-02-16	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \|	Partially equivalent to commit abf5ef6c22d2 ("apparmor: Allow pasta to remount /proc, access entries under its own copy"): we should allow pasta to remount /proc. It still works otherwise, but further UID remapping in nested user namespaces (e.g. pasta in pasta) won't. Reported-by: Laurent Jacquot <jk@lutty.net> Link: https://bugs.passt.top/show_bug.cgi?id=79#c3 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: No routable interface for IPv4 or IPv6 is informational, not a warning	Stefano Brivio	2024-02-16	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...Podman users might get confused by the fact that if we can't find a default route for a given IP version, we'll report that as a warning message and possibly just before actual error messages. However, a lack of routable interface for IPv4 or IPv6 can be a normal circumstance: don't warn about it, just state that as informational message, if those are displayed (they're not in non-error paths in Podman, for example). While at it, make it clear that we're disabling IPv4 or IPv6 if there's no routable interface for the corresponding IP version. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Add fallback timer mechanism to check if namespace is gone	Stefano Brivio	2024-02-16	4	-39/+107
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We don't know how frequently this happens, but hitting fs.inotify.max_user_watches or similar sysctl limits is definitely not out of question, and Paul mentioned that, for example, Podman's CI environments hit similar issues in the past. Introduce a fallback mechanism based on a timer file descriptor: we grab the directory handle at startup, and we can then use openat(), triggered periodically, to check if the (network) namespace directory still exists. If openat() fails at some point, exit. Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707 Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, passt.1: Exit if we can't bind a forwarded port, except for -[tu] all	Stefano Brivio	2024-02-16	2	-28/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...or similar, that is, if only excluded ranges are given (implying we'll forward any other available port). In that case, we'll usually forward large sets of ports, and it might be inconvenient for the user to skip excluding single ports that are already taken. The existing behaviour, that is, exiting only if we fail to bind all the ports for one given forwarding option, turns out to be problematic for several aspects raised by Paul: - Podman merges ranges anyway, so we might fail to bind all the ports from a specific range given by the user, but we'll not fail anyway because Podman merges it with another one where we succeed to bind at least one port. At the same time, there should be no semantic difference between multiple ranges given by a single option and multiple ranges given as multiple options: it's unexpected and not documented - the user might actually rely on a given port to be forwarded to a given container or a virtual machine, and if connections are forwarded to an unrelated process, this might raise security concerns - given that we can try and fail to bind multiple ports before exiting (in case we can't bind any), we don't have a specific error code we can return to the user, so we don't give the user helpful indication as to why we couldn't bind ports. Exit as soon as we fail to create or bind a socket for a given forwarded port, and report the actual error. Keep the current behaviour, however, in case the user wants to forward all the (available) ports for a given protocol, or all the ports with excluded ranges only. There, it's more reasonable that the user is expecting partial failures, and it's probably convenient that we continue with the ports we could forward. Update the manual page to reflect the new behaviour, and the old behaviour too in the cases where we keep it. Suggested-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: udp_sock_init_ns() partially duplicats udp_port_rebind_outbound()	David Gibson	2024-02-14	1	-48/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Usually automatically forwarded UDP outbound ports are set up by udp_port_rebind_outbound() called from udp_timer(). However, the very first time they're created and bound is by udp_sock_init_ns() called from udp_init(). udp_sock_init_ns() is essentially an unnecessary cut down version of udp_port_rebind_outbound(), so we can jusat remove it. Doing so does require moving udp_init() below udp_port_rebind_outbound()'s definition. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't prematurely (and incorrectly) set up automatic inbound forwards	David Gibson	2024-02-14	1	-17/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For automated inbound port forwarding in pasta mode we scan bound ports within the guest namespace via /proc and bind matching ports on the host to listen for packets. For UDP this is usually handled by udp_timer() which calls port_fwd_scan_udp() followed by udp_port_rebind(). However there's one initial scan before the the UDP timer is started: we call port_fwd_scan_udp() from port_fwd_init(), and actually bind the resulting ports in udp_sock_init_init() called from udp_init(). Unfortunately, the version in udp_sock_init_init() isn't correct. It unconditionally opens a new socket for every forwarded port, even if a socket has already been explicit created with the -u option. If the explicitly forwarded ports have particular configuration (such as a specific bound address address, or one implied by the -o option) those will not be replicated in the new socket. We essentially leak the original correctly configured socket, replacing it with one which might not be right. We could make udp_sock_init_init() use udp_port_rebind() to get that right, but there's actually no point doing so: * The initial bind was introduced by ccf6d2a7b48d ("udp: Actually bind detected namespace ports in init namespace") at which time we didn't periodically scan for bound UDP ports. Periodic scanning was introduced in 457ff122e ("udp,pasta: Periodically scan for ports to automatically forward") making the bind from udp_init() redundant. * At the time of udp_init(), programs in the guest namespace are likely not to have started yet (unless attaching a pre-existing namespace) so there's likely not anything to scan for anyway. So, simply remove the initial, broken socket create/bind, allowing automatic port forwards to be created the first time udp_timer() runs. Reported-by: Laurent Jacquot <jk@lutty.net> Suggested-by: Laurent Jacquot <jk@lutty.net> Link: https://bugs.passt.top/show_bug.cgi?id=79 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Use const rtnh pointer	David Gibson	2024-02-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	6c7623d07 ("netlink: Add support to fetch default gateway from multipath routes") inadvertently introduced a new cppcheck warning for a variable which could be a const pointer but isn't. This occurs with cppcheck-2.13.0-1.fc39.x86_64 in Fedora 39 at least. Fixes: 6c7623d07bbd ("netlink: Add support to fetch default gateway from multipath routes") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: setlogmask(0) can actually result in a system call, don't use it	Stefano Brivio	2024-02-14	2	-13/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before commit 32d07f5e59f2 ("passt, pasta: Completely avoid dynamic memory allocation"), we didn't store the current log mask in a variable, and we fetched it using setlogmask(0) wherever needed. But after that commit, we can use our log_mask copy instead. And we should: with recent glibc versions, setlogmask(0) actually results in a system call, which causes a substantial overhead with high transfer rates: we use setlogmask(0) even to decide we don't want to print debug messages. Now that we rely on log_mask in early stages, before setlogmask() is called, we need to initialise that variable to the special LOG_EMERG mask value right away: define LOG_EARLY to make this clearer, and, while at it, group conditions in vlogmsg() into something more terse. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Fix subtle bug in fast re-transmit path	David Gibson	2024-02-11	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a duplicate ack from the tap side triggers a fast re-transmit, we set both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of the duplicate ack. Setting seq_to_tap is correct: this is what triggers the retransmit from this point onwards. Setting seq_ack_from_tap is not correct, though. In most cases setting seq_ack_from_tap will be redundant but harmless: it will have already been updated to the same value by tcp_update_seqack_from_tap() a few lines above. However that call can be skipped if tcp_sock_consume() fails, which is rare but possible. In that case this update will cause problems. We use seq_ack_from_tap to track two logically distinct things: how much of the stream has been acked by the guest, and how much of the stream from the socket has been read and discarded (as opposed to MSG_PEEKed). We attempt to keep those values the same, because we discard data exactly when it is acked by the guest. However tcp_sock_consume() failing means we weren't able to disard the acked data. To handle that case, we skip the usual update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get one which supersedes it soon enough. Setting seq_ack_from_tap in the fast retransmit path, however, means we now really will have the read/discard point in the stream out of sync with seq_ack_from_tap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Add support to fetch default gateway from multipath routes	Stefano Brivio	2024-02-09	2	-5/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the default route for a given IP version is a multipath one, instead of refusing to start because there's no RTA_GATEWAY attribute in the set returned by the kernel, we can just pick one of the paths. To make this somewhat less arbitrary, pick the path with the highest weight, if weights differ. Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/20927 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	icmp: Dedicated functions for starting and closing ping sequences	David Gibson	2024-01-22	1	-35/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	ICMP sockets are cleaned up on a timeout implemented in icmp_timer_one(), and the logic to do that cleanup is open coded in that function. Similarly new sockets are opened when we discover we don't have an existing one in icmp_tap_handler(), and again the logic is open-coded. That's not the worst thing, but it's a bit cleaner to have dedicated functions for the creation and destruction of ping sockets. This will also make things a bit easier for future changes we have in mind. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Validate packets received on ping sockets	David Gibson	2024-01-22	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \|	We access fields of packets received from ping sockets assuming they're echo replies, without actually checking that. Of course, we don't expect anything else from the kernel, but it's probably best to verify. While we're at it, also check for short packets, or a receive address of the wrong family. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Warn on receive errors from ping sockets	David Gibson	2024-01-22	1	-1/+4
\| \| \| \| \| \| \| \| \|	Currently we silently ignore an errors receiving a packet from a ping socket. We don't expect that to happen, so it's probably worth reporting if it does. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Consolidate icmp_sock_handler() with icmpv6_sock_handler()	David Gibson	2024-01-22	3	-59/+37
\| \| \| \| \| \| \| \| \| \| \|	Currently we have separate handlers for ICMP and ICMPv6 ping replies. Although there are a number of points of difference, with some creative refactoring we can combine these together sensibly. Although it doesn't save a vast amount of code, it does make it clearer that we're performing basically the same steps for each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Share more between IPv4 and IPv6 paths in icmp_tap_handler()	David Gibson	2024-01-22	1	-68/+68
\| \| \| \| \| \| \| \| \| \|	Currently icmp_tap_handler() consists of two almost disjoint paths for the IPv4 and IPv6 cases. The only thing they share is an error message. We can use some intermediate variables to refactor this to share some more code between those paths. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Simplify socket expiry scanning	David Gibson	2024-01-22	2	-33/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we use icmp_act[] to scan for ICMP ids which might have an open socket which could time out. However icmp_act[] contains no information that's not already in icmp_id_map[] - it's just an "index" which allows scanning for relevant entries with less cache footprint. We only scan for ICMP socket expiry every 1s, though, so it's not clear that cache footprint really matters. Furthermore, there's no strong reason we need to scan even that often - the timeout is fairly arbitrary and approximate. So, eliminate icmp_act[] in favour of directly scanning icmp_id_map[] and compensate for the cache impact by reducing the scan frequency to once every 10s. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Use -1 to represent "missing" sockets	David Gibson	2024-01-22	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \|	icmp_id_map[] contains, amongst other things, fds for "ping" sockets associated with various ICMP echo ids. However, we only lazily open() those sockets, so many will be missing. We currently represent that with a 0, which isn't great, since that's technically a valid fd. Use -1 instead. This does require initializing the fields in icmp_id_map[] but we already have an obvious place to do that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't attempt to match host IDs to guest IDs	David Gibson	2024-01-22	1	-12/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When forwarding pings from tap, currently we create a ping socket with a socket address whose port is set to the ID of the ping received from the guest. This causes the socket to send pings with the same ID on the host. Although this seems look a good idea for maximum transparency, it's probably unwise. First, it's fallible - the bind() could fail, and we already have fallback logic which will overwrite the packets with the expected guest id if the id we get on replies doesn't already match. We might as well do that unconditionally. But more importantly, we don't know what else on the host might be using ping sockets, so we could end up with an ID that's the same as an existing socket. You'd expect that to fail the bind() with EADDRINUSE, which would be fine: we'd fall back to rewriting the reply ids. However it appears the kernel (v6.6.3 at least), does not fail the bind() and instead it's "last socket wins" in terms of who gets the replies. So we could accidentally intercept ping replies for something else on the host. So, instead of using bind() to set the id, just let the kernel pick one and expect to translate the replies back. Although theoretically this makes the passt/pasta link a bit less "transparent", essentially nothing cares about specific ping IDs, much like TCP source ports, which we also don't preserve. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't attempt to handle "wrong direction" ping socket traffic	David Gibson	2024-01-22	1	-10/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Linux ICMP "ping" sockets are very specific in what they do. They let userspace send ping requests (ICMP_ECHO or ICMP6_ECHO_REQUEST), and receive matching replies (ICMP_ECHOREPLY or ICMP6_ECHO_REPLY). They don't let you intercept or handle incoming ping requests. In the case of passt/pasta that means we can process echo requests from tap and forward them to a ping socket, then take the replies from the ping socket and forward them to tap. We can't do the reverse: take echo requests from the host and somehow forward them to the guest. There's really no way for something outside to initiate a ping to a passt/pasta connected guest and if there was we'd need an entirely different mechanism to handle it. However, we have some logic to deal with packets going in that reverse direction. Remove it, since it can't ever be used that way. While we're there use defines for the ICMPv6 types, instead of open coded type values. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Remove redundant initialisation of sendto() address	David Gibson	2024-01-22	1	-2/+0
\| \| \| \| \| \| \| \| \|	We initialise the address portion of the sockaddr for sendto() to the unspecified address, but then always overwrite it with the actual destination address before we call the sendto(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't set "port" on destination sockaddr for ping sockets	David Gibson	2024-01-22	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We set the port to the ICMP id on the sendto() address when using ICMP ping sockets. However, this has no effect: the ICMP id the kernel uses is determined only by the "port" on the socket's bound address (which is constructed inside sock_l4(), using the id we also pass to it). For unclear reasons this change triggers cppcheck 2.13.0 to give new "variable could be const pointer" warnings, so make *ih const as well to fix that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Avoid moving flow entries to compact table	David Gibson	2024-01-22	7	-87/+167
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we always keep the flow table maximally compact: that is all the active entries are contiguous at the start of the table. Doing this sometimes requires moving an entry when one is freed. That's kind of fiddly, and potentially expensive: it requires updating the hash table for the new location, and depending on flow type, it may require EPOLL_CTL_MOD, system calls to update epoll tags with the new location too. Implement a new way of managing the flow table that doesn't ever move entries. It attempts to maintain some compactness by always using the first free slot for a new connection, and mitigates the effect of non compactness by cheaply skipping over contiguous blocks of free entries. See the "theory of operation" comment in flow.c for details. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b [sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid Coverity Scan false positive] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Enforce that freeing of closed flows must happen in deferred handlers	David Gibson	2024-01-22	5	-15/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, flows are only evern finally freed (and the table compacted) from the deferred handlers. Some future ways we want to optimise managing the flow table will rely on this, so enforce it: rather than having the TCP code directly call flow_table_compact(), add a boolean return value to the per-flow deferred handlers. If true, this indicates that the flow code itself should free the flow. This forces all freeing of flows to occur during the flow code's scan of the table in flow_defer_handler() which opens possibilities for future optimisations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Abstract allocation of new flows with helper function	David Gibson	2024-01-22	3	-11/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp.c open codes the process of allocating a new flow from the flow table: twice, in fact, once for guest to host and once for host to guest connections. This duplication isn't ideal and will get worse as we add more protocols to the flow table. It also makes it harder to experiment with different ways of handling flow table allocation. Instead, introduce a function to allocate a new flow: flow_alloc(). In some cases we currently check if we're able to allocate, but delay the actual allocation. We now handle that slightly differently with a flow_alloc_cancel() function to back out a recent allocation. We have that separate from a flow_free() function, because future changes we have in mind will need to handle this case a little differently. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Move flow_count from context structure to a global	David Gibson	2024-01-22	7	-18/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global. Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Move flow_log_() to near top of flow.c	David Gibson	2024-01-22	1	-18/+18
\| \| \| \| \| \| \| \| \| \| \| \|	flow_log_() is a very basic widely used function that many other functions in flow.c will end up needing. At present it's below flow_table_compact() which happens not to need it, but that's likely to change. Move it to near the top of flow.c to avoid forward declarations. Code motion only, no changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets	David Gibson	2024-01-22	5	-39/+27
\| \| \| \| \| \| \| \| \| \| \|	Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Better handling of number of epoll types	David Gibson	2024-01-22	2	-3/+5
\| \| \| \| \| \| \| \| \|	As we already did for flow types, use an "EPOLL_NUM_TYPES" isntead of EPOLL_TYPE_MAX, which is a little bit safer and clearer. Add a static assert on the size of the matching names array. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add handling for per-flow timers	David Gibson	2024-01-22	4	-12/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the flow table so that it can run tcp_splice_timer() on each spliced connection. More generally, other flow types might want to run similar timers in future. We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc. However, this would need to scan the flow table, which we would have just done in flow_defer_handler(). We'd prefer to just scan the flow table once, dispatching both per-flow deferred events and per-flow timed events if necessary. So, extend flow_defer_handler() to do this. For now we use the same timer interval for all flow types (1s). We can make that more flexible in future if we need to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>