passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	tcp, tcp_splice: Helpers for getting sockets from the pools	David Gibson	2024-02-27	3	-29/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We maintain pools of ready-to-connect sockets in both the original and (for pasta) guest namespace to reduce latency when starting new TCP connections. If we exhaust those pools we have to take a higher latency path to get a new socket. Currently we open-code that fallback in the places we need it. To improve clarity encapsulate that into helper functions. While we're at it, give those helpers clearer error reporting. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Issue warnings if unable to refill socket pool	David Gibson	2024-02-27	3	-11/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if tcp_sock_refill_pool() is unable to fill all the slots in the pool, it will silently exit. This might lead to a later attempt to get fds from the pool to fail at which point it will be harder to tell what originally went wrong. Instead add warnings if we're unable to refill any of the socket pools when requested. We have tcp_sock_refill_pool() return an error and report it in the callers, because those callers have more context allowing for a more useful message. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Stop on first error when refilling socket pools	David Gibson	2024-02-27	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if we get an error opening a new socket while refilling a socket pool, we carry on to the next slot and try again. This isn't very useful, since by far the most likely cause of an error is some sort of resource exhaustion. Trying again will probably just hit the same error, and maybe even make things worse. So, instead stop on the first error while refilling the pool, making do with however many sockets we managed to open before the error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't stop refilling socket pool if we find a filled entry	David Gibson	2024-02-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp_sock_refill_pool() stops as soon as it finds an entry in the pool with a valid fd. This appears to makes sense: we always use fds from the front of the pool, so if we find a filled one, the rest of the pool should be filled as well. However, that's not quite correct. If a previous refill hit errors trying to open new sockets, it could leave gaps between blocks of valid fds. We're going to add some changes that could make that more likely. So, for robustness, instead skip over the filled entry but still try to refill the rest of the array. We expect simply iterating over the pool to be of small cost compared to even a single system call, so this shouldn't have much impact. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use sa_family_t for address family variables	David Gibson	2024-02-27	11	-19/+20
\| \| \| \| \| \| \| \| \| \|	Sometimes we use sa_family_t for variables and parameters containing a socket address family, other times we use a plain int. Since sa_family_t is what's actually used in struct sockaddr and friends, standardise on that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix 16-bit overflow in udp_invert_portmap()2024_02_20.1e6f92b	David Gibson	2024-02-20	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The code in udp_invert_portmap() is written based on an incorrect understanding of C's (arcane) integer promotion rules. We calculate '(in_port_t)i + delta' expecting the result to be of type in_port_t (16 bits). However "small integer types" (those narrower than 'int') are always promoted to int for expressions, meaning this calculation can overrun the rdelta[] array. Fix this, and use a new intermediate for the index, to make it very clear what it's type is. We also change i to unsigned, to avoid any possible confusion from mixing signed and unsigned types. Link: https://bugs.passt.top/show_bug.cgi?id=80 Reported-by: Laurent Jacquot <jk@lutty.net> Suggested-by: Laurent Jacquot <jk@lutty.net> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Assertion in udp_invert_portmap() can be calculated at compile time	David Gibson	2024-02-20	1	-1/+2
\| \| \| \| \| \| \| \|	All the values in this ASSERT() are known at compile time, so this can be converted to a static_assert(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Don't try to watch namespaces in procfs with inotify, use timer instead2024_02_19.ff22a78	Stefano Brivio	2024-02-19	2	-7/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We watch network namespace entries to detect when we should quit (unless --no-netns-quit is passed), and these might stored in a tmpfs typically mounted at /run/user/UID or /var/run/user/UID, or found in procfs at /proc/PID/ns/. Currently, we try to use inotify for any possible location of those entries, but inotify, of course, doesn't work on pseudo-filesystems (see inotify(7)). The man page reflects this: the description of --no-netns-quit implies that we won't quit anyway if the namespace is not "bound to the filesystem". Well, we won't quit, but, since commit 9e0dbc894813 ("More deterministic detection of whether argument is a PID, PATH or NAME"), we try. And, indeed, this is harmless, as the caveat from that commit message states. Now, it turns out that Buildah, a tool to create container images, sharing its codebase with Podman, passes a procfs entry to pasta, and expects pasta to exit once the network namespace is not needed anymore, that is, once the original container process, also spawned by Buildah, terminates. Get this to work by using the timer fallback mechanism if the namespace name is passed as a path belonging to a pseudo-filesystem. This is expected to be procfs, but I covered sysfs and devpts pseudo-filesystems as well, because nothing actually prevents creating this kind of directory structure and links there. Note that fstatfs(), according to some versions of man pages, was apparently "deprecated" by the LSB. My reasoning for using it is essentially this: https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u ...that is, there was no such thing as an LSB deprecation, and anyway there's no other way to get the filesystem type. Also note that, while it might sound more obvious to detect the filesystem type using fstatfs() on the file descriptor itself (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no matter what path was given to pasta. If we use the parent directory, we'll typically have either tmpfs or procfs reported. If the target namespace is given as a PID, or as a PID-based procfs entry, we don't risk races if this PID is recycled: our handle on /proc/PID/ns will always refer to the original namespace associated with that PID, and we don't re-open this entry from procfs to check it. There's, however, a remaining race possibility if the parent process is not the one associated to the network namespace we operate on: in that case, the parent might pass a procfs entry associated to a PID that was recycled by the time we parse it. This can't happen if the namespace PID matches the one of the parent, because we detach from the controlling terminal after parsing the namespace reference. To avoid this type of race, if desired, we could add the option for the parent to pass a PID file descriptor, that the parent obtained via pidfd_open(). This is beyond the scope of this change. Update the man page to reflect that, even if the target network namespace is passed as a procfs path or a PID, we'll now quit when the procfs entry is gone. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Allow pasta to remount procfs2024_02_16.08344da	Stefano Brivio	2024-02-16	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \|	Partially equivalent to commit abf5ef6c22d2 ("apparmor: Allow pasta to remount /proc, access entries under its own copy"): we should allow pasta to remount /proc. It still works otherwise, but further UID remapping in nested user namespaces (e.g. pasta in pasta) won't. Reported-by: Laurent Jacquot <jk@lutty.net> Link: https://bugs.passt.top/show_bug.cgi?id=79#c3 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: No routable interface for IPv4 or IPv6 is informational, not a warning	Stefano Brivio	2024-02-16	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...Podman users might get confused by the fact that if we can't find a default route for a given IP version, we'll report that as a warning message and possibly just before actual error messages. However, a lack of routable interface for IPv4 or IPv6 can be a normal circumstance: don't warn about it, just state that as informational message, if those are displayed (they're not in non-error paths in Podman, for example). While at it, make it clear that we're disabling IPv4 or IPv6 if there's no routable interface for the corresponding IP version. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Add fallback timer mechanism to check if namespace is gone	Stefano Brivio	2024-02-16	4	-39/+107
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We don't know how frequently this happens, but hitting fs.inotify.max_user_watches or similar sysctl limits is definitely not out of question, and Paul mentioned that, for example, Podman's CI environments hit similar issues in the past. Introduce a fallback mechanism based on a timer file descriptor: we grab the directory handle at startup, and we can then use openat(), triggered periodically, to check if the (network) namespace directory still exists. If openat() fails at some point, exit. Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707 Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, passt.1: Exit if we can't bind a forwarded port, except for -[tu] all	Stefano Brivio	2024-02-16	2	-28/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...or similar, that is, if only excluded ranges are given (implying we'll forward any other available port). In that case, we'll usually forward large sets of ports, and it might be inconvenient for the user to skip excluding single ports that are already taken. The existing behaviour, that is, exiting only if we fail to bind all the ports for one given forwarding option, turns out to be problematic for several aspects raised by Paul: - Podman merges ranges anyway, so we might fail to bind all the ports from a specific range given by the user, but we'll not fail anyway because Podman merges it with another one where we succeed to bind at least one port. At the same time, there should be no semantic difference between multiple ranges given by a single option and multiple ranges given as multiple options: it's unexpected and not documented - the user might actually rely on a given port to be forwarded to a given container or a virtual machine, and if connections are forwarded to an unrelated process, this might raise security concerns - given that we can try and fail to bind multiple ports before exiting (in case we can't bind any), we don't have a specific error code we can return to the user, so we don't give the user helpful indication as to why we couldn't bind ports. Exit as soon as we fail to create or bind a socket for a given forwarded port, and report the actual error. Keep the current behaviour, however, in case the user wants to forward all the (available) ports for a given protocol, or all the ports with excluded ranges only. There, it's more reasonable that the user is expecting partial failures, and it's probably convenient that we continue with the ports we could forward. Update the manual page to reflect the new behaviour, and the old behaviour too in the cases where we keep it. Suggested-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: udp_sock_init_ns() partially duplicats udp_port_rebind_outbound()	David Gibson	2024-02-14	1	-48/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Usually automatically forwarded UDP outbound ports are set up by udp_port_rebind_outbound() called from udp_timer(). However, the very first time they're created and bound is by udp_sock_init_ns() called from udp_init(). udp_sock_init_ns() is essentially an unnecessary cut down version of udp_port_rebind_outbound(), so we can jusat remove it. Doing so does require moving udp_init() below udp_port_rebind_outbound()'s definition. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't prematurely (and incorrectly) set up automatic inbound forwards	David Gibson	2024-02-14	1	-17/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For automated inbound port forwarding in pasta mode we scan bound ports within the guest namespace via /proc and bind matching ports on the host to listen for packets. For UDP this is usually handled by udp_timer() which calls port_fwd_scan_udp() followed by udp_port_rebind(). However there's one initial scan before the the UDP timer is started: we call port_fwd_scan_udp() from port_fwd_init(), and actually bind the resulting ports in udp_sock_init_init() called from udp_init(). Unfortunately, the version in udp_sock_init_init() isn't correct. It unconditionally opens a new socket for every forwarded port, even if a socket has already been explicit created with the -u option. If the explicitly forwarded ports have particular configuration (such as a specific bound address address, or one implied by the -o option) those will not be replicated in the new socket. We essentially leak the original correctly configured socket, replacing it with one which might not be right. We could make udp_sock_init_init() use udp_port_rebind() to get that right, but there's actually no point doing so: * The initial bind was introduced by ccf6d2a7b48d ("udp: Actually bind detected namespace ports in init namespace") at which time we didn't periodically scan for bound UDP ports. Periodic scanning was introduced in 457ff122e ("udp,pasta: Periodically scan for ports to automatically forward") making the bind from udp_init() redundant. * At the time of udp_init(), programs in the guest namespace are likely not to have started yet (unless attaching a pre-existing namespace) so there's likely not anything to scan for anyway. So, simply remove the initial, broken socket create/bind, allowing automatic port forwards to be created the first time udp_timer() runs. Reported-by: Laurent Jacquot <jk@lutty.net> Suggested-by: Laurent Jacquot <jk@lutty.net> Link: https://bugs.passt.top/show_bug.cgi?id=79 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Use const rtnh pointer	David Gibson	2024-02-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	6c7623d07 ("netlink: Add support to fetch default gateway from multipath routes") inadvertently introduced a new cppcheck warning for a variable which could be a const pointer but isn't. This occurs with cppcheck-2.13.0-1.fc39.x86_64 in Fedora 39 at least. Fixes: 6c7623d07bbd ("netlink: Add support to fetch default gateway from multipath routes") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: setlogmask(0) can actually result in a system call, don't use it	Stefano Brivio	2024-02-14	2	-13/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before commit 32d07f5e59f2 ("passt, pasta: Completely avoid dynamic memory allocation"), we didn't store the current log mask in a variable, and we fetched it using setlogmask(0) wherever needed. But after that commit, we can use our log_mask copy instead. And we should: with recent glibc versions, setlogmask(0) actually results in a system call, which causes a substantial overhead with high transfer rates: we use setlogmask(0) even to decide we don't want to print debug messages. Now that we rely on log_mask in early stages, before setlogmask() is called, we need to initialise that variable to the special LOG_EMERG mask value right away: define LOG_EARLY to make this clearer, and, while at it, group conditions in vlogmsg() into something more terse. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Fix subtle bug in fast re-transmit path	David Gibson	2024-02-11	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a duplicate ack from the tap side triggers a fast re-transmit, we set both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of the duplicate ack. Setting seq_to_tap is correct: this is what triggers the retransmit from this point onwards. Setting seq_ack_from_tap is not correct, though. In most cases setting seq_ack_from_tap will be redundant but harmless: it will have already been updated to the same value by tcp_update_seqack_from_tap() a few lines above. However that call can be skipped if tcp_sock_consume() fails, which is rare but possible. In that case this update will cause problems. We use seq_ack_from_tap to track two logically distinct things: how much of the stream has been acked by the guest, and how much of the stream from the socket has been read and discarded (as opposed to MSG_PEEKed). We attempt to keep those values the same, because we discard data exactly when it is acked by the guest. However tcp_sock_consume() failing means we weren't able to disard the acked data. To handle that case, we skip the usual update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get one which supersedes it soon enough. Setting seq_ack_from_tap in the fast retransmit path, however, means we now really will have the read/discard point in the stream out of sync with seq_ack_from_tap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Add support to fetch default gateway from multipath routes	Stefano Brivio	2024-02-09	2	-5/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the default route for a given IP version is a multipath one, instead of refusing to start because there's no RTA_GATEWAY attribute in the set returned by the kernel, we can just pick one of the paths. To make this somewhat less arbitrary, pick the path with the highest weight, if weights differ. Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/20927 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	icmp: Dedicated functions for starting and closing ping sequences	David Gibson	2024-01-22	1	-35/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	ICMP sockets are cleaned up on a timeout implemented in icmp_timer_one(), and the logic to do that cleanup is open coded in that function. Similarly new sockets are opened when we discover we don't have an existing one in icmp_tap_handler(), and again the logic is open-coded. That's not the worst thing, but it's a bit cleaner to have dedicated functions for the creation and destruction of ping sockets. This will also make things a bit easier for future changes we have in mind. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Validate packets received on ping sockets	David Gibson	2024-01-22	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \|	We access fields of packets received from ping sockets assuming they're echo replies, without actually checking that. Of course, we don't expect anything else from the kernel, but it's probably best to verify. While we're at it, also check for short packets, or a receive address of the wrong family. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Warn on receive errors from ping sockets	David Gibson	2024-01-22	1	-1/+4
\| \| \| \| \| \| \| \| \|	Currently we silently ignore an errors receiving a packet from a ping socket. We don't expect that to happen, so it's probably worth reporting if it does. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Consolidate icmp_sock_handler() with icmpv6_sock_handler()	David Gibson	2024-01-22	3	-59/+37
\| \| \| \| \| \| \| \| \| \| \|	Currently we have separate handlers for ICMP and ICMPv6 ping replies. Although there are a number of points of difference, with some creative refactoring we can combine these together sensibly. Although it doesn't save a vast amount of code, it does make it clearer that we're performing basically the same steps for each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Share more between IPv4 and IPv6 paths in icmp_tap_handler()	David Gibson	2024-01-22	1	-68/+68
\| \| \| \| \| \| \| \| \| \|	Currently icmp_tap_handler() consists of two almost disjoint paths for the IPv4 and IPv6 cases. The only thing they share is an error message. We can use some intermediate variables to refactor this to share some more code between those paths. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Simplify socket expiry scanning	David Gibson	2024-01-22	2	-33/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we use icmp_act[] to scan for ICMP ids which might have an open socket which could time out. However icmp_act[] contains no information that's not already in icmp_id_map[] - it's just an "index" which allows scanning for relevant entries with less cache footprint. We only scan for ICMP socket expiry every 1s, though, so it's not clear that cache footprint really matters. Furthermore, there's no strong reason we need to scan even that often - the timeout is fairly arbitrary and approximate. So, eliminate icmp_act[] in favour of directly scanning icmp_id_map[] and compensate for the cache impact by reducing the scan frequency to once every 10s. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Use -1 to represent "missing" sockets	David Gibson	2024-01-22	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \|	icmp_id_map[] contains, amongst other things, fds for "ping" sockets associated with various ICMP echo ids. However, we only lazily open() those sockets, so many will be missing. We currently represent that with a 0, which isn't great, since that's technically a valid fd. Use -1 instead. This does require initializing the fields in icmp_id_map[] but we already have an obvious place to do that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't attempt to match host IDs to guest IDs	David Gibson	2024-01-22	1	-12/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When forwarding pings from tap, currently we create a ping socket with a socket address whose port is set to the ID of the ping received from the guest. This causes the socket to send pings with the same ID on the host. Although this seems look a good idea for maximum transparency, it's probably unwise. First, it's fallible - the bind() could fail, and we already have fallback logic which will overwrite the packets with the expected guest id if the id we get on replies doesn't already match. We might as well do that unconditionally. But more importantly, we don't know what else on the host might be using ping sockets, so we could end up with an ID that's the same as an existing socket. You'd expect that to fail the bind() with EADDRINUSE, which would be fine: we'd fall back to rewriting the reply ids. However it appears the kernel (v6.6.3 at least), does not fail the bind() and instead it's "last socket wins" in terms of who gets the replies. So we could accidentally intercept ping replies for something else on the host. So, instead of using bind() to set the id, just let the kernel pick one and expect to translate the replies back. Although theoretically this makes the passt/pasta link a bit less "transparent", essentially nothing cares about specific ping IDs, much like TCP source ports, which we also don't preserve. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't attempt to handle "wrong direction" ping socket traffic	David Gibson	2024-01-22	1	-10/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Linux ICMP "ping" sockets are very specific in what they do. They let userspace send ping requests (ICMP_ECHO or ICMP6_ECHO_REQUEST), and receive matching replies (ICMP_ECHOREPLY or ICMP6_ECHO_REPLY). They don't let you intercept or handle incoming ping requests. In the case of passt/pasta that means we can process echo requests from tap and forward them to a ping socket, then take the replies from the ping socket and forward them to tap. We can't do the reverse: take echo requests from the host and somehow forward them to the guest. There's really no way for something outside to initiate a ping to a passt/pasta connected guest and if there was we'd need an entirely different mechanism to handle it. However, we have some logic to deal with packets going in that reverse direction. Remove it, since it can't ever be used that way. While we're there use defines for the ICMPv6 types, instead of open coded type values. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Remove redundant initialisation of sendto() address	David Gibson	2024-01-22	1	-2/+0
\| \| \| \| \| \| \| \| \|	We initialise the address portion of the sockaddr for sendto() to the unspecified address, but then always overwrite it with the actual destination address before we call the sendto(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't set "port" on destination sockaddr for ping sockets	David Gibson	2024-01-22	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We set the port to the ICMP id on the sendto() address when using ICMP ping sockets. However, this has no effect: the ICMP id the kernel uses is determined only by the "port" on the socket's bound address (which is constructed inside sock_l4(), using the id we also pass to it). For unclear reasons this change triggers cppcheck 2.13.0 to give new "variable could be const pointer" warnings, so make *ih const as well to fix that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Avoid moving flow entries to compact table	David Gibson	2024-01-22	7	-87/+167
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we always keep the flow table maximally compact: that is all the active entries are contiguous at the start of the table. Doing this sometimes requires moving an entry when one is freed. That's kind of fiddly, and potentially expensive: it requires updating the hash table for the new location, and depending on flow type, it may require EPOLL_CTL_MOD, system calls to update epoll tags with the new location too. Implement a new way of managing the flow table that doesn't ever move entries. It attempts to maintain some compactness by always using the first free slot for a new connection, and mitigates the effect of non compactness by cheaply skipping over contiguous blocks of free entries. See the "theory of operation" comment in flow.c for details. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b [sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid Coverity Scan false positive] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Enforce that freeing of closed flows must happen in deferred handlers	David Gibson	2024-01-22	5	-15/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, flows are only evern finally freed (and the table compacted) from the deferred handlers. Some future ways we want to optimise managing the flow table will rely on this, so enforce it: rather than having the TCP code directly call flow_table_compact(), add a boolean return value to the per-flow deferred handlers. If true, this indicates that the flow code itself should free the flow. This forces all freeing of flows to occur during the flow code's scan of the table in flow_defer_handler() which opens possibilities for future optimisations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Abstract allocation of new flows with helper function	David Gibson	2024-01-22	3	-11/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp.c open codes the process of allocating a new flow from the flow table: twice, in fact, once for guest to host and once for host to guest connections. This duplication isn't ideal and will get worse as we add more protocols to the flow table. It also makes it harder to experiment with different ways of handling flow table allocation. Instead, introduce a function to allocate a new flow: flow_alloc(). In some cases we currently check if we're able to allocate, but delay the actual allocation. We now handle that slightly differently with a flow_alloc_cancel() function to back out a recent allocation. We have that separate from a flow_free() function, because future changes we have in mind will need to handle this case a little differently. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Move flow_count from context structure to a global	David Gibson	2024-01-22	7	-18/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global. Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Move flow_log_() to near top of flow.c	David Gibson	2024-01-22	1	-18/+18
\| \| \| \| \| \| \| \| \| \| \| \|	flow_log_() is a very basic widely used function that many other functions in flow.c will end up needing. At present it's below flow_table_compact() which happens not to need it, but that's likely to change. Move it to near the top of flow.c to avoid forward declarations. Code motion only, no changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets	David Gibson	2024-01-22	5	-39/+27
\| \| \| \| \| \| \| \| \| \| \|	Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Better handling of number of epoll types	David Gibson	2024-01-22	2	-3/+5
\| \| \| \| \| \| \| \| \|	As we already did for flow types, use an "EPOLL_NUM_TYPES" isntead of EPOLL_TYPE_MAX, which is a little bit safer and clearer. Add a static assert on the size of the matching names array. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add handling for per-flow timers	David Gibson	2024-01-22	4	-12/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the flow table so that it can run tcp_splice_timer() on each spliced connection. More generally, other flow types might want to run similar timers in future. We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc. However, this would need to scan the flow table, which we would have just done in flow_defer_handler(). We'd prefer to just scan the flow table once, dispatching both per-flow deferred events and per-flow timed events if necessary. So, extend flow_defer_handler() to do this. For now we use the same timer interval for all flow types (1s). We can make that more flexible in future if we need to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add flow-centric dispatch for deferred flow handling	David Gibson	2024-01-22	5	-17/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_defer_handler(), amongst other things, scans the flow table and does some processing for each TCP connection. When we add other protocols to the flow table, they're likely to want some similar scanning. It makes more sense for cache friendliness to perform a single scan of the flow table and dispatch to the protocol specific handlers, rather than having each protocol separately scan the table. To that end, add a new flow_defer_handler() handling all flow-linked deferred operations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Move per-type cleanup logic into per-type helpers	David Gibson	2024-01-22	3	-10/+14
\| \| \| \| \| \| \| \| \| \|	tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally on the connection being closed or closing. Move that logic into the "destroy" functions themselves, renaming them tcp_flow_defer() and tcp_splice_flow_defer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Remove redundant handling from tcp_timer()	David Gibson	2024-01-22	3	-19/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the connection table, expiring "tap" connections and calling tcp_splice_timer() for "splice" connections. tcp_splice_timer() expires spliced connections and then does some other processing. However, tcp_timer() is always called shortly after tcp_defer_handler() (from post_handler()), which also scans the flow table expiring both tap and spliced connections. So remove the redundant handling, and only do the extra tcp_splice_timer() work from tcp_timer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise on 'now' for current timestamp variables	David Gibson	2024-01-22	7	-37/+37
\| \| \| \| \| \| \| \| \|	In a number of places we pass around a struct timespec representing the (more or less) current time. Sometimes we call it 'now', and sometimes we call it 'ts'. Standardise on the more informative 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make flow_table.h #include the protocol specific headers it needs	David Gibson	2024-01-22	4	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	flow_table.h, the lower level flow header relies on having the struct definitions for every protocol specific flow type - so far that means tcp_conn.h. It doesn't include it itself, so tcp_conn.h must be included before flow_table.h. That's ok for now, but as we use the flow table for more things, flow_table.h will need the structs for all of them, which means the protocol specific .c files would need to include tcp_conn.h _and_ the equivalents for every other flow type before flow_table.h every time, which is weird. So, although we mostly lean towards the include style where .c files need to handle the include dependencies, in this case it makes more sense to have flow_table.h include all the protocol specific headers it needs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Remove unused pif_name() function	David Gibson	2024-01-16	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	pif_name() has no current callers, although we expect some as we expand the flow table support. I'm not sure why this didn't get caught by one of our static checkers earlier, but it's now causing cppcheck failures for me. Add a cppcheck suppression. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Make a bunch of pointer variables pointers to const	David Gibson	2024-01-16	11	-34/+41
\| \| \| \| \| \| \| \| \| \|	Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another warning for pointer variables which could be pointer to const but aren't. Use this to make a bunch of variables const pointers where they previously weren't for no particular reason. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Fix passt.mbuto for cases where /usr/sbin doesn't exist	David Gibson	2024-01-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	f0ccca74 ("test: make passt.mbuto script more robust") is supposed to make mbuto more robust by standardizing on always putting things in /usr/sbin with /sbin a symlink to it. This matters because different distros have different conventions about how the two are used. However, the logic there requires that /usr/sbin at least exists to start with. This isn't always the case with Fedora derived mbuto images. Ironically the DIRS variable ensures that /sbin exists, although we then remove it, but doesn't require /usr/sbin to exist. Fix that up so that the new logic will work with Fedora. Fixes: f0ccca741f64 ("test: make passt.mbuto script more robust") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Fetch most specific (longest prefix) address in nl_addr_get()2023_12_30.f091893	Stefano Brivio	2023-12-30	1	-5/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This happened in most cases implicitly before commit eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions"): while going through results from netlink, we would only copy an address into the provided return buffer if no address had been picked yet. Because of the insertion logic in the kernel (ipv6_link_dev_addr()), the first returned address would also be the one added last, and, in case of a Linux guest using a DHCPv6 client as well as SLAAC, that would be the address assigned via DHCPv6, because SLAAC happens before the DHCPv6 exchange. The effect of, instead, picking the last returned address (first assigned) is visible when passt or pasta runs nested, given that, by default, they advertise a prefix for SLAAC usage, plus an address via DHCPv6. The first level (L1 guest) would get a /64 address by means of SLAAC, and a /128 address via DHCPv6, the latter matching the address on the host. The second level (L2 guest) would also get two addresses: a /64 via SLAAC (same prefix as the host), and a /128 via DHCPv6, matching the the L1 SLAAC-assigned address, not the one obtained via DHCPv6. That is, none of the L2 addresses would match the address on the host. The whole point of having a DHCPv6 server is to avoid (implicit) NAT when possible, though. Fix this in a more explicit way than the behaviour we initially had: pick the first address among the set of most specific ones, by comparing prefix lengths. Do this for IPv4 and for link-local addresses, too, to match in any case the implementation of the default source address selection. Reported-by: Yalan Zhang <yalzhang@redhat.com> Fixes: eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Default SLAAC prefix comes from address (not prefix) on host	Stefano Brivio	2023-12-30	1	-7/+7
\| \| \| \| \|	Reported-by: Yalan Zhang <yalzhang@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Fix broken link to CentOS Stream package	Stefano Brivio	2023-12-30	1	-1/+1
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: make passt.mbuto script more robust	Jon Paul Maloy	2023-12-27	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \|	Creation of a symbolic link from /sbin to /usr/sbin fails if /sbin exists and is non-empty. This is the case on Ubuntu-23.04. We fix this by removing /sbin before creating the link. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: make tcp_sock_set_bufsize() static (again)	Laurent Vivier	2023-12-27	2	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \|	e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation") has exported tcp_sock_set_bufsize() to be able to use it in tcp_splice.c, but 6ccab72d9b40 has removed its use in tcp_splice.c, so we can set it static again. Fixes: 6ccab72d9b40 ("tcp: Improve handling of fallback if socket pool is empty on new splice") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>