passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	treewide: Use sa_family_t for address family variables	David Gibson	2024-02-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Sometimes we use sa_family_t for variables and parameters containing a socket address family, other times we use a plain int. Since sa_family_t is what's actually used in struct sockaddr and friends, standardise on that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix 16-bit overflow in udp_invert_portmap()2024_02_20.1e6f92b	David Gibson	2024-02-20	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The code in udp_invert_portmap() is written based on an incorrect understanding of C's (arcane) integer promotion rules. We calculate '(in_port_t)i + delta' expecting the result to be of type in_port_t (16 bits). However "small integer types" (those narrower than 'int') are always promoted to int for expressions, meaning this calculation can overrun the rdelta[] array. Fix this, and use a new intermediate for the index, to make it very clear what it's type is. We also change i to unsigned, to avoid any possible confusion from mixing signed and unsigned types. Link: https://bugs.passt.top/show_bug.cgi?id=80 Reported-by: Laurent Jacquot <jk@lutty.net> Suggested-by: Laurent Jacquot <jk@lutty.net> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Assertion in udp_invert_portmap() can be calculated at compile time	David Gibson	2024-02-20	1	-1/+2
\| \| \| \| \| \| \| \|	All the values in this ASSERT() are known at compile time, so this can be converted to a static_assert(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: udp_sock_init_ns() partially duplicats udp_port_rebind_outbound()	David Gibson	2024-02-14	1	-48/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Usually automatically forwarded UDP outbound ports are set up by udp_port_rebind_outbound() called from udp_timer(). However, the very first time they're created and bound is by udp_sock_init_ns() called from udp_init(). udp_sock_init_ns() is essentially an unnecessary cut down version of udp_port_rebind_outbound(), so we can jusat remove it. Doing so does require moving udp_init() below udp_port_rebind_outbound()'s definition. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't prematurely (and incorrectly) set up automatic inbound forwards	David Gibson	2024-02-14	1	-17/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For automated inbound port forwarding in pasta mode we scan bound ports within the guest namespace via /proc and bind matching ports on the host to listen for packets. For UDP this is usually handled by udp_timer() which calls port_fwd_scan_udp() followed by udp_port_rebind(). However there's one initial scan before the the UDP timer is started: we call port_fwd_scan_udp() from port_fwd_init(), and actually bind the resulting ports in udp_sock_init_init() called from udp_init(). Unfortunately, the version in udp_sock_init_init() isn't correct. It unconditionally opens a new socket for every forwarded port, even if a socket has already been explicit created with the -u option. If the explicitly forwarded ports have particular configuration (such as a specific bound address address, or one implied by the -o option) those will not be replicated in the new socket. We essentially leak the original correctly configured socket, replacing it with one which might not be right. We could make udp_sock_init_init() use udp_port_rebind() to get that right, but there's actually no point doing so: * The initial bind was introduced by ccf6d2a7b48d ("udp: Actually bind detected namespace ports in init namespace") at which time we didn't periodically scan for bound UDP ports. Periodic scanning was introduced in 457ff122e ("udp,pasta: Periodically scan for ports to automatically forward") making the bind from udp_init() redundant. * At the time of udp_init(), programs in the guest namespace are likely not to have started yet (unless attaching a pre-existing namespace) so there's likely not anything to scan for anyway. So, simply remove the initial, broken socket create/bind, allowing automatic port forwards to be created the first time udp_timer() runs. Reported-by: Laurent Jacquot <jk@lutty.net> Suggested-by: Laurent Jacquot <jk@lutty.net> Link: https://bugs.passt.top/show_bug.cgi?id=79 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise on 'now' for current timestamp variables	David Gibson	2024-01-22	1	-8/+8
\| \| \| \| \| \| \| \| \|	In a number of places we pass around a struct timespec representing the (more or less) current time. Sometimes we call it 'now', and sometimes we call it 'ts'. Standardise on the more informative 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Make a bunch of pointer variables pointers to const	David Gibson	2024-01-16	1	-3/+3
\| \| \| \| \| \| \| \| \| \|	Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another warning for pointer variables which could be pointer to const but aren't. Use this to make a bunch of variables const pointers where they previously weren't for no particular reason. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Make sock_l4() treat empty string ifname like NULL	David Gibson	2023-12-27	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \|	sock_l4() takes NULL for ifname if you don't want to bind the socket to a particular interface. However, for a number of the callers, it's more natural to use an empty string for that case. Change sock_l4() to accept either NULL or an empty string equivalently, and simplify some callers using that change. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Avoid in_addr_t	David Gibson	2023-12-27	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	IPv4 addresses can be stored in an in_addr_t or a struct in_addr. The former is just a type alias to a 32-bit integer, so doesn't really give us any type checking. Therefore we generally prefer the structure, since we mostly want to treat IP address as opaque objects. Fix a few places where we still use in_addr_t, but can just as easily use struct in_addr. Note there are still some uses of in_addr_t in conf.c, but those are justified: since they're doing prefix calculations, they actually need to look at the internals of the address as an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use IN4ADDR_LOOPBACK_INIT more widely	David Gibson	2023-12-27	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address without delving into its internals. However there are some places we don't use it, and explicitly look at the internal structure of struct in_addr, which we generally want to avoid. Use the define more widely to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd, util: Don't bind UDP ports with opposite-side bound TCP ports	Stefano Brivio	2023-11-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When pasta periodically scans bound ports and binds them on the other side in order to forward traffic, we bind UDP ports for corresponding TCP port numbers, too, to support protocols and applications such as iperf3 which use UDP port numbers matching the ones used by the TCP data connection. If we scan UDP ports in order to bind UDP ports, we skip detection of the UDP ports we already bound ourselves, to avoid looping back our own ports. Same with scanning and binding TCP ports. But if we scan for TCP ports in order to bind UDP ports, we need to skip bound TCP ports too, otherwise, as David pointed out: - we find a bound TCP port on side A, and bind the corresponding TCP and UDP ports on side B - at the next periodic scan, we find that UDP port bound on side B, and we bind the corresponding UDP port on side A - at this point, we unbind that UDP port on side B: we would otherwise loop back our own port. To fix this, we need to avoid binding UDP ports that we already bound, on the other side, as a consequence of finding a corresponding bound TCP port. Reproducing this issue is straightforward: ./pasta -- iperf3 -s # Wait one second, then from another terminal: iperf3 -c ::1 -u Reported-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp> Analysed-by: David Gibson <david@gibson.dropbear.id.au> Fixes: 457ff122e33c ("udp,pasta: Periodically scan for ports to automatically forward") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp,pasta: Periodically scan for ports to automatically forward	David Gibson	2023-11-19	1	-0/+74
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pasta supports automatic port forwarding, where we look for listening sockets in /proc/net (in both namespace and outside) and establish port forwarding to match. For TCP we do this scan both at initial startup, then periodically thereafter. For UDP however, we currently only scan at start. So unlike TCP we won't update forwarding to handle services that start after pasta has begun. There's no particular reason for that, other than that we didn't implement it. So, remove that difference, by scanning for new UDP forwards periodically too. The logic is basically identical to that for TCP, but it needs some changes to handle the mildly different data structures in the UDP case. Link: https://bugs.passt.top/show_bug.cgi?id=45 Link: https://github.com/rootless-containers/rootlesskit/issues/383 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove socket from udp_{tap,splice}_map when timed out	David Gibson	2023-11-07	1	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We save sockets bound to particular ports in udp_{tap,splice}_map for reuse later. If they're not used for a time, we time them out and close them. However, when that happened, we weren't actually removing the fds from the relevant map. That meant that later interactions on the same port could get a stale fd from the map. The stale fd might be closed, leading to unexpected EBADF errors, or it could have been re-used by a completely different socket bound to a different port, which could lead to us incorrectly forwarding packets. Reported-by: Chris Kuhn <kuhnchris@kuhnchris.eu> Reported-by: Jay <bugs.passt.top@bitsbetwixt.com> Link: https://bugs.passt.top/show_bug.cgi?id=57 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Consistently use -1 to indicate un-opened sockets in maps	David Gibson	2023-11-07	1	-5/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp uses the udp_tap_map, udp_splice_ns and udp_splice_init tables to keep track of already opened sockets bound to specific ports. We need a way to indicate entries where a socket hasn't been opened, but the code isn't consistent if this is indicated by a 0 or a -1: * udp_splice_sendfrom() and udp_tap_handler() assume that 0 indicates an unopened socket * udp_sock_init() fills in -1 for a failure to open a socket * udp_timer_one() is somewhere in between, treating only strictly positive fds as valid -1 (or, at least, negative) is really the correct choice here, since 0 is a theoretically valid fd value (if very unlikely in practice). Change to use that consistently throughout. The table does need to be initialised to all -1 values before any calls to udp_sock_init() which can happen from conf_ports(). Because C doesn't make it easy to statically initialise non zero values in large tables, this does require a somewhat awkward call to initialise the table from conf(). This is the best approach I could see for the short term, with any luck it will go away at some point when those socket tables are replaced by a unified flow table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Pass originating pif to tap handler functions	David Gibson	2023-11-07	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \|	For now, packets passed to the various *_tap_handler() functions always come from the single "tap" interface. We want to allow the possibility to broaden that in future. As preparation for that, have the code in tap.c pass the pif id of the originating interface to each of those handler functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Record originating pif in listening socket refs	David Gibson	2023-11-07	1	-11/+12
\| \| \| \| \| \| \| \| \| \|	For certain socket types, we record in the epoll ref whether they're sockets in the namespace, or on the host. We now have the notion of "pif" to indicate what "place" a socket is associated with, so generalise the simple one-bit 'ns' to a pif id. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Clean up ref initialisation in udp_sock_init()	David Gibson	2023-11-07	1	-5/+2
\| \| \| \| \| \| \| \| \|	udp_sock_init() has a number of paths that initialise uref differently. However some of the fields are initialised the same way in all of them. Move those fields into the original initialiser to save a few lines. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Make many pointers const	David Gibson	2023-10-04	1	-2/+2
\| \| \| \| \| \| \| \| \|	Newer versions of cppcheck (as of 2.12.0, at least) added a warning for pointers which could be declared to point at const data, but aren't. Based on that, make many pointers throughout the codebase const. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp, tap: Correctly advance through packets in udp_tap_handler()	David Gibson	2023-09-08	1	-7/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In both tap4_handler() and tap6_handler(), once we've sorted incoming l3 packets into "sequences", we then step through all the packets in each DUP sequence calling udp_tap_handler(). Or so it appears. In fact, udp_tap_handler() doesn't take an index and always starts with packet 0 of the sequence, even if called repeatedly. It appears to be written with the idea that the struct pool is a queue, from which it consumes packets as it processes them, but that's not how the pool data structure works. Correct this by adding an index parameter to udp_tap_handler() and altering the loops in tap.c to step through the pool properly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Don't pre-fill IPv4 destination address in headers	David Gibson	2023-08-22	1	-7/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because packets sent on the tap interface will always be going to the guest/namespace, we more-or-less know what address they'll be going to. So we pre-fill this destination address in our header buffers for IPv4. We can't do the same for IPv6 because we could need either the global or link-local address for the guest. In future we're going to want more flexibility for the destination address, so this pre-filling will get in the way. Change the flow so we always fill in the IPv4 destination address for each packet, rather than prefilling it from proto_update_l2_buf(). In fact for TCP we already redundantly filled the destination for each packet anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Don't include destination address in partially precomputed csums	David Gibson	2023-08-22	1	-11/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We partially prepopulate IP and TCP header structures including, amongst other things the destination address, which for IPv4 is always the known address of the guest/namespace. We partially precompute both the IPv4 header checksum and the TCP checksum based on this. In future we're going to want more flexibility with controlling the destination for IPv4 (as we already do for IPv6), so this precomputed value gets in the way. Therefore remove the IPv4 destination from the precomputed checksum and fold it into the checksum update when we actually send a packet. Doing this means we no longer need to recompute those partial sums when the destination address changes ({tcp,udp}_update_l2_buf()) and instead the computation can be moved to compile time. This means while we perform slightly more computations on each packet, we slightly reduce the amount of memory we need to access. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Pass source address to protocol handler functions	David Gibson	2023-08-22	1	-6/+8
\| \| \| \| \| \| \| \| \| \| \|	The tap code passes the IPv4 or IPv6 destination address of packets it receives to the protocol specific code. Currently that protocol code doesn't use the source address, but we want it to in future. So, in preparation, pass the IPv4/IPv6 source address of tap packets to those functions as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Tiny cleanup to udp_sock_handler()	David Gibson	2023-08-13	1	-1/+1
\| \| \| \| \| \| \| \| \|	Move the test for c->no_udp into the function itself, rather than in the dispatching switch statement to better localize the UDP specific logic, and make for greated consistency with other handler functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Generalize epoll_ref to cover things other than sockets	David Gibson	2023-08-13	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The epoll_ref type includes fields for the IP protocol of a socket, and the socket fd. However, we already have a few things in the epoll which aren't protocol sockets, and we may have more in future. Rename these fields to an abstract "fd type" and file descriptor for more generality. Similarly, rather than using existing IP protocol numbers for the type, introduce our own number space. For now these just correspond to the supported protocols, but we'll expand on that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Make ns_enter() a void function and report setns() errors	David Gibson	2023-08-04	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ns_enter() returns an integer... but it's always zero. If we actually fail the function doesn't return. Therefore it makes more sense for this to be a function returning void, and we can remove the cases where we pointlessly checked its return value. In addition ns_enter() is usually called from an ephemeral thread created by NS_CALL(). That means that the exit(EXIT_FAILURE) there usually won't be reported (since NS_CALL() doesn't wait() for the thread). So, use die() instead to print out some information in the unlikely event that our setns() here does fail. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use C11 anonymous members to make poll refs less verbose to use	David Gibson	2023-08-04	1	-27/+23
\| \| \| \| \| \| \| \| \| \| \| \|	union epoll_ref has a deeply nested set of structs and unions to let us subdivide it into the various different fields we want. This means that referencing elements can involve an awkward long string of intermediate fields. Using C11 anonymous structs and unions lets us do this less clumsily. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Relicense to GPL 2.0, or any later version	Stefano Brivio	2023-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Work around weird false positives with cppcheck-2.9.1	David Gibson	2023-03-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 89e38f55 "treewide: Fix header includes to build with musl" added extra #includes to work with musl. Unfortunately with the cppcheck version I'm using (cppcheck-2.9-1.fc37.x86_64 in Fedora 37) this causes weird false positives: specifically cppcheck seems to hit a #error in <bits/unistd.h> complaining about including it directly instead of via <unistd.h> (which is not something we're doing). I have no idea why that would be happening; but I'm guessing it has to be a bug in the cpp implementation in that cppcheck version. In any case, it's possible to work around this by moving the include of <unistd.h> before the include of <signal.h>. So, do that. Fixes: 89e38f55405d ("treewide: Fix header includes to build with musl") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Actually bind detected namespace ports in init namespace	Stefano Brivio	2023-03-21	1	-1/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When I reworked udp_init() to move most of the port binding logic to conf_ports, I accidentally dropped this bit of automatic port detection (and binding) at start-up. On -U auto, in pasta mode, udp_sock_init_ns() binds ports in the namespace that correspond to ports bound in the init namespace, but on -u auto, nothing actually happens after port detection. Add udp_sock_init_init() to deal with this, and while at it fix the comment to udp_sock_init_ns(): the latter takes care of outbound "connections". This is currently not covered by tests, and the UDP port needs to be already bound in the namespace when pasta starts (periodic detection for UDP is a missing feature at the moment). It can be checked like this: $ unshare -rUn # echo $$ 590092 # socat -u UDP-LISTEN:5555 STDOUT $ pasta -q -u auto 590092 $ echo "test" \| socat -u STDIN UDP:localhost:5555 Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 3c6ae625101a ("conf, tcp, udp: Allow address specification for forwarded ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Fix partial success return codes in {tcp,udp}_sock_init()	Stefano Brivio	2023-03-09	1	-16/+14
\| \| \| \| \| \| \| \| \| \| \| \|	The comments say we should return 0 on partial success, and an error code on complete failure. Rationale: if the user configures a port forwarding, and we succeed to bind that port for IPv4 or IPv6 only, that might actually be what the user intended. Adjust the two functions to reflect the comments. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, udp, util: Pass socket creation errors all the way up	Stefano Brivio	2023-03-09	1	-9/+9
\| \| \| \| \| \| \| \| \|	...starting from sock_l4(), pass negative error (errno) codes instead of -1. They will only be used in two commits from now, no functional changes intended here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Fix header includes to build with musl	Chris Kuhn	2023-03-09	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Roughly inspired from a patch by Chris Kuhn: fix up includes so that we can build against musl: glibc is more lenient as headers generally include a larger amount of other headers. Compared to the original patch, I only included what was needed directly in C files, instead of adding blanket includes in local header files. It's a bit more involved, but more consistent with the current (not ideal) situation. Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, icmp, tcp, udp: Add options to bind to outbound address and interface	Stefano Brivio	2023-03-09	1	-17/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I didn't notice earlier: libslirp (and slirp4netns) supports binding outbound sockets to specific IPv4 and IPv6 addresses, to force the source addresse selection. If we want to claim feature parity, we should implement that as well. Further, Podman supports specifying outbound interfaces as well, but this is simply done by resolving the primary address for an interface when the network back-end is started. However, since kernel version 5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for non-root users"), we can actually bind to a specific interface name, which doesn't need to be validated in advance. Implement -o / --outbound ADDR to bind to IPv4 and IPv6 addresses, and --outbound-if4 and --outbound-if6 to bind IPv4 and IPv6 sockets to given interfaces. Given that it probably makes little sense to select addresses and routes from interfaces different than the ones given for outbound sockets, also assign those as "template" interfaces, by default, unless explicitly overridden by '-i'. For ICMP and UDP, we call sock_l4() to open outbound sockets, as we already needed to bind to given ports or echo identifiers, and we can bind() a socket only once: there, pass address (if any) and interface (if any) for the existing bind() and setsockopt() calls. For TCP, in general, we wouldn't otherwise bind sockets. Add a specific helper to do that. For UDP outbound sockets, we need to know if the final destination of the socket is a loopback address, before we decide whether it makes sense to bind the socket at all: move the block mangling the address destination before the creation of the socket in the IPv4 path. This was already the case for the IPv6 path. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Fix signedness warning on 32-bits architectures	Stefano Brivio	2023-03-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a ssize_t is an int: udp.c: In function ‘udp_sock_handler’: udp.c:774:23: warning: comparison of integer expressions of different signedness: ‘unsigned int’ and ‘ssize_t’ {aka ‘int’} [-Wsign-compare] 774 \| for (i = 0; i < n; i += m) { \| ^ udp.c:781:43: warning: comparison of integer expressions of different signedness: ‘unsigned int’ and ‘ssize_t’ {aka ‘int’} [-Wsign-compare] 781 \| for (m = 1; i + m < n; m++) { \| Change 'i' and 'm' counters in udp_sock_handler() to signed versions, to match ssize_t n. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, udp: Allow any loopback address to be used as resolver	Stefano Brivio	2023-02-27	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Andrea reports that with a Fedora 37 guest running on a Fedora 37 host, both using systemd-resolved, with passt connecting them, running with default options, DNS queries don't work. systemd-resolved on the host is reachable only at the loopback address 127.0.0.53. We advertise the default gateway address to the guest as resolver, because our local address is of course unreachable from there, which means we see DNS queries directed to the default gateway, and we redirect them to 127.0.0.1. However, systemd-resolved doesn't answer on 127.0.0.1. To fix this, set @dns_match to the address of the default gateway, unless a different resolver address is explicitly configured, so that we know we explicitly have to map DNS queries, in this case, to the address of the local resolver. This means that in udp_tap_handler() we need to check, first, if the destination address of packets matches @dns_match: even if it's the address of the local gateway, we want to map that to a specific address, which isn't necessarily 127.0.0.1. Do the same for IPv6 for consistency, even though IPv6 defines a single loopback address. Reported-by: Andrea Bolognani <abologna@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Andrea Bolognani <abologna@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Actually use host resolver to forward DNS queries	Stefano Brivio	2023-02-27	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of the address of the first resolver we advertise to the guest or namespace. This was one of the intentions behind commit 3a2afde87dd1 ("conf, udp: Drop mostly duplicated dns_send arrays, rename related fields"), but I forgot to implement this part. In practice, they are usually the same thing, unless /etc/resolv.conf points to a loopback address. Fixes: 3a2afde87dd1 ("conf, udp: Drop mostly duplicated dns_send arrays, rename related fields") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Andrea Bolognani <abologna@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, tcp, udp: Exit if we fail to bind sockets for all given ports	Stefano Brivio	2023-02-16	1	-3/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	passt supports ranges of forwarded ports as well as 'all' for TCP and UDP, so it might be convenient to proceed if we fail to bind only some of the desired ports. But if we fail to bind even a single port for a given specification, we're clearly, unexpectedly, conflicting with another network service. In that case, report failure and exit. Reported-by: Yalan Zhang <yalzhang@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Make assertions actually useful	David Gibson	2023-02-12	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are some places in passt/pasta which #include <assert.h> and make various assertions. If we hit these something has already gone wrong, but they're there so that we a useful message instead of cryptic misbehaviour if assumptions we thought were correct turn out not to be. Except.. the glibc implementation of assert() uses syscalls that aren't in our seccomp filter, so we'll get a SIGSYS before it actually prints the message. Work around this by adding our own ASSERT() implementation using our existing err() function to log the message, and an abort(). The abort() probably also won't work exactly right with seccomp, but once we've printed the message, dying with a SIGSYS works just as well as dying with a SIGABRT. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use tap_send_frames()	David Gibson	2023-01-23	1	-134/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To send frames on the tap interface, the UDP uses a fairly complicated two level batching. First multiple frames are gathered into a single "message" for the qemu stream socket, then multiple messages are send with sendmmsg(). We now have tap_send_frames() which already deals with sending a number of frames, including batching and handling partial sends. Use that to considerably simplify things. This does make a couple of behavioural changes: * We used to split messages to keep them under 32kiB (except when a single frame was longer than that). The comments claim this is needed to stop qemu from closing the connection, but we don't have any equivalent logic for TCP. I wasn't able to reproduce the problem with this series, although it was apparently easy to reproduce earlier. My suspicion is that there was never an inherent need to keep messages small, however with larger messages (and default kernel buffer sizes) the chances of needing more than one resend for partial send()s is greatly increased. We used not to correctly handle that case of multiple resends, but now we do. * Previously when we got a partial send on UDP, we would resend the remainder of the entire "message", including multiple frames. The common code now only resends the remainder of a single frame, simply dropping any frames which weren't even partially sent. This is what TCP always did and is probably a better idea for UDP too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use abstracted tap header	David Gibson	2023-01-23	1	-55/+24
\| \| \| \| \| \| \| \| \|	Update the UDP code to use the tap layer abstractions for initializing and updating the L2 and lower headers. This will make adding other tap backends in future easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Parameterize ethernet header initializer macro	David Gibson	2023-01-23	1	-2/+2
\| \| \| \| \| \| \| \| \|	We have separate IPv4 and IPv6 versions of a macro to construct an initializer for ethernet headers. However, now that we have htons_constant it's easy to simply paramterize this with the ethernet protocol number. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Use named field initializers in iov_init functions	David Gibson	2023-01-23	1	-9/+4
\| \| \| \| \| \| \| \| \| \|	Both the TCP and UDP iov_init functions have some large structure literals defined in "field order" style. These are pretty hard to read since it's not obvious what value corresponds to what field. Use named field style initializers instead to make this clearer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't use separate sockets to listen for spliced packets	David Gibson	2023-01-13	1	-40/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, when ports are forwarded inbound in pasta mode, we open two sockets for incoming traffic: one listens on the public IP address and will forward packets to the tuntap interface. The other listens on localhost and forwards via "splicing" (resending directly via sockets in the ns). Now that we've improved the logic about whether we "splice" any individual packet, we don't need this. Instead we can have a single socket bound to 0.0.0.0 or ::, marked as able to splice and udp_sock_handler() will deal with each packet as appropriate. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Decide whether to "splice" per datagram rather than per socket	David Gibson	2023-01-13	1	-19/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we have special sockets for receiving datagrams from locahost which can use the optimized "splice" path rather than going across the tap interface. We want to loosen this so that sockets can receive sockets that will be forwarded by both the spliced and non-spliced paths. To do this, we alter the meaning of the @splice bit in the reference to mean that packets receieved on this socket can be spliced, not that they will be spliced. They'll only actually be spliced if they come from 127.0.0.1 or ::1. We can't (for now) remove the splice bit entirely, unlike with TCP. Our gateway mapping means that if the ns initiates communication to the gw address, we'll translate that to target 127.0.0.1 on the host side. Reply packets will therefore have source address 127.0.0.1 when received on the host, but these need to go via the tap path where that will be translated back to the gateway address. We need the @splice bit to distinguish that case from packets going from localhost to a port mapped explicitly with -u which should be spliced. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Unify udp_sock_handler_splice() with udp_sock_handler()	David Gibson	2023-01-13	1	-60/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	These two functions now have a very similar structure, and their first part (calling recvmmsg()) is functionally identical. So, merge the two functions into one. This does have the side effect of meaning we no longer receive multiple packets at once for splice (we already didn't for tap). This does hurt throughput for small spliced packets, but improves it for large spliced packets and tap packets. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Pre-populate msg_names with local address	David Gibson	2023-01-13	1	-22/+18
\| \| \| \| \| \| \| \| \| \| \|	udp_splice_namebuf is now used only for spliced sending, and so it is only ever populated with the localhost address, either IPv4 or IPv6. So, replace the awkward initialization in udp_sock_handler_splice() with statically initialized versions for IPv4 and IPv6. We then just need to update the port number in udp_sock_handler_splice(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't handle tap receive batch size calculation within a #define	David Gibson	2023-01-13	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	UDP_MAX_FRAMES gives the maximum number of datagrams we'll ever handle as a batch for sizing our buffers and control structures. The subtly different UDP_TAP_FRAMES gives the maximum number of datagrams we'll actually try to receive at once for tap packets in the current configuration. This depends on the mode, meaning that the macro has a non-obvious dependency on the usual 'c' context variable being available. We only use it in one place, so it makes more sense to open code this. Add an explanatory comment while we're there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split receive from preparation and send in udp_sock_handler()	David Gibson	2023-01-13	1	-27/+52
\| \| \| \| \| \| \| \| \| \| \|	The receive part of udp_sock_handler() and udp_sock_handler_splice() is now almost identical. In preparation for merging that, split the receive part of udp_sock_handler() from the part preparing and sending the frames for sending on the tap interface. The latter goes into a new udp_tap_send() function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split sending to passt tap interface into separate function	David Gibson	2023-01-13	1	-58/+72
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The last part of udp_sock_handler() does the actual sending of frames to the tap interface. For pasta that's just a call to udp_tap_send_pasta() but for passt, it's moderately complex and open coded. For symmetry, move the passt send path into its own function, udp_tap_send_passt(). This will make it easier to abstract the tap interface in future (e.g. when we want to add vhost-user). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Move sending pasta tap frames to the end of udp_sock_handler()	David Gibson	2023-01-13	1	-19/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_handler() has a surprising difference in flow between pasta and passt mode: For pasta we send each frame to the tap interface as we prepare it. For passt, though, we prepare all the frames, then send them with a single sendmmsg(). Alter the pasta path to also prepare all the frames, then send them at the end. We already have a suitable data structure for the passt case. This will make it easier to abstract out the tap backend difference in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>