passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	udp: Update UDP "connection" timestamps in both directions	David Gibson	2022-12-06	1	-2/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A UDP pseudo-connection between port A in the init namespace and port B in the pasta guest namespace involves two sockets: udp_splice_init[v6][B] and udp_splice_ns[v6][A]. The socket which originated this "connection" will be permanent but the other one will be closed on a timeout. When we get a packet from the originating socket, we update the timeout on the other socket, but we don't do the same when we get a reply packet from the other socket. However any activity on the "connection" probably indicates that it's still in use. Without this we could incorrectly time out a "connection" if it's using a protocol which involves a single initiating packet, but which then gets continuing replies from the target. Correct this by updating the timeout on both sockets for a packet in either direction. This also updates the timestamps for the permanent originating sockets which is unnecessary, but harmless. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't explicitly track originating socket for spliced "connections"	David Gibson	2022-12-06	1	-61/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we look up udp_splice_to_ns[][].orig_sock in udp_sock_handler_splice() we're finding the socket on which the originating packet for the "connection" was received on. However, we don't specifically need this socket to be the originating one - we just need one that's bound to the the source port of this reply packet in the init namespace. We can look this up in udp_splice_to_init[v6][src].target_sock, whose defining characteristic is exactly that. The same applies with init and ns swapped. In practice, of course, the port we locate this way will always be the originating port, since we couldn't have started this "connection" if it wasn't. Change this, and we no longer need the @orig_sock field at all. That leaves just @target_sock which we rename to simply @sock. The whole udp_splice_flow structure now more represents a single bound port than a "flow" per se, so rename and recomment it accordingly. Likewise the udp_splice_to_{ns,init} names are now misleading, since the ports in those maps are used in both directions. Rename them to udp_splice_{ns,init} indicating the location where the described socket is bound. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Re-use fixed bound sockets for packet forwarding when possible	David Gibson	2022-12-06	1	-9/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we look up udp_splice_to_ns[v6][src].target_sock in udp_sock_handler_splice, all we really require of the socket is that it be bound to port src in the pasta guest namespace. Similarly for udp_splice_to_init but bound in the init namespace. Usually these sockets are created temporarily by udp_splice_connect() and cleaned up by udp_timer(). However, depending on the -u and -U options its possible we have a permanent socket bound to the relevant port created by udp_sock_init(). If such a socket exists, we could use it instead of creating a temporary one. In fact we must use it, because we'll fail trying to bind() a temporary one to the same port. So allow this, store permanently bound sockets into udp_splice_to_{ns,init} in udp_sock_init(). These won't get incorrectly removed by the timer because we don't put a corresponding entry in the udp_act[] structure which directs the timer what to clean up. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't create double sockets for -U port	David Gibson	2022-12-06	1	-18/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For each IP version udp_socket() has 3 possible calls to sock_l4(). One is for the "non-spliced" bound socket in the init namespace, one for the "spliced" bound socket in the init namespace and one for the "spliced" bound socket in the pasta namespace. However when this is called to create a socket in the pasta namspeace there is a logic error which causes it to take the path for the init side spliced socket as well as the ns socket. This essentially tries to create two identical sockets on the ns side. Unsurprisingly the second bind() call fails according to strace. Correct this to only attempt to open one socket within the ns. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split splice field in udp_epoll_ref into (mostly) independent bits	David Gibson	2022-12-06	1	-27/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The @splice field in union udp_epoll_ref can have a number of values for different types of "spliced" packet flows. Split it into several single bit fields with more or less independent meanings. The new @splice field is just a boolean indicating whether the socket is associated with a spliced flow, making it identical to the @splice fiend in tcp_epoll_ref. The new bit @orig, indicates whether this is a socket which can originate new udp packet flows (created with -u or -U) or a socket created on the fly to handle reply socket. @ns indicates whether the socket lives in the init namespace or the pasta namespace. Making these bits more orthogonal to each other will simplify some future cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove the @bound field from union udp_epoll_ref	David Gibson	2022-12-06	1	-5/+3
\| \| \| \| \| \| \|	We set this field, but nothing ever checked it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't connect "forward" sockets for spliced flows	David Gibson	2022-12-06	1	-50/+35
\| \| \| \| \| \| \| \| \| \| \| \|	Currently we connect() the socket we use to forward spliced UDP flows. However, we now only ever use sendto() rather than send() on this socket so there's not actually any need to connect it. Don't do so. Rename a number of things that referred to "connect" or "conn" since that would now be misleading. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Always use sendto() rather than send() for forwarding spliced packets	David Gibson	2022-12-06	1	-33/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_handler_splice() has two different ways of sending out packets once it has determined the correct destination socket. For the originating sockets (which are not connected) it uses sendto() to specify a specific address. For the forward socket (which is connected) we use send(). However we know the correct destination address even for the forward socket we do also know the correct destination address. We can use this to use sendto() instead of send(), removing the need for two different paths and some staging data structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Separate tracking of inbound and outbound packet flows	David Gibson	2022-12-06	1	-57/+57
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Each entry udp_splice_map[v6][N] keeps information about two essentially unrelated packet flows. @ns_conn_sock, @ns_conn_ts and @init_bound_sock track a packet flow from port N in the host init namespace to some other port in the pasta namespace (the one @ns_conn_sock is connected to). @init_conn_sock, @init_conn_ts and @ns_bound_sock track packet flow from port N in the pasta namespace to some other port in the host init namespace (the one @init_conn_sock is connected to). Split udp_splice_map[][] into two separate tables for the two directions. Each entry in each table is a 'struct udp_splice_flow' with @orig_sock (previously the bound socket), @target_sock (previously the connected socket) and @ts (the timeout for the target socket). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Also bind() connected ports for "splice" forwarding	David Gibson	2022-12-06	1	-52/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pasta handles "spliced" port forwarding by resending datagrams received on a bound socket in the init namespace to a connected socket in the guest namespace. This means there are actually three ports associated with each "connection". First there's the source and destination ports of the originating datagram. That's also the destination port of the forwarded datagram, but the source port of the forwarded datagram is the kernel allocated bound address of the connected socket. However, by bind()ing as well as connect()ing the forwarding socket we can choose the source port of the forwarded datagrams. By choosing it to match the original source port we remove that surprising third port number and no longer need to store port numbers in struct udp_splice_port. As a bonus this means that the recipient of the packets will see the original source port if they call getpeername(). This rarely matters, but it can't hurt. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, udp: Drop mostly duplicated dns_send arrays, rename related fields	Stefano Brivio	2022-11-16	1	-9/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Given that we use just the first valid DNS resolver address configured, or read from resolv.conf(5) on the host, to forward DNS queries to, in case --dns-forward is used, we don't need to duplicate dns[] to dns_send[]: - rename dns_send[] back to dns[]: those are the resolvers we advertise to the guest/container - for forwarding purposes, instead of dns[], use a single field (for each protocol version): dns_host - and rename dns_fwd to dns_match, so that it's clear this is the address we are matching DNS queries against, to decide if they need to be forwarded Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, udp: Don't initialise IPv6/IPv4 sockets if IPv4/IPv6 are not enabled	Stefano Brivio	2022-11-10	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we disable a given IP version automatically (no corresponding default route on host) or administratively (--ipv4-only or --ipv6-only options), we don't initialise related buffers and services (DHCP for IPv4, NDP and DHCPv6 for IPv6). The "tap" handlers will also ignore packets with a disabled IP version. However, in commit 3c6ae625101a ("conf, tcp, udp: Allow address specification for forwarded ports") I happily changed socket initialisation functions to take AF_UNSPEC meaning "any enabled IP version", but I forgot to add checks back for the "enabled" part. Reported by Paul: on a host without default IPv6 route, but IPv6 enabled, connect, using IPv6, to a port handled by pasta, which tries to send data to a tap device without initialised buffers for that IP version and exits because the resulting write() fails. Simpler way to reproduce: pasta -6 and inbound IPv4 connection, or pasta -4 and inbound IPv6 connection. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 3c6ae625101a ("conf, tcp, udp: Allow address specification for forwarded ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Check for answers to forwarded DNS queries before handling local redirects	Stefano Brivio	2022-11-04	1	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \|	Now that we allow loopback DNS addresses to be used as targets for forwarding, we need to check if DNS answers come from those targets, before deciding to eventually remap traffic for local redirects. Otherwise, the source address won't match the one configured as forwarder, which means that the guest or the container will refuse those responses. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use typing to reduce chances of IPv4 endianness errors	David Gibson	2022-11-04	1	-15/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We recently corrected some errors handling the endianness of IPv4 addresses. These are very easy errors to make since although we mostly store them in network endianness, we sometimes need to manipulate them in host endianness. To reduce the chances of making such mistakes again, change to always using a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network endian addresses. This makes it harder to accidentally do arithmetic or comparisons on such addresses as if they were host endian. We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to directly work with struct in_addr values. This has the additional benefit of making the IPv4 and IPv6 paths more visually similar. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use IPV4_IS_LOOPBACK more widely	David Gibson	2022-11-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	This macro checks if an IPv4 address is in the loopback network (127.0.0.0/8). There are two places where we open code an identical check, use the macro instead. There are also a number of places we specifically exclude the loopback address (127.0.0.1), but we should actually be excluding anything in the loopback network. Change those sites to use the macro as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix port and address checks for DNS forwarder	Stefano Brivio	2022-10-15	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	First off, as we swap endianness for source ports in udp_fill_data_v{4,6}(), we want host endianness, not network endianness. It doesn't actually matter if we use htons() or ntohs() here, but the current version is confusing. In the IPv4 path, when we remap DNS answers, we already swapped the endianness as needed for the source port: don't swap it again, otherwise we'll not map DNS answers for IPv4. In the IPv6 path, when we remap DNS answers, we want to check that they came from our upstream DNS server, not the one configured via --dns-forward (which doesn't even need to exist for this functionality to work). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, tcp, udp: Allow specification of interface to bind to	Stefano Brivio	2022-10-15	1	-17/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since kernel version 5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for non-root users"), we can bind sockets to interfaces, if they haven't been bound yet (as in bind()). Introduce an optional interface specification for forwarded ports, prefixed by %, that can be passed together with an address. Reported use case: running local services that use ports we want to have externally forwarded: https://github.com/containers/podman/issues/14425 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Move logging functions to a new file, log.c	Stefano Brivio	2022-10-14	1	-0/+1
\| \| \| \| \| \| \| \|	Logging to file is going to add some further complexity that we don't want to squeeze into util.c. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Replace pragma to ignore bogus stringop-overread warning with workaround	Stefano Brivio	2022-09-29	1	-8/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit c318ffcb4c93 ("udp: Ignore bogus -Wstringop-overread for write() from gcc 12.1") uses a gcc pragma to ignore a bogus warning, which started appearing on gcc 12.1 (aarch64 and x86_64) due to: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103483 ...but gcc 12.2 has the same issue. Not just that: if LTO is enabled, the pragma itself is ignored (this wasn't the case with gcc 12.1), because of: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80922 Drop the pragma, and assign a frame (in the networking sense) pointer from the offset of the Ethernet header in the buffer, then pass it to write() and pcap(), so that gcc doesn't obsess anymore with the fact that an Ethernet header is 14 bytes and we're sending more than that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Fix widespread off-by-one error dealing with port numbers	David Gibson	2022-09-24	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Port numbers (for both TCP and UDP) are 16-bit, and so fit exactly into a 'short'. USHRT_MAX is therefore the maximum port number and this is widely used in the code. Unfortunately, a lot of those places don't actually want the maximum port number (USHRT_MAX == 65535), they want the total number of ports (65536). This leads to a number of potentially nasty consequences: * We have buffer overruns on the port_fwd::delta array if we try to use port 65535 * We have similar potential overruns for the tcp_sock_* arrays * Interestingly udp_act had the correct size, but we can calculate it in a more direct manner * We have a logical overrun of the ports bitmap as well, although it will just use an unused bit in the last byte so isnt harmful * Many loops don't consider port 65535 (which does mitigate some but not all of the buffer overruns above) * In udp_invert_portmap() we incorrectly compute the reverse port translation for return packets Correct all these by using a new NUM_PORTS defined explicitly for this purpose. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Treat port numbers as unsigned	David Gibson	2022-09-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	Port numbers are unsigned values, but we're storing them in (signed) int variables in some places. This isn't actually harmful, because int is large enough to hold the entire range of ports. However in places we don't want to use an in_port_t (usually to avoid overflow on the last iteration of a loop) it makes more conceptual sense to use an unsigned int. This will also avoid some problems with later cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Don't use indirect remap functions for conf_ports()	David Gibson	2022-09-24	1	-22/+0
\| \| \| \| \| \| \| \| \| \|	Now that we've delayed initialization of the UDP specific "reverse" map until udp_init(), the only difference between the various 'remap' functions used in conf_ports() is which array they target. So, simplify by open coding the logic into conf_ports() with a pointer to the correct mapping array. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Delay initialization of UDP reversed port mapping table	David Gibson	2022-09-24	1	-3/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because it's connectionless, when mapping UDP ports we need, in addition to the table of deltas for destination ports needed by TCP, we need an inverted table to translate the source ports on return packets. Currently we fill out the inverted table at the same time we construct the main table in udp_remap_to_tap() and udp_remap_to_init(). However, we don't use either table until after we've initialized UDP, so we can delay the construction of the reverse table to udp_init(). This makes the configuration more symmetric between TCP and UDP which will enable further cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Consolidate port forwarding configuration into a common structure	David Gibson	2022-09-24	1	-17/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The configuration for how to forward ports in and out of the guest/ns is divided between several different variables. For each connect direction and protocol we have a mode in the udp/tcp context structure, a bitmap of which ports to forward also in the context structure and an array of deltas to apply if the outward facing and inward facing port numbers are different. This last is a separate global variable, rather than being in the context structure, for no particular reason. UDP also requires an additional array which has the reverse mapping used for return packets. Consolidate these into a re-used substructure in the context structure. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Don't drop zero-length outbound UDP packets	David Gibson	2022-09-13	1	-7/+10
\| \| \| \| \| \| \| \| \| \| \| \| \|	udp_tap_handler() currently skips outbound packets if they have a payload length of zero. This is not correct, since in a datagram protocol zero length packets still have meaning. Adjust this to correctly forward the zero-length packets by using a msghdr with msg_iovlen == 0. Bugzilla: https://bugs.passt.top/show_bug.cgi?id=19 Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Don't pre-initialize msghdr array	David Gibson	2022-09-13	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \|	In udp_tap_handler() the array of msghdr structures, mm[], is initialized to zero. Since UIO_MAXIOV is 1024, this can be quite a large zero, which is expensive if we only end up using a few of its entries. It also makes it less obvious how we're setting all the control fields at the point we actually invoke sendmmsg(). Rather than pre-initializing it, just initialize each element as we use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Make substructures for IPv4 and IPv6 specific context information	David Gibson	2022-07-30	1	-31/+31
\| \| \| \| \| \| \| \| \| \| \| \|	The context structure contains a batch of fields specific to IPv4 and to IPv6 connectivity. Split those out into a sub-structure. This allows the conf_ip4() and conf_ip6() functions, which take the entire context but touch very little of it, to be given more specific parameters, making it clearer what it affects without stepping through the code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	Separate IPv4 and IPv6 configuration	David Gibson	2022-07-30	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After recent changes, conf_ip() now has essentially entirely disjoint paths for IPv4 and IPv6 configuration. So, it's cleaner to split them out into different functions conf_ip4() and conf_ip6(). Splitting these out also lets us make the interface a bit nicer, having them return success or failure directly, rather than manipulating c->v4 and c->v6 to indicate success/failure of the two versions. Since these functions may also initialize the interface index for each protocol, it turns out we can then drop c->v4 and c->v6 entirely, replacing tests on those with tests on whether c->ifi4 or c->ifi6 is non-zero (since a 0 interface index is never valid). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Whitespace fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Ignore bogus -Wstringop-overread for write() from gcc 12.1	Stefano Brivio	2022-05-19	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With current OpenSUSE Tumbleweed on aarch64 (gcc-12-1.3.aarch64) and on x86_64 (gcc-12-1.4.x86_64), but curiously not on armv7hl (gcc-12-1.3.armv7hl), gcc warns about using the _pointer_ to the 802.3 header to write the whole frame to the tap descriptor: reading between 62 and 4294967357 bytes from a region of size 14 which is bogus: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103483 Probably declaring udp_sock_fill_data_v{4,6}() as noinline would "fix" this, but that's on the data path, so I'd rather not. Use a gcc pragma instead. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, tcp, udp: Allow address specification for forwarded ports	Stefano Brivio	2022-05-01	1	-63/+99
\| \| \| \| \| \| \| \| \| \| \| \| \|	This feature is available in slirp4netns but was missing in passt and pasta. Given that we don't do dynamic memory allocation, we need to bind sockets while parsing port configuration. This means we need to process all other options first, as they might affect addressing and IP version support. It also implies a minor rework of how TCP and UDP implementations bind sockets. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Out-of-bounds read, CWE-125 in udp_timer()	Stefano Brivio	2022-04-07	1	-1/+1
\| \| \| \| \| \| \|	Not an actual issue due to how it's typically stored, but udp_act can also be used for ports 65528-65535. Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Unchecked return value from library, CWE-252	Stefano Brivio	2022-04-07	1	-1/+2
\| \| \| \| \| \| \|	All instances were harmless, but it might be useful to have some debug messages here and there. Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap, tcp, udp, icmp: Cut down on some oversized buffers	Stefano Brivio	2022-03-29	1	-2/+2
\| \| \| \| \| \| \| \| \|	The existing sizes provide no measurable differences in throughput and packet rates at this point. They were probably needed as batched implementations were not complete, but they can be decreased quite a bit now. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Move flags before ts in struct udp_tap_port, avoid end padding	Stefano Brivio	2022-03-29	1	-3/+3
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Mark constant references as const	Stefano Brivio	2022-03-29	1	-16/+18
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Packet abstraction with mandatory boundary checks	Stefano Brivio	2022-03-29	1	-20/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implement a packet abstraction providing boundary and size checks based on packet descriptors: packets stored in a buffer can be queued into a pool (without storage of its own), and data can be retrieved referring to an index in the pool, specifying offset and length. Checks ensure data is not read outside the boundaries of buffer and descriptors, and that packets added to a pool are within the buffer range with valid offset and indices. This implies a wider rework: usage of the "queueing" part of the abstraction mostly affects tap_handler_{passt,pasta}() functions and their callees, while the "fetching" part affects all the guest or tap facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6 handlers. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp, util: Enforce 24-bit limit on socket numbers	Stefano Brivio	2022-03-29	1	-0/+7
\| \| \| \| \| \| \|	This should never happen, but there are no formal guarantees: ensure socket numbers are below SOCKET_MAX. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use flags for local, loopback, and configured unicast binds	Stefano Brivio	2022-03-28	1	-25/+23
\| \| \| \| \| \| \| \| \| \| \|	There's no value in keeping a separate timestamp for activity and for aging of local binds, given that they have the same timeout. Reduce that to a single timestamp, with a flag indicating the local bind. Also use flags instead of separate int fields for loopback and configured unicast address usage as source address. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split buffer queueing/writing parts of udp_sock_handler()	Stefano Brivio	2022-03-28	1	-171/+193
\| \| \| \| \| \| \| \| \| \|	...it became too hard to follow: split it off to udp_sock_fill_data_v{4,6}. While at it, use IN6_ARE_ADDR_EQUAL(a, b), courtesy of netinet/in.h, instead of open-coded memcmp(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Drop _splice from recv, send, sendto static buffer names	Stefano Brivio	2022-03-28	1	-29/+23
\| \| \| \| \| \| \|	It's already implied by the fact they don't have "l2" in their names, and dropping it improves readability a bit. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Explicitly initialise sin6_scope_id and sin_zero in sockaddr_in{,6}	Stefano Brivio	2022-02-25	1	-0/+2
\| \| \| \| \| \| \|	Not functionally needed, but gcc versions 7 to 9 (at least) will issue a warning otherwise. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Receive batching doesn't pay off when writing single frames to tap	Stefano Brivio	2022-02-21	1	-16/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pasta mode, when we get data from sockets and write it as single frames to the tap device, we batch receive operations considerably, and then (conceptually) split the data in many smaller writes. It looked like an obvious choice, but performance is actually better if we receive data in many small frame-sized recvmsg()/recvmmsg(). The syscall overhead with the previous behaviour, observed by perf, comes predominantly from write operations, but receiving data in shorter chunks probably improves cache locality by a considerable amount. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Allow loopback connections from host using configured unicast address	Stefano Brivio	2022-02-21	1	-2/+18
\| \| \| \| \| \| \| \| \| \| \| \|	Likely for testing purposes only: allow connections from host to guest or namespace using, as connection target, the configured, possibly global unicast address. In this case, we have to map the destination address to a link-local address, and for port-based tracked responses, the source address needs to be again the unicast address: not loopback, not link-local. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, udp: Introduce basic DNS forwarding	Stefano Brivio	2022-02-21	1	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For compatibility with libslirp/slirp4netns users: introduce a mechanism to map, in the UDP routines, an address facing guest or namespace to the first IPv4 or IPv6 address resulting from configuration as resolver. This can be enabled with the new --dns-forward option. This implies that sourcing and using DNS addresses and search lists, passed via command line or read from /etc/resolv.conf, is not bound anymore to DHCP/DHCPv6/NDP usage: for example, pasta users might just want to use addresses from /etc/resolv.conf as mapping target, while not passing DNS options via DHCP. Reflect this in all the involved code paths by differentiating DHCP/DHCPv6/NDP usage from DNS configuration per se, and in the new options --dhcp-dns, --dhcp-search for pasta, and --no-dhcp-dns, --no-dhcp-search for passt. This should be the last bit to enable substantial compatibility between slirp4netns.sh and slirp4netns(1): pass the --dns-forward option from the script too. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt, pasta: Namespace-based sandboxing, defer seccomp policy application	Stefano Brivio	2022-02-21	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To reach (at least) a conceptually equivalent security level as implemented by --enable-sandbox in slirp4netns, we need to create a new mount namespace and pivot_root() into a new (empty) mountpoint, so that passt and pasta can't access any filesystem resource after initialisation. While at it, also detach IPC, PID (only for passt, to prevent vulnerabilities based on the knowledge of a target PID), and UTS namespaces. With this approach, if we apply the seccomp filters right after the configuration step, the number of allowed syscalls grows further. To prevent this, defer the application of seccomp policies after the initialisation phase, before the main loop, that's where we expect bad things to happen, potentially. This way, we get back to 22 allowed syscalls for passt and 34 for pasta, on x86_64. While at it, move #syscalls notes to specific code paths wherever it conceptually makes sense. We have to open all the file handles we'll ever need before sandboxing: - the packet capture file can only be opened once, drop instance numbers from the default path and use the (pre-sandbox) PID instead - /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection of bound ports in pasta mode, are now opened only once, before sandboxing, and their handles are stored in the execution context - the UNIX domain socket for passt is also bound only once, before sandboxing: to reject clients after the first one, instead of closing the listening socket, keep it open, accept and immediately discard new connection if we already have a valid one Clarify the (unchanged) behaviour for --netns-only in the man page. To actually make passt and pasta processes run in a separate PID namespace, we need to unshare(CLONE_NEWPID) before forking to background (if configured to do so). Introduce a small daemon() implementation, __daemon(), that additionally saves the PID file before forking. While running in foreground, the process itself can't move to a new PID namespace (a process can't change the notion of its own PID): mention that in the man page. For some reason, fork() in a detached PID namespace causes SIGTERM and SIGQUIT to be ignored, even if the handler is still reported as SIG_DFL: add a signal handler that just exits. We can now drop most of the pasta_child_handler() implementation, that took care of terminating all processes running in the same namespace, if pasta started a shell: the shell itself is now the init process in that namespace, and all children will terminate once the init process exits. Issuing 'echo $$' in a detached PID namespace won't return the actual namespace PID as seen from the init namespace: adapt demo and test setup scripts to reflect that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp, util: Fixes for bitmap handling on big-endian, casts	Stefano Brivio	2022-01-26	1	-2/+2
\| \| \| \| \| \| \| \|	Bitmap manipulating functions would otherwise refer to inconsistent sets of bits on big-endian architectures. While at it, fix up a couple of casts. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Drop <linux/ipv6.h> include, carry own ipv6hdr and opt_hdr definitions	Stefano Brivio	2022-01-26	1	-2/+0
\| \| \| \| \| \| \|	This is the only remaining Linux-specific include -- drop it to avoid clang-tidy warnings and to make code more portable. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Add cppcheck target, test, and address resulting warnings	Stefano Brivio	2021-10-21	1	-10/+8
\| \| \| \| \| \| \|	...mostly false positives, but a number of very relevant ones too, in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Avoid static initialiser for udp{4,6}_l2_buf	Stefano Brivio	2021-10-21	1	-18/+23
\| \| \| \| \| \|	With the new UDP_TAP_FRAMES value, the binary size grows considerably. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix maximum payload size calculation for IPv4 buffers, bump UDP_TAP_FRAMES	Stefano Brivio	2021-10-21	1	-2/+3
\| \| \| \| \| \| \| \|	The issue with a higher UDP_TAP_FRAMES was actually coming from a payload size the guest couldn't digest. Fix that, and bump UDP_TAP_FRAMES back to 128. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>