passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	tap: Don't use EPOLLET on Qemu sockets	David Gibson	2024-07-26	1	-10/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however. Furthermore, using EPOLLET causes additional complications: 1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however we do set it when using pasta mode with --fd. This inconsistency doesn't seem to have broken anything, but it's odd. 2) EPOLLET requires that tap_handler_passt() loop until all data available is read (otherwise we may have data in the buffer but never get an event causing us to read it). We do that with a rather ugly goto. Worse, our condition for that goto appears to be incorrect. We'll only loop if rem is non-zero, which will only happen if we perform a blocking recv() for a partially received frame. We'll only perform that second recv() if the original recv() resulted in a partially read frame. As far as I can tell the original recv() could end on a frame boundary (never triggering the second recv()) even if there is additional data in the socket buffer. In that circumstance we wouldn't goto redo and could leave unprocessed frames in the qemu socket buffer indefinitely. This doesn't seem to have caused any problems in practice, but since there's no obvious reason to use EPOLLET here anyway, we might as well get rid of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Don't attempt to carry on if we get a bad frame length from qemu	David Gibson	2024-07-26	1	-9/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we receive a too-short or too-long frame from the QEMU socket, currently we try to skip it and carry on. That sounds sensible on first blush, but probably isn't wise in practice. If this happens, either (a) qemu has done something seriously unexpected, or (b) we've received corrupt data over a Unix socket. Or more likely (c), we have a bug elswhere which has put us out of sync with the stream, so we're trying to read something that's not a frame length as a frame length. Neither (b) nor (c) is really salvageable with the same stream. Case (a) might be ok, but we can no longer be confident qemu won't do something else we can't cope with. So, instead of just skipping the frame and trying to carry on, log an error and close the socket. As a bonus, establishing firm bounds on l2len early will allow simplifications to how we deal with the case where a partial frame is recv()ed. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Better report errors receiving from QEMU socket	David Gibson	2024-07-26	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we get an error on recv() from the QEMU socket, we currently don't print any kind of error. Although this can happen in a non-fatal situation such as a guest restarting, it's unusual enough that we realy should report something for debugability. Add an error message in this case. Also always report when the qemu connection closes for any reason, not just when it will cause us to exit (--one-off). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Fetch log times with CLOCK_MONOTONIC, not CLOCK_REALTIME	Stefano Brivio	2024-07-26	2	-3/+3
\| \| \| \| \| \| \| \| \|	We report relative timestamps in logs, so we want to avoid jumps in the system time. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	log: Initialise timestamp for relative log time also if we use a log file	Stefano Brivio	2024-07-26	3	-3/+4
\| \| \| \| \| \| \| \| \| \| \|	...not just for debug messages. Otherwise, timestamps in the log file are consistent but the starting point is not zero. Do this right away as we enter main(), so that the resulting timestamps are as closely as possible relative to when we start. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	log, util: Fix sub-second part in relative log time calculation	Stefano Brivio	2024-07-26	3	-27/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For some reason, in commit 01efc71ddd25 ("log, conf: Add support for logging to file"), I added calculations for relative logging timestamps using the difference for the seconds part only, not for accounting for the fractional part. Fix that by storing the initial timestamp, log_start, as a timespec struct, and by calculating the difference from the starting time. Do this in a macro as we need the same format in a few places. To calculate the difference, turn the existing timespec_diff_ms() to microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to use that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/lib/perf_report: Fix highlight	Stefano Brivio	2024-07-25	1	-1/+1
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Fix spurious test failure with systemd-resolved	David Gibson	2024-07-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	systemd-resolved has the rather strange behaviour of listening on the non-standard loopback address 127.0.0.53. Various changes we've made in passt mean that we now usually work fine on a host using systemd-resolved. However our tests still fail in this case. We have a special case for when the guest's resolv.conf needs to differ from the host's because the resolver is on a host loopback address. However, we only consider the case where the host resolver is on 127.0.0.1, not other loopback addresses. Correct this with a different test condition. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Broaden what we consider for DNS specific forwarding rules	David Gibson	2024-07-25	2	-7/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	passt/pasta has options to redirect DNS requests from the guest to a different server address on the host side. Currently, however, only UDP packets to port 53 are considered "DNS requests". This ignores DNS requests over TCP - less common, but certainly possible. It also ignores encrypted DNS requests on port 853. Extend the DNS forwarding logic to handle both of those cases. Link: https://github.com/containers/podman/issues/23239 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Refactor tests in fwd_nat_from_tap() for clarity	David Gibson	2024-07-25	1	-13/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, we start by handling the common case, where we don't translate the destination address, then we modify the tgt side for the special cases. In the process we do comparisons on the tentatively set fields in tgt, which obscures the fact that tgt should be an essentially pure function of ini, and risks people examining fields of tgt that are not yet initialized. To make this clearer, do all our tests on 'ini', constructing tgt from scratch on that basis. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Accept addresses enclosed by square brackets in port forwarding specifiers	Stefano Brivio	2024-07-25	1	-7/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Even though we don't use : as delimiter for the port, making square brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6 literals. We want IPv6 addresses there, but some users might still specify them out of habit. Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for IPv4 literals, but I had reports of users actually trying to use them (they're accepted by many tools). Allow square brackets for both IPv4 and IPv6 addresses, correct or not, they're harmless anyway. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Exit if we fail to bind a UNIX domain socket with explicit path	Stefano Brivio	2024-07-25	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tap_sock_unix_open(), if we have a given path for the socket from configuration, we don't need to loop over possible paths, so we exit the loop on the first iteration, unconditionally. But if we failed to bind() the socket to that explicit path, we should exit, instead of continuing. Otherwise we'll pretend we're up and running, but nobody can contact us, and this might be mildly confusing for users. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test: iperf3 3.16 introduces multiple threads, drop our own implementation ↵	Stefano Brivio	2024-07-25	6	-145/+127
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	of that Starting from iperf3 version 3.16, -P / --parallel spawns multiple clients as separate threads, instead of multiple streams serviced by the same thread. So we can drop our lib/test implementation to spawn several iperf3 client and server processes and finally simplify things quite a bit. Adjust number of threads and UDP sending bandwidth to values that seem to be more or less matching previous throughput tests on my setup. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Update names of symbols and slabinfo entries	Stefano Brivio	2024-07-25	1	-17/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	Differences in allocated Acpi-Parse entries are gone (at least) since the 6.1 Linux kernel series. I should run this on a 6.10 kernel, eventually, and adjust things further, as needed. Userspace symbols are also fairly different now: show whatever is more than 1 MiB at the moment. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Fix memory/passt tests, --netns-only is not a valid option for passt	Stefano Brivio	2024-07-25	2	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This used to work on my setup as I kept reusing an old mbuto (initramfs) image, but since commit 65923ba79877 ("conf: Accept duplicate and conflicting options, the last one wins"), --netns-only is, as originally intended, a pasta-only option. I had used --netns-only, here, to prevent passt from trying to detach its own user namespace, which is not permitted as we're in a chroot, see unshare(2). In turn, we need the chroot because passt can't pivot root directly into its own empty filesystem using an initramfs. Use switch_root into the tmpfs mountpoint instead of chroot, so that we can still detach user namespaces. Note that in the mbuto images, we can't switch to nobody as we have no password entries at all, so we need to detach a further user namespace before starting passt, to trick passt into running as UID 0. Given the new sequence, it's now more convenient to directly switch to a detached network namespace as well, which means we need to move the initialisation of the dummy network from the init script into the test script. Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>
*	log: Drop newlines in the middle of the perror()-like messages	Stefano Brivio	2024-07-25	3	-22/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Calling vlogmsg() twice from logmsg_perror() results in this beauty: $ ./pasta -i foo Invalid interface name foo : No such device because the first part of the message, corresponding to the first call, doesn't end with a newline, and vlogmsg() adds it. Given that we can't easily append an argument (error description) to a variadic list, add a 'newline' parameter to all the functions that currently add a newline if missing, and disable that on the first call to vlogmsg() from logmsg_perror(). Not very pretty but I can't think of any solution that's less messy than this. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Change SO_PEEK_OFF support message to debug()	Stefano Brivio	2024-07-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This: $ ./pasta SO_PEEK_OFF not supported # is a bit annoying, and might trick users who face other issues into thinking that SO_PEEK_OFF not being supported on a given kernel is an actual issue. Even if SO_PEEK_OFF is supported by the kernel, that would be the only message displayed there, with default options, which looks a bit out of context. Switch that to debug(): now that Podman users can pass --debug too, we can find out quickly if it's supported or not, if SO_PEEK_OFF usage is suspected of causing any issue. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down	Stefano Brivio	2024-07-25	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we start pasta with some ports forwarded, but no --config-net, say: $ ./pasta -u 10001 and then use a local, non-loopback address to send traffic to that port, say: $ socat -u FILE:test UDP4:192.0.2.1:10001 pasta writes to the tap file descriptor, but if the interface is down, we get EIO and terminate. By itself, what I'm doing in this case is not very useful (I simply forgot to pass --config-net), but if we happen to have a DHCP client in the network namespace, the interface might still be down while somebody tries to send traffic to it, and exiting in that case is not really helpful. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Correctly update SO_PEEK_OFF when tcp_send_frames() drops frames	David Gibson	2024-07-24	2	-10/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When using the new SO_PEEK_OFF feature on TCP sockets, we must adjust the SO_PEEK_OFF value whenever we move conn->seq_to_tap backwards. Although it was discussed during development, somewhere during the shuffles the case where we move the pointer backwards because we lost frames while sending them to the guest. This can happen, for example, if the socket buffer on the Unix socket to qemu overflows. Fixing this is slightly complicated because we need to pass a non-const context pointer to some places we previously didn't need it. While we're there also fix a small stylistic issue in the function comment for tcp_revert_seq() - it was using spaces instead of tabs. Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: probe for SO_PEEK_OFF both in tcpv4 and tcp6	Jon Maloy	2024-07-23	1	-12/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Based on an original patch by Jon Maloy: -- The recently added socket option SO_PEEK_OFF is not supported for TCP/IPv6 sockets. Until we get that support into the kernel we need to test for support in both protocols to set the global 'peek_offset_cap´ to true. -- Compared to the original patch: - only check for SO_PEEK_OFF support for enabled IP versions - use sa_family_t instead of int to pass the address family around Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Rename UDP listening sockets	David Gibson	2024-07-19	6	-28/+23
\| \| \| \| \| \| \| \| \| \|	EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived sockets which can initiate new flows. Rename to EPOLL_TYPE_UDP_LISTEN and associated functions to match. Along with that, remove the .orig field from union udp_listen_epoll_ref, since it is now always true. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove rdelta port forwarding maps	David Gibson	2024-07-19	4	-67/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In addition to the struct fwd_ports used by both UDP and TCP to track port forwarding, UDP also included an 'rdelta' field, which contained the reverse mapping of the main port map. This was used so that we could properly direct reply packets to a forwarded packet where we change the destination port. This has now been taken over by the flow table: reply packets will match the flow of the originating packet, and that gives the correct ports on the originating side. So, eliminate the rdelta field, and with it struct udp_fwd_ports, which now has no additional information over struct fwd_ports. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove obsolete socket tracking	David Gibson	2024-07-19	1	-91/+1
\| \| \| \| \| \| \| \| \|	Now that UDP datagrams are all directed via the flow table, we no longer use the udp_tap_map[ or udp_act[] arrays. Remove them and connected code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Direct datagrams from host to guest via flow table	David Gibson	2024-07-19	1	-134/+51
\| \| \| \| \| \| \| \| \| \| \|	This replaces the last piece of existing UDP port tracking with the common flow table. Specifically use the flow table to direct datagrams from host sockets to the guest tap interface. Since this now requires a flow for every datagram, we add some logging if we encounter any datagrams for which we can't find or create a flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Find or create flows for datagrams from tap interface	David Gibson	2024-07-19	3	-117/+100
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we create flows for datagrams from socket interfaces, and use them to direct "spliced" (socket to socket) datagrams. We don't yet match datagrams from the tap interface to existing flows, nor create new flows for them. Add that functionality, matching datagrams from tap to existing flows when they exist, or creating new ones. As with spliced flows, when creating a new flow from tap to socket, we create a new connected socket to receive reply datagrams attached to that flow specifically. We extend udp_flow_sock_handler() to handle reply packets bound for tap rather than another socket. For non-obvious reasons (perhaps increased stack usage?), this caused a failure for me when running under valgrind, because valgrind invoked rt_sigreturn which is not in our seccomp filter. Since we already allow rt_sigaction and others in the valgrind target, it seems reasonable to add rt_sigreturn as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove obsolete splice tracking	David Gibson	2024-07-19	2	-51/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that spliced datagrams are managed via the flow table, remove UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With those removed, the 'ts' field in udp_splice_port is also no longer used. struct udp_splice_port now contains just a socket fd, so replace it with a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still used for tracking of automatic port forwarding. Finally, the 'splice' field of union udp_epoll_ref is no longer used so remove it as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Handle "spliced" datagrams with per-flow sockets	David Gibson	2024-07-19	9	-264/+226
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Create flows for datagrams from originating sockets	David Gibson	2024-07-19	6	-5/+242
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This implements the first steps of tracking UDP packets with the flow table rather than its own (buggy) set of port maps. Specifically we create flow table entries for datagrams received from a socket (PIF_HOST or PIF_SPLICE). When splitting datagrams from sockets into batches, we group by the flow as well as splicesrc. This may result in smaller batches, but makes things easier down the line. We can re-optimise this later if necessary. For now we don't do anything else with the flow, not even match reply packets to the same flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Update flow forwarding logic for UDP	David Gibson	2024-07-19	1	-4/+23
\| \| \| \| \| \| \| \| \| \| \|	Add logic to the fwd_nat_from_*() functions to forwarding UDP packets. The logic here doesn't exactly match our current forwarding, since our current forwarding has some very strange and buggy edge cases. Instead it's attempting to replicate what appears to be the intended logic behind the current forwarding. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, icmp: Use general flow forwarding rules for ICMP	David Gibson	2024-07-19	2	-38/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Current ICMP hard codes its forwarding rules, and never applies any translations. Change it to use the flow_target() function, so that it's translated the same as TCP (excluding TCP specific port redirection). This means that gw mapping now applies to ICMP so "ping <gw address>" will now ping the host's loopback instead of the actual gw machine. This removes the surprising behaviour that the target you ping might not be the same as you connect to with TCP. This removes the last user of flow_target_af(), so that's removed as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Flow based NAT and port forwarding for TCP	David Gibson	2024-07-19	7	-139/+245
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently the code to translate host side addresses and ports to guest side addresses and ports, and vice versa, is scattered across the TCP code. This includes both port redirection as controlled by the -t and -T options, and our special case NAT controlled by the --no-map-gw option. Gather this logic into fwd_nat_from_*() functions for each input interface in fwd.c which take protocol and address information for the initiating side and generates the pif and address information for the forwarded side. This performs any NAT or port forwarding needed. We create a flow_target() helper which applies those forwarding functions as needed to automatically move a flow from INI to TGT state. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Manage outbound socket address via flow table	David Gibson	2024-07-19	2	-14/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For now when we forward a ping to the host we leave the host side forwarding address and port blank since we don't necessarily know what source address and id will be used by the kernel. When the outbound address option is active, though, we do know the address at least, so we can record it in the flowside. Having done that, use it as the primary source of truth, binding the outgoing socket based on the information in there. This allows the possibility of more complex rules for what outbound address and/or id we use in future. To implement this we create a new helper which sets up a new socket based on information in a flowside, which will also have future uses. It behaves slightly differently from the existing ICMP code, in that it doesn't bind to a specific interface if given a loopback address. This is logically correct - the loopback address means we need to operate through the host's loopback interface, not ifname_out. We didn't need it in ICMP because ICMP will never generate a loopback address at this point, however we intend to change that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Helper to create sockets based on flowside	David Gibson	2024-07-19	4	-3/+102
\| \| \| \| \| \| \| \| \|	We have upcoming use cases where it's useful to create new bound socket based on information from the flow table. Add flowside_sock_l4() to do this for either PIF_HOST or PIF_SPLICE sockets. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Eliminate icmp_id_map	David Gibson	2024-07-19	1	-17/+2
\| \| \| \| \| \| \| \|	With previous reworks the icmp_id_map data structure is now maintained, but never used for anything. Eliminate it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Look up ping flows using flow hash	David Gibson	2024-07-19	1	-3/+15
\| \| \| \| \| \| \| \| \| \|	When we receive a ping packet from the tap interface, we currently locate the correct flow entry (if present) using an anciliary data structure, the icmp_id_map[] tables. However, we can look this up using the flow hash table - that's what it's for. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Obtain destination addresses from the flowsides	David Gibson	2024-07-19	3	-22/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	icmp_sock_handler() obtains the guest address from it's most recently observed IP. However, this can now be obtained from the common flowside information. icmp_tap_handler() builds its socket address for sendto() directly from the destination address supplied by the incoming tap packet. This can instead be generated from the flow. Using the flowsides as the common source of truth here prepares us for allowing more flexible NAT and forwarding by properly initialising that flowside information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Remove redundant id field from flow table entry	David Gibson	2024-07-19	2	-7/+5
\| \| \| \| \| \| \| \| \|	struct icmp_ping_flow contains a field for the ICMP id of the ping, but this is now redundant, since the id is also stored as the "port" in the common flowsides. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Re-use flow hash for initial sequence number generation	David Gibson	2024-07-19	3	-29/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We generate TCP initial sequence numbers, when we need them, from a hash of the source and destination addresses and ports, plus a timestamp. Moments later, we generate another hash of the same information plus some more to insert the connection into the flow hash table. With some tweaks to the flow_hash_insert() interface and changing the order we can re-use that hash table hash for the initial sequence number, rather than calculating another one. It won't generate identical results, but that doesn't matter as long as the sequence numbers are well scattered. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Generalise TCP hash table to general flow hash table	David Gibson	2024-07-19	5	-145/+172
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the data structures and helper functions for the TCP hash table to flow.c, making it a general hash table indexing sides of flows. This is largely code motion and straightforward renames. There are two semantic changes: * flow_lookup_af() now needs to verify that the entry has a matching protocol and interface as well as matching addresses and ports. * We double the size of the hash table, because it's now at least theoretically possible for both sides of each flow to be hashed. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, flow: Replace TCP specific hash function with general flow hash	David Gibson	2024-07-19	4	-53/+65
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP. Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the pif and the L4 protocol number. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Eliminate SPLICE_V6 flag	David Gibson	2024-07-19	2	-7/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	Since we're now constructing socket addresses based on information in the flowside, we no longer need an explicit flag to tell if we're dealing with an IPv4 or IPv6 connection. Hence, drop the now unused SPLICE_V6 flag. As well as just simplifying the code, this allows for possible future extensions where we could splice an IPv4 connection to an IPv6 connection or vice versa. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Simplify endpoint validation using flowside information	David Gibson	2024-07-19	2	-55/+18
\| \| \| \| \| \| \| \|	Now that we store all our endpoints in the flowside structure, use some inany helpers to make validation of those endpoints simpler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Manage outbound address via flow table	David Gibson	2024-07-19	1	-43/+50
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For now when we forward a connection to the host we leave the host side forwarding address and port blank since we don't necessarily know what source address and port will be used by the kernel. When the outbound address option is active, though, we do know the address at least, so we can record it in the flowside. Having done that, use it as the primary source of truth, binding the outgoing socket based on the information in there. This allows the possibility of more complex rules for what outbound address and/or port we use in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Obtain guest address from flowside	David Gibson	2024-07-19	3	-33/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we always deliver inbound TCP packets to the guest's most recent observed IP address. This has the odd side effect that if the guest changes its IP address with active TCP connections we might deliver packets from old connections to the new address. That won't work; it will probably result in an RST from the guest. Worse, if the guest added a new address but also retains the old one, then we could break those old connections by redirecting them to the new address. Now that we maintain flowside information, we have a record of the correct guest side address and can just use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, flow: Remove redundant information, repack connection structures	David Gibson	2024-07-19	3	-51/+47
\| \| \| \| \| \| \| \| \| \| \|	Some information we explicitly store in the TCP connection is now duplicated in the common flow structure. Access it from there instead, and remove it from the TCP specific structure. With that done we can reorder both the "tap" and "splice" TCP structures a bit to get better packing for the new combined flow table entries. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Common address information for target side	David Gibson	2024-07-19	8	-78/+153
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Common address information for initiating side	David Gibson	2024-07-19	6	-11/+127
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides. For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	doc: Extend zero-recv test with methods using msghdr	David Gibson	2024-07-17	1	-8/+52
\| \| \| \| \| \| \| \| \| \| \|	This test program verifies that we can receive and discard datagrams by using recv() with a NULL buffer and zero-length. Extend it to verify it also works using recvmsg() and either an iov with a zero-length NULL buffer or an iov that itself is NULL and zero-length. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Fixed printf() message in main of recv-zero.c] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	doc: Test behaviour of closing duplicate UDP sockets	David Gibson	2024-07-17	3	-2/+108
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	To simplify lifetime management of "listening" UDP sockets, UDP flow support needs to duplicate existing bound sockets. Those duplicates will be close()d when their corresponding flow expires, but we expect the original to still receive datagrams as always. That is, we expect the close() on the duplicate to remove the duplicated fd, but not to close the underlying UDP socket. Add a test program to doc/platform-requirements to verify this requirement. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Use parameterised macros for per-side event/flag bits	David Gibson	2024-07-17	2	-41/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Both the events and flags fields in tcp_splice_conn have several bits which are per-side, e.g. OUT_WAIT_0 for side 0 and OUT_WAIT_1 for side 1. This necessitates some rather awkward ternary expressions when we need to get the relevant bit for a particular side. Simplify this by using a parameterised macro for the bit values. This needs a ternary expression inside the macros, but makes the places we use it substantially clearer. That simplification in turn allows us to use a loop across each side to implement several things which are currently open coded to do equivalent things for each side in turn. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>