passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	passt, util: Close any open file that the parent might have leaked	Stefano Brivio	2024-08-08	6	-7/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a parent accidentally or due to implementation reasons leaks any open file, we don't want to have access to them, except for the file passed via --fd, if any. This is the case for Podman when Podman's parent leaks files into Podman: it's not practical for Podman to close unrelated files before starting pasta, as reported by Paul. Use close_range(2) to close all open files except for standard streams and the one from --fd. Given that parts of conf() depend on other files to be already opened, such as the epoll file descriptor, we can't easily defer this to a more convenient point, where --fd was already parsed. Introduce a minimal, duplicate version of --fd parsing to keep this simple. As we need to check that the passed --fd option doesn't exceed INT_MAX, because we'll parse it with strtol() but file descriptor indices are signed ints (regardless of the arguments close_range() take), extend the existing check in the actual --fd parsing in conf(), also rejecting file descriptors numbers that match standard streams, while at it. Suggested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>
*	nstool: Propagate SIGTERM to processes executed in the namespace	David Gibson	2024-08-07	1	-2/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Particularly in shell it's sometimes natural to save the pid from a process run and later kill it. If doing this with nstool exec, however, it will kill nstool itself, not the program it is running, which isn't usually what you want or expect. Address this by having nstool propagate SIGTERM to its child process. It may make sense to propagate some other signals, but some introduce extra complications, so we'll worry about them when and if it seems useful. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	nstool: Fix some trivial typos	David Gibson	2024-08-07	1	-2/+2
\| \| \| \| \|	Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Avoid duplicate calls to logtime()	David Gibson	2024-08-07	1	-9/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We use logtime() to get a timestamp for the log in two places: - in vlogmsg(), which is used only for debug_print messages - in logfile_write() which is only used messages to the log file These cases are mutually exclusive, so we don't ever print the same message with different timestamps, but that's not particularly obvious to see. It's possible future tweaks to logging logic could mean we log to two different places with different timestamps, which would be confusing. Refactor to have a single logtime() call in vlogmsg() and use it for all the places we need it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Handle errors from clock_gettime()	David Gibson	2024-08-07	1	-15/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	clock_gettime() can, theoretically, fail, although it probably won't until 2038 on old 32-bit systems. Still, it's possible someone could run with a wildly out of sync clock, or new errors could be added, or it could fail due to a bug in libc or the kernel. We don't handle this well. In the debug_print case in vlogmsg we'll just ignore the failure, and print a timestamp based on uninitialised garbage. In logfile_write() we exit early and won't log anything at all, which seems like a good way to make an already weird situation undebuggable. Add some helpers to instead handle this by using "<error>" in place of a timestamp if something goes wrong with clock_gettime(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Correct formatting of timestamps	David Gibson	2024-08-07	1	-12/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	logtime_fmt_and_arg() is a rather odd macro, producing both a format string and an argument, which can only be used in quite specific printf() like formulations. It also has a significant bug: it tries to display 4 digits after the decimal point (so down to tenths of milliseconds) using %04i. But the field width in printf() is always a minimum not maximum field width, so this will not truncate the given value, but will redisplay the entire tenth-of-milliseconds difference again after the decimal point. Replace the macro with an snprintf() like function which will format the timestamp, and use an explicit % to correct the display. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Make logtime_fmt() static] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Some corrections for timespec_diff_us	David Gibson	2024-08-07	2	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The comment for timespec_diff_us() claims it will wrap after 2^64µs. This is incorrect for two reasons: * It returns a long long, which is probably 64-bits, but might not be * It returns a signed value, so even if it is 64 bits it will wrap after 2^63µs Correct the comment and use an explicitly 64-bit type to avoid that imprecision. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, pasta: Make -g and -a skip route/addresses copy for matching IP ↵	Stefano Brivio	2024-08-07	4	-22/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	version only Paul reports that setting IPv4 address and gateway manually, using --address and --gateway, causes pasta to fail inserting IPv6 routes in a setup where multiple, inter-dependent IPv6 routes are present on the host. That's because, currently, any -g option implies --no-copy-routes altogether, and any -a implies --no-copy-addrs. Limit this implication to the matching IP version, instead, by having two copies of no_copy_routes and no_copy_addrs in the context structure, separately for IPv4 and IPv6. While at it, change them to 'bool': we had them as 'int' because getopt_long() used to set them directly, but it hasn't been the case for a while already. Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	log, passt: Keep printing to stderr when passt is running in foreground2024_08_06.ee36266	Stefano Brivio	2024-08-06	3	-9/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are two cases where we want to stop printing to stderr: if it's closed, and if pasta spawned a shell (and --debug wasn't given). But if passt is running in foreground, we currently stop to report any message, even error messages, once we're ready, as reported by Laurent, because we set the log_runtime flag, which we use to indicate we're ready, regardless of whether we're running in foreground or not. Turn that flag (back) to log_stderr, and set it only when we really want to stop printing to stderr. Reported-by: Laurent Vivier <lvivier@redhat.com> Fixes: afd9cdc9bb48 ("log, passt: Always print to stderr before initialisation is complete") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Fix side in OUT_WAIT flag setting	Stefano Brivio	2024-08-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the "from" (input) side for a given transfer is 0, and we can't complete the write right away, what we need to be waiting for is for output readiness on side 1, not 0, and the other way around as well. This causes random transfer failures for local TCP connections, depending if we ever need to wait for output readiness. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/issues/23517 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com>
*	util: Use unsigned (size_t) value for iov length	David Gibson	2024-08-06	2	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	The "correct" type for the length of an IOV is unclear: writev() and readv() use an int, but sendmsg() and recvmsg() use a size_t. Using the unsigned size_t has some advantages, though, and it makes more sense for the case of write_remainder. Using size_t throughout here means we don't have a signed vs. unsigned comparison, and we don't have to deal with the case of iov_skip_bytes() returning a value which becomes negative when assigned to an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp_flow: move all udp_flow functions to udp_flow.c	Laurent Vivier	2024-08-05	4	-261/+284
\| \| \| \| \| \| \| \| \| \| \|	No code change. They need to be exported to be available by the vhost-user version of passt. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp_flow: Remove udp_meta_t from the parameters of udp_flow_from_sock()	Laurent Vivier	2024-08-05	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \|	To be used with the vhost-user version of udp.c, we need to export the udp_flow functions. To avoid to export udp_meta_t too that is specific to the socket version of udp.c, don't pass udp_meta_t to it, but the only needed field, s_in. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Make logfile_write() private	David Gibson	2024-08-05	2	-171/+171
\| \| \| \| \| \| \| \| \| \|	logfile_write() is not used outside log.c, nor should it be. It should only be used externall via the general logging functions. Make it static in log.c. To avoid forward declarations this requires moving a bunch of functions earlier in the file. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Save errno on signal handler entry, restore on return when needed	Stefano Brivio	2024-08-05	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ed reported this: # Error: pasta failed with exit code 1: # Couldn't drop cap 3 from bounding set # : No child processes in a Podman CI run with tests being run in parallel. The error message itself, by the way, is fixed by commit 1cd773081f12 ("log: Drop newlines in the middle of the perror()-like messages"), but how can we possibly get ECHILD as failure code for prctl()? Well, we don't, but if we exit early enough, pasta_child_handler() might run before we're even done with isolation steps, and it calls waitid(), which sets errno. We need to restore it before returning from the signal handler (if we return after calling functions that might set it), as signal-safety(7) also implies: Fetching and setting the value of errno is async-signal-safe provided that the signal handler saves errno on entry and restores its value before returning. Eventually, we'll probably need to switch to signalfd(2) the day we want to implement multithreading, but this will do for the moment. Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/23478 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	pasta: modify hostname when detaching new namespace	Danish Prakash	2024-07-30	1	-0/+11
\| \| \| \| \| \| \| \| \| \| \|	When invoking pasta without any arguments, it's difficult to tell whether we are in the new namespace or not leaving users a bit confused. This change modifies the host namespace to add a prefix "pasta-" to make it a bit more obvious. Signed-off-by: Danish Prakash <contact@danishpraka.sh> [sbrivio: coding style fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Fix typo in README file	AbdAlRahman Gad	2024-07-29	1	-1/+1
\| \| \| \| \| \| \| \|	- remove duplicated 'the' in the 'Services' section Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fedora/rpkg: List myself as author for changelog entries	Stefano Brivio	2024-07-26	1	-1/+5
\| \| \| \| \| \|	...instead of the latest author for contrib/fedora. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Improve handling of partially received frames on qemu socket2024_07_26.57a21d2	David Gibson	2024-07-26	2	-14/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because the Unix socket to qemu is a stream socket, we have no guarantee of where the boundaries between recv() calls will lie. Typically they will lie on frame boundaries, because that's how qemu will send then, but we can't rely on it. Currently we handle this case by detecting when we have received a partial frame and performing a blocking recv() to get the remainder, and only then processing the frames. Change it so instead we save the partial frame persistently and include it as the first thing processed next time we receive data from the socket. This handles a number of (unlikely) cases which previously would not be dealt with correctly: * If qemu sent a partial frame then waited some time before sending the remainder, previously we could block here for an unacceptably long time * If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without doing the partial frame handling, which would put us out of sync with the stream from qemu * If a the blocking recv() only received some of the remainder of the frame, not all of it, we'd return leaving us out of sync with the stream again Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This is probably acceptable because it's an unlikely case in practice. If necessary we could mitigate this by using a true ring buffer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Correctly handle frames of odd length	David Gibson	2024-07-26	2	-1/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The Qemu socket protocol consists of a 32-bit frame length in network (BE) order, followed by the Ethernet frame itself. As far as I can tell, frames can be any length, with no particular alignment requirement. This means that although pkt_buf itself is aligned, if we have a frame of odd length, frames after it will have their frame length at an unaligned address. Currently we load the frame length by just casting a char pointer to (uint32_t ) and loading. Some platforms will generate a fatal trap on such an unaligned load. Even if they don't casting an incorrectly aligned pointer to (uint32_t ) is undefined behaviour, strictly speaking. Introduce a new helper to safely load a possibly unaligned value here. We assume that the compiler is smart enough to optimize this into nothing on platforms that provide performant unaligned loads. If that turns out not to be the case, we can look at improvements then. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Don't use EPOLLET on Qemu sockets	David Gibson	2024-07-26	1	-10/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however. Furthermore, using EPOLLET causes additional complications: 1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however we do set it when using pasta mode with --fd. This inconsistency doesn't seem to have broken anything, but it's odd. 2) EPOLLET requires that tap_handler_passt() loop until all data available is read (otherwise we may have data in the buffer but never get an event causing us to read it). We do that with a rather ugly goto. Worse, our condition for that goto appears to be incorrect. We'll only loop if rem is non-zero, which will only happen if we perform a blocking recv() for a partially received frame. We'll only perform that second recv() if the original recv() resulted in a partially read frame. As far as I can tell the original recv() could end on a frame boundary (never triggering the second recv()) even if there is additional data in the socket buffer. In that circumstance we wouldn't goto redo and could leave unprocessed frames in the qemu socket buffer indefinitely. This doesn't seem to have caused any problems in practice, but since there's no obvious reason to use EPOLLET here anyway, we might as well get rid of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Don't attempt to carry on if we get a bad frame length from qemu	David Gibson	2024-07-26	1	-9/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we receive a too-short or too-long frame from the QEMU socket, currently we try to skip it and carry on. That sounds sensible on first blush, but probably isn't wise in practice. If this happens, either (a) qemu has done something seriously unexpected, or (b) we've received corrupt data over a Unix socket. Or more likely (c), we have a bug elswhere which has put us out of sync with the stream, so we're trying to read something that's not a frame length as a frame length. Neither (b) nor (c) is really salvageable with the same stream. Case (a) might be ok, but we can no longer be confident qemu won't do something else we can't cope with. So, instead of just skipping the frame and trying to carry on, log an error and close the socket. As a bonus, establishing firm bounds on l2len early will allow simplifications to how we deal with the case where a partial frame is recv()ed. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Better report errors receiving from QEMU socket	David Gibson	2024-07-26	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we get an error on recv() from the QEMU socket, we currently don't print any kind of error. Although this can happen in a non-fatal situation such as a guest restarting, it's unusual enough that we realy should report something for debugability. Add an error message in this case. Also always report when the qemu connection closes for any reason, not just when it will cause us to exit (--one-off). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Fetch log times with CLOCK_MONOTONIC, not CLOCK_REALTIME	Stefano Brivio	2024-07-26	2	-3/+3
\| \| \| \| \| \| \| \| \|	We report relative timestamps in logs, so we want to avoid jumps in the system time. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	log: Initialise timestamp for relative log time also if we use a log file	Stefano Brivio	2024-07-26	3	-3/+4
\| \| \| \| \| \| \| \| \| \| \|	...not just for debug messages. Otherwise, timestamps in the log file are consistent but the starting point is not zero. Do this right away as we enter main(), so that the resulting timestamps are as closely as possible relative to when we start. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	log, util: Fix sub-second part in relative log time calculation	Stefano Brivio	2024-07-26	3	-27/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For some reason, in commit 01efc71ddd25 ("log, conf: Add support for logging to file"), I added calculations for relative logging timestamps using the difference for the seconds part only, not for accounting for the fractional part. Fix that by storing the initial timestamp, log_start, as a timespec struct, and by calculating the difference from the starting time. Do this in a macro as we need the same format in a few places. To calculate the difference, turn the existing timespec_diff_ms() to microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to use that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/lib/perf_report: Fix highlight	Stefano Brivio	2024-07-25	1	-1/+1
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Fix spurious test failure with systemd-resolved	David Gibson	2024-07-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	systemd-resolved has the rather strange behaviour of listening on the non-standard loopback address 127.0.0.53. Various changes we've made in passt mean that we now usually work fine on a host using systemd-resolved. However our tests still fail in this case. We have a special case for when the guest's resolv.conf needs to differ from the host's because the resolver is on a host loopback address. However, we only consider the case where the host resolver is on 127.0.0.1, not other loopback addresses. Correct this with a different test condition. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Broaden what we consider for DNS specific forwarding rules	David Gibson	2024-07-25	2	-7/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	passt/pasta has options to redirect DNS requests from the guest to a different server address on the host side. Currently, however, only UDP packets to port 53 are considered "DNS requests". This ignores DNS requests over TCP - less common, but certainly possible. It also ignores encrypted DNS requests on port 853. Extend the DNS forwarding logic to handle both of those cases. Link: https://github.com/containers/podman/issues/23239 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Refactor tests in fwd_nat_from_tap() for clarity	David Gibson	2024-07-25	1	-13/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, we start by handling the common case, where we don't translate the destination address, then we modify the tgt side for the special cases. In the process we do comparisons on the tentatively set fields in tgt, which obscures the fact that tgt should be an essentially pure function of ini, and risks people examining fields of tgt that are not yet initialized. To make this clearer, do all our tests on 'ini', constructing tgt from scratch on that basis. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Accept addresses enclosed by square brackets in port forwarding specifiers	Stefano Brivio	2024-07-25	1	-7/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Even though we don't use : as delimiter for the port, making square brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6 literals. We want IPv6 addresses there, but some users might still specify them out of habit. Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for IPv4 literals, but I had reports of users actually trying to use them (they're accepted by many tools). Allow square brackets for both IPv4 and IPv6 addresses, correct or not, they're harmless anyway. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Exit if we fail to bind a UNIX domain socket with explicit path	Stefano Brivio	2024-07-25	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tap_sock_unix_open(), if we have a given path for the socket from configuration, we don't need to loop over possible paths, so we exit the loop on the first iteration, unconditionally. But if we failed to bind() the socket to that explicit path, we should exit, instead of continuing. Otherwise we'll pretend we're up and running, but nobody can contact us, and this might be mildly confusing for users. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test: iperf3 3.16 introduces multiple threads, drop our own implementation ↵	Stefano Brivio	2024-07-25	6	-145/+127
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	of that Starting from iperf3 version 3.16, -P / --parallel spawns multiple clients as separate threads, instead of multiple streams serviced by the same thread. So we can drop our lib/test implementation to spawn several iperf3 client and server processes and finally simplify things quite a bit. Adjust number of threads and UDP sending bandwidth to values that seem to be more or less matching previous throughput tests on my setup. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Update names of symbols and slabinfo entries	Stefano Brivio	2024-07-25	1	-17/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	Differences in allocated Acpi-Parse entries are gone (at least) since the 6.1 Linux kernel series. I should run this on a 6.10 kernel, eventually, and adjust things further, as needed. Userspace symbols are also fairly different now: show whatever is more than 1 MiB at the moment. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Fix memory/passt tests, --netns-only is not a valid option for passt	Stefano Brivio	2024-07-25	2	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This used to work on my setup as I kept reusing an old mbuto (initramfs) image, but since commit 65923ba79877 ("conf: Accept duplicate and conflicting options, the last one wins"), --netns-only is, as originally intended, a pasta-only option. I had used --netns-only, here, to prevent passt from trying to detach its own user namespace, which is not permitted as we're in a chroot, see unshare(2). In turn, we need the chroot because passt can't pivot root directly into its own empty filesystem using an initramfs. Use switch_root into the tmpfs mountpoint instead of chroot, so that we can still detach user namespaces. Note that in the mbuto images, we can't switch to nobody as we have no password entries at all, so we need to detach a further user namespace before starting passt, to trick passt into running as UID 0. Given the new sequence, it's now more convenient to directly switch to a detached network namespace as well, which means we need to move the initialisation of the dummy network from the init script into the test script. Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au>
*	log: Drop newlines in the middle of the perror()-like messages	Stefano Brivio	2024-07-25	3	-22/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Calling vlogmsg() twice from logmsg_perror() results in this beauty: $ ./pasta -i foo Invalid interface name foo : No such device because the first part of the message, corresponding to the first call, doesn't end with a newline, and vlogmsg() adds it. Given that we can't easily append an argument (error description) to a variadic list, add a 'newline' parameter to all the functions that currently add a newline if missing, and disable that on the first call to vlogmsg() from logmsg_perror(). Not very pretty but I can't think of any solution that's less messy than this. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Change SO_PEEK_OFF support message to debug()	Stefano Brivio	2024-07-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This: $ ./pasta SO_PEEK_OFF not supported # is a bit annoying, and might trick users who face other issues into thinking that SO_PEEK_OFF not being supported on a given kernel is an actual issue. Even if SO_PEEK_OFF is supported by the kernel, that would be the only message displayed there, with default options, which looks a bit out of context. Switch that to debug(): now that Podman users can pass --debug too, we can find out quickly if it's supported or not, if SO_PEEK_OFF usage is suspected of causing any issue. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down	Stefano Brivio	2024-07-25	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we start pasta with some ports forwarded, but no --config-net, say: $ ./pasta -u 10001 and then use a local, non-loopback address to send traffic to that port, say: $ socat -u FILE:test UDP4:192.0.2.1:10001 pasta writes to the tap file descriptor, but if the interface is down, we get EIO and terminate. By itself, what I'm doing in this case is not very useful (I simply forgot to pass --config-net), but if we happen to have a DHCP client in the network namespace, the interface might still be down while somebody tries to send traffic to it, and exiting in that case is not really helpful. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Correctly update SO_PEEK_OFF when tcp_send_frames() drops frames	David Gibson	2024-07-24	2	-10/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When using the new SO_PEEK_OFF feature on TCP sockets, we must adjust the SO_PEEK_OFF value whenever we move conn->seq_to_tap backwards. Although it was discussed during development, somewhere during the shuffles the case where we move the pointer backwards because we lost frames while sending them to the guest. This can happen, for example, if the socket buffer on the Unix socket to qemu overflows. Fixing this is slightly complicated because we need to pass a non-const context pointer to some places we previously didn't need it. While we're there also fix a small stylistic issue in the function comment for tcp_revert_seq() - it was using spaces instead of tabs. Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: probe for SO_PEEK_OFF both in tcpv4 and tcp6	Jon Maloy	2024-07-23	1	-12/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Based on an original patch by Jon Maloy: -- The recently added socket option SO_PEEK_OFF is not supported for TCP/IPv6 sockets. Until we get that support into the kernel we need to test for support in both protocols to set the global 'peek_offset_cap´ to true. -- Compared to the original patch: - only check for SO_PEEK_OFF support for enabled IP versions - use sa_family_t instead of int to pass the address family around Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Rename UDP listening sockets	David Gibson	2024-07-19	6	-28/+23
\| \| \| \| \| \| \| \| \| \|	EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived sockets which can initiate new flows. Rename to EPOLL_TYPE_UDP_LISTEN and associated functions to match. Along with that, remove the .orig field from union udp_listen_epoll_ref, since it is now always true. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove rdelta port forwarding maps	David Gibson	2024-07-19	4	-67/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In addition to the struct fwd_ports used by both UDP and TCP to track port forwarding, UDP also included an 'rdelta' field, which contained the reverse mapping of the main port map. This was used so that we could properly direct reply packets to a forwarded packet where we change the destination port. This has now been taken over by the flow table: reply packets will match the flow of the originating packet, and that gives the correct ports on the originating side. So, eliminate the rdelta field, and with it struct udp_fwd_ports, which now has no additional information over struct fwd_ports. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove obsolete socket tracking	David Gibson	2024-07-19	1	-91/+1
\| \| \| \| \| \| \| \| \|	Now that UDP datagrams are all directed via the flow table, we no longer use the udp_tap_map[ or udp_act[] arrays. Remove them and connected code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Direct datagrams from host to guest via flow table	David Gibson	2024-07-19	1	-134/+51
\| \| \| \| \| \| \| \| \| \| \|	This replaces the last piece of existing UDP port tracking with the common flow table. Specifically use the flow table to direct datagrams from host sockets to the guest tap interface. Since this now requires a flow for every datagram, we add some logging if we encounter any datagrams for which we can't find or create a flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Find or create flows for datagrams from tap interface	David Gibson	2024-07-19	3	-117/+100
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we create flows for datagrams from socket interfaces, and use them to direct "spliced" (socket to socket) datagrams. We don't yet match datagrams from the tap interface to existing flows, nor create new flows for them. Add that functionality, matching datagrams from tap to existing flows when they exist, or creating new ones. As with spliced flows, when creating a new flow from tap to socket, we create a new connected socket to receive reply datagrams attached to that flow specifically. We extend udp_flow_sock_handler() to handle reply packets bound for tap rather than another socket. For non-obvious reasons (perhaps increased stack usage?), this caused a failure for me when running under valgrind, because valgrind invoked rt_sigreturn which is not in our seccomp filter. Since we already allow rt_sigaction and others in the valgrind target, it seems reasonable to add rt_sigreturn as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove obsolete splice tracking	David Gibson	2024-07-19	2	-51/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that spliced datagrams are managed via the flow table, remove UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With those removed, the 'ts' field in udp_splice_port is also no longer used. struct udp_splice_port now contains just a socket fd, so replace it with a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still used for tracking of automatic port forwarding. Finally, the 'splice' field of union udp_epoll_ref is no longer used so remove it as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Handle "spliced" datagrams with per-flow sockets	David Gibson	2024-07-19	9	-264/+226
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Create flows for datagrams from originating sockets	David Gibson	2024-07-19	6	-5/+242
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This implements the first steps of tracking UDP packets with the flow table rather than its own (buggy) set of port maps. Specifically we create flow table entries for datagrams received from a socket (PIF_HOST or PIF_SPLICE). When splitting datagrams from sockets into batches, we group by the flow as well as splicesrc. This may result in smaller batches, but makes things easier down the line. We can re-optimise this later if necessary. For now we don't do anything else with the flow, not even match reply packets to the same flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Update flow forwarding logic for UDP	David Gibson	2024-07-19	1	-4/+23
\| \| \| \| \| \| \| \| \| \| \|	Add logic to the fwd_nat_from_*() functions to forwarding UDP packets. The logic here doesn't exactly match our current forwarding, since our current forwarding has some very strange and buggy edge cases. Instead it's attempting to replicate what appears to be the intended logic behind the current forwarding. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, icmp: Use general flow forwarding rules for ICMP	David Gibson	2024-07-19	2	-38/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Current ICMP hard codes its forwarding rules, and never applies any translations. Change it to use the flow_target() function, so that it's translated the same as TCP (excluding TCP specific port redirection). This means that gw mapping now applies to ICMP so "ping <gw address>" will now ping the host's loopback instead of the actual gw machine. This removes the surprising behaviour that the target you ping might not be the same as you connect to with TCP. This removes the last user of flow_target_af(), so that's removed as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>