Commit message log
The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing. As discussed on a
recent call, let's try "our" instead.
(While we're there, correct an error in flow_initiate_af()'s comments,
where we referred to parameters by the wrong name.)
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As soon as the kernel notifier for IPv6 address configuration
(addrconf_notify()) sees that we bring the target interface up
(NETDEV_UP), it will schedule duplicate address detection, so, by
itself, setting the nodad flag later is useless, because that won't
stop a detection that's already in progress.
However, if we disable neighbour solicitations with IFF_NOARP (which
is a misnomer for IPv6 interfaces, but there's no possibility of
mixing things up), the notifier will not trigger DAD, because it can't
be done, of course, without neighbour solicitations.
Set IFF_NOARP as we bring up the device, and drop it after we've had a
chance to set the nodad attribute on the link.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
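As an aside, a minimal sketch of that ordering, shown here with the plain
SIOCGIFFLAGS/SIOCSIFFLAGS ioctl interface rather than passt's netlink
helpers (link_flags_change() is a hypothetical stand-in, not the actual
function):

#include <string.h>
#include <sys/ioctl.h>
#include <net/if.h>

/* Hypothetical helper: set and clear interface flags via ioctl.
 * fd is any socket (e.g. AF_INET/SOCK_DGRAM) used only for the ioctls.
 */
static int link_flags_change(int fd, const char *ifname, short set, short clear)
{
	struct ifreq ifr;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

	if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0)
		return -1;

	ifr.ifr_flags |= set;
	ifr.ifr_flags &= ~clear;

	return ioctl(fd, SIOCSIFFLAGS, &ifr);
}

/* 1. Bring the link up with NOARP already set: no neighbour
 *    solicitations, so addrconf_notify() can't start DAD on NETDEV_UP:
 *        link_flags_change(fd, "eth0", IFF_UP | IFF_NOARP, 0);
 * 2. Set the nodad attribute (IFA_F_NODAD) on the link-local address.
 * 3. Only then drop NOARP again:
 *        link_flags_change(fd, "eth0", 0, IFF_NOARP);
 */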
As soon as we bring up the interface, the Linux kernel will set up a
link-local address for it, so we can fetch it and start using it right
away, if we need a link-local address to communicate with the container
before we see any traffic coming from it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
It makes no sense for a container or a guest to try and perform
duplicate address detection for their link-local address, as we won't
relay neighbour solicitations with an unspecified source address
anyway.
While they perform duplicate address detection, the link-local address
is not usable, which prevents us from bringing up containers in
particular and communicating with them right away via IPv6.
This is not enough to prevent DAD and reach the container right away:
we'll need a couple more patches.
As we send NLM_F_REPLACE requests right away, while we still have to
read out other addresses on the same socket, we can't use nl_do():
keep track of the last sequence we sent (last address we changed), and
deal with the answers to those NLM_F_REPLACE requests in a separate
loop, later.
Link: https://github.com/containers/podman/pull/23561#discussion_r1711639663
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
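For reference, a rough sketch of what one of those NLM_F_REPLACE requests
looks like on the wire; this is a simplified stand-alone version
(addr_set_nodad() and its error handling are illustrative, not passt's
netlink code):

#include <string.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_addr.h>

/* Re-add an IPv6 address with IFA_F_NODAD set, replacing the existing
 * entry: the kind of RTM_NEWADDR + NLM_F_REPLACE request described
 * above.  ACK handling is omitted.
 */
static int addr_set_nodad(int nl_sock, unsigned int ifindex,
			  const struct in6_addr *addr,
			  unsigned char prefix_len, uint32_t seq)
{
	struct {
		struct nlmsghdr nlh;
		struct ifaddrmsg ifa;
		char attrs[64];
	} req;
	uint32_t flags = IFA_F_NODAD;
	struct rtattr *rta;

	memset(&req, 0, sizeof(req));
	req.nlh.nlmsg_type = RTM_NEWADDR;
	req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_REPLACE | NLM_F_ACK;
	req.nlh.nlmsg_seq = seq;	/* track to match the reply later */
	req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req.ifa));

	req.ifa.ifa_family = AF_INET6;
	req.ifa.ifa_prefixlen = prefix_len;
	req.ifa.ifa_index = ifindex;

	/* IFA_LOCAL: the address itself */
	rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nlh.nlmsg_len));
	rta->rta_type = IFA_LOCAL;
	rta->rta_len = RTA_LENGTH(sizeof(*addr));
	memcpy(RTA_DATA(rta), addr, sizeof(*addr));
	req.nlh.nlmsg_len = NLMSG_ALIGN(req.nlh.nlmsg_len) + RTA_ALIGN(rta->rta_len);

	/* IFA_FLAGS: 32-bit flags, carrying IFA_F_NODAD */
	rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nlh.nlmsg_len));
	rta->rta_type = IFA_FLAGS;
	rta->rta_len = RTA_LENGTH(sizeof(flags));
	memcpy(RTA_DATA(rta), &flags, sizeof(flags));
	req.nlh.nlmsg_len = NLMSG_ALIGN(req.nlh.nlmsg_len) + RTA_ALIGN(rta->rta_len);

	return send(nl_sock, &req, req.nlh.nlmsg_len, 0) < 0 ? -1 : 0;
}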
In the next patches, we'll reuse it to set flags other than IFF_UP.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As we'll use nl_link_up() for more than just bringing up devices, it
will become awkward to carry empty MTU values around whenever we call
it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We have a number of delays when we switch to new layouts that were
added to make the tests visually easier to follow, together with
blinking status bars. Shorten the delays and avoid blinking the
status bar if $FAST is set to 1 (no demo mode).
Shorten delays in busy loops to 10ms, instead of 100ms, and skip the
one-second fixed delay when we wait for the status of a command.
Cut the duration of throughput and latency tests to one second, down
from ten. Somewhat surprisingly, the results we get are rather
consistent, and not significantly different from what we'd get with
10 seconds.
This, together with Podman's commit 20f3e8909e3a ("test/system:
pasta_test_do add explicit port check"), cuts the time needed on my
setup for a full test run from approximately 37 minutes to...:
$ time ./run
[exited]
PASS: 165, FAIL: 0
Log at /home/sbrivio/passt/test/test_logs/test.log
real 15m34.253s
user 0m0.011s
sys 0m0.011s
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Using a zero port on TCP or UDP is dubious, and we can't really deal with
forwarding such a flow within the constraints of the socket API. Hence
we ASSERT()ed that we had non-zero ports in flow_hash().
The intention was to make sure that the protocol code sanitizes such ports
before completing a flow entry. Unfortunately, flow_hash() is also called
on new packets to see if they have an existing flow, so the unsanitized
guest packet can crash passt with the assert.
Correct this by moving the assert from flow_hash() to flow_sidx_hash()
which is only used on entries already in the table, not on unsanitized
data.
Reported-by: Matt Hamilton <matt@thmail.io>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
f6d5a5239264 moved handling of -D into a later loop. However, as a side
effect, it moved this from a switch block to an if block. I left a couple
of 'break' statements that don't make sense in the new context. They
should be 'continue' so that we go on to the next option, rather than
leaving the loop entirely.
Fixes: f6d5a5239264 ("conf: Delay handling -D option until after addresses are configured")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
- Add structs for NA, RA, NS, MTU, prefix info, option header,
link-layer address, RDNSS, DNSSL and link-layer for the RA message.
- Turn the NA message from purely imperative, built byte by byte,
into declarative, by filling in its struct.
- Turn part of the RA message into declarative form as well.
- Move packet_add() before the call to ndp() in tap6_handler()
when the packet's protocol is ICMPv6.
- Add a pool of packets as an additional parameter to ndp().
- Check the size of the NS packet with packet_get() before sending an NA
packet.
- Add documentation for the structs.
- Add an enum for NDP option types.
Link: https://bugs.passt.top/show_bug.cgi?id=21
Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
[sbrivio: Minor coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add_dns[46]() rely on the gateway address and c->no_map_gw being already
initialised, in order to properly handle DNS servers which need NAT to be
accessed from the guest.
Usually these are called from get_dns() which is well after the addresses
are configured, so that's fine. However, they can also be called earlier
if an explicit -D command line option is given. In this case no_map_gw
and/or c->ip[46].gw may not be initialised properly yet, leading to this
doing the wrong thing.
Luckily we already have a second pass of option parsing for things which
need addresses to already be configured. Move handling of -D to there.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These fields are described as being an address for an external, routable
interface. That's not necessarily the case when using -a. But, more
importantly, saying where the value comes from is not as useful as what
it's used for. The real purpose of these fields is to hold the address
which we assign to the guest via DHCP or --config-net.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we prefix the second part of messages printed through
logmsg_perror() with the timestamp, in debug mode, we'll have two
timestamps and a weird separator in the result, such as this beauty:
0.0013: Failed to clone process with detached namespaces0.0013: : Operation not permitted
Add a parameter to logmsg() and vlogmsg() which indicates a message
continuation. If that's set, don't print the timestamp in vlogmsg().
Link: https://github.com/moby/moby/issues/48257#issuecomment-2282875092
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Given that pasta supports specifying a command to be executed on the
command line, even without the usual -- separator as long as there's
no ambiguity, we shouldn't eat up options that are not meant for us.
Paul reports, for instance, that with:
pasta --config-net ip -6 route
-6 is taken by pasta to mean --ipv6-only, and we execute 'ip route'.
That's because getopt_long(), by default, shuffles the argument list
to shift non-option arguments to the end.
Avoid that by adding '+' at the beginning of 'optstring'.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
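A minimal demonstration of the getopt behaviour in question (the option
table below is made up for illustration, not pasta's real one):

#include <getopt.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	/* Illustrative options only */
	static const struct option options[] = {
		{ "config-net",	no_argument, NULL, 'C' },
		{ "ipv6-only",	no_argument, NULL, '6' },
		{ 0 },
	};
	int c;

	/* The leading '+' stops option parsing at the first non-option
	 * argument instead of permuting it to the end, so in
	 * "pasta --config-net ip -6 route" the "-6" is left for ip.
	 */
	while ((c = getopt_long(argc, argv, "+6", options, NULL)) != -1)
		printf("option for us: -%c\n", c);

	if (optind < argc)
		printf("command to execute starts at \"%s\"\n", argv[optind]);

	return 0;
}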
If a parent, accidentally or for implementation reasons, leaks any
open files, we don't want to have access to them, except for the file
passed via --fd, if any.
This is the case for Podman when Podman's parent leaks files into
Podman: it's not practical for Podman to close unrelated files before
starting pasta, as reported by Paul.
Use close_range(2) to close all open files except for standard streams
and the one from --fd.
Given that parts of conf() depend on other files to be already opened,
such as the epoll file descriptor, we can't easily defer this to a
more convenient point, where --fd was already parsed. Introduce a
minimal, duplicate version of --fd parsing to keep this simple.
As we need to check that the passed --fd option doesn't exceed
INT_MAX, because we'll parse it with strtol() but file descriptor
indices are signed ints (regardless of the argument types
close_range() takes), extend the existing check in the actual --fd
parsing in conf(), and, while at it, also reject file descriptor
numbers that match the standard streams.
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
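The close_range(2) part boils down to something like this (the helper
name is made up; the wrapper needs a reasonably recent kernel and glibc):

#define _GNU_SOURCE
#include <unistd.h>
#include <limits.h>

/* Close every inherited descriptor above the standard streams, except
 * keep_fd (the one given via --fd), using at most two close_range()
 * calls.  Pass keep_fd == -1 if no --fd was given.
 */
static void close_leaked_fds(int keep_fd)
{
	if (keep_fd <= STDERR_FILENO) {
		close_range(3, UINT_MAX, 0);
		return;
	}

	if (keep_fd > 3)
		close_range(3, keep_fd - 1, 0);

	close_range(keep_fd + 1, UINT_MAX, 0);
}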
Particularly in shell, it's sometimes natural to save the pid of a process
we run and kill it later. If doing this with nstool exec, however, it will
kill nstool itself, not the program it is running, which isn't usually what
you want or expect.
Address this by having nstool propagate SIGTERM to its child process. It
may make sense to propagate some other signals, but some introduce extra
complications, so we'll worry about them when and if it seems useful.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
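In outline, the propagation amounts to a small handler like this (a
sketch, not nstool's actual code):

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t child_pid = -1;

/* Forward SIGTERM to the spawned child, so that killing nstool by pid
 * behaves as users expect.  kill() is async-signal-safe.
 */
static void sigterm_handler(int signum)
{
	if (child_pid > 0)
		kill(child_pid, signum);
}

static int run(char **cmd_argv)
{
	struct sigaction sa;

	child_pid = fork();
	if (child_pid < 0)
		return -1;

	if (child_pid == 0) {
		execvp(cmd_argv[0], cmd_argv);
		_exit(127);
	}

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigterm_handler;
	sigaction(SIGTERM, &sa, NULL);

	return waitpid(child_pid, NULL, 0) < 0 ? -1 : 0;
}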
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We use logtime() to get a timestamp for the log in two places:
- in vlogmsg(), which is used only for debug_print messages
- in logfile_write(), which is used only for messages to the log file
These cases are mutually exclusive, so we don't ever print the same message
with different timestamps, but that's not particularly obvious to see.
It's possible future tweaks to logging logic could mean we log to two
different places with different timestamps, which would be confusing.
Refactor to have a single logtime() call in vlogmsg() and use it for all
the places we need it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
clock_gettime() can, theoretically, fail, although it probably won't until
2038 on old 32-bit systems. Still, it's possible someone could run with
a wildly out of sync clock, or new errors could be added, or it could fail
due to a bug in libc or the kernel.
We don't handle this well. In the debug_print case in vlogmsg() we'll just
ignore the failure, and print a timestamp based on uninitialised garbage.
In logfile_write() we exit early and won't log anything at all, which seems
like a good way to make an already weird situation undebuggable.
Add some helpers to instead handle this by using "<error>" in place of a
timestamp if something goes wrong with clock_gettime().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
logtime_fmt_and_arg() is a rather odd macro, producing both a format
string and an argument, which can only be used in quite specific printf()
like formulations. It also has a significant bug: it tries to display 4
digits after the decimal point (so down to tenths of milliseconds) using
%04i. But the field width in printf() is always a *minimum* not maximum
field width, so this will not truncate the given value, but will redisplay
the entire tenth-of-milliseconds difference again after the decimal point.
Replace the macro with an snprintf()-like function which formats the
timestamp, and use an explicit modulo to correct the display.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Make logtime_fmt() static]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
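The gist of the fix, assuming a helper along these lines (the signature is
illustrative, not necessarily passt's): the printf() field width only
pads, so the fractional part has to be truncated explicitly with a modulo
before formatting.

#include <stdint.h>
#include <stdio.h>

/* Format a microsecond difference as seconds with exactly four digits
 * (tenths of milliseconds) after the decimal point.
 */
static int logtime_fmt(char *buf, size_t size, int64_t diff_us)
{
	return snprintf(buf, size, "%lli.%04lli",
			(long long)(diff_us / 1000000),
			(long long)((diff_us % 1000000) / 100));
}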
The comment for timespec_diff_us() claims it will wrap after 2^64µs. This
is incorrect for two reasons:
* It returns a long long, which is probably 64 bits, but might not be
* It returns a signed value, so even if it is 64 bits it will wrap after
2^63µs
Correct the comment and use an explicitly 64-bit type to avoid that
imprecision.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
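The corrected helper could look roughly like this (a sketch with the
explicitly 64-bit return type; it assumes a >= b):

#include <stdint.h>
#include <time.h>

/* Difference between two timestamps, in microseconds */
static int64_t timespec_diff_us(const struct timespec *a,
				const struct timespec *b)
{
	if (a->tv_nsec < b->tv_nsec) {
		return (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000 +
		       (int64_t)(a->tv_sec - b->tv_sec - 1) * 1000000;
	}

	return (a->tv_nsec - b->tv_nsec) / 1000 +
	       (int64_t)(a->tv_sec - b->tv_sec) * 1000000;
}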
Paul reports that setting IPv4 address and gateway manually, using
--address and --gateway, causes pasta to fail inserting IPv6 routes
in a setup where multiple, inter-dependent IPv6 routes are present
on the host.
That's because, currently, any -g option implies --no-copy-routes
altogether, and any -a implies --no-copy-addrs.
Limit this implication to the matching IP version, instead, by having
two copies of no_copy_routes and no_copy_addrs in the context
structure, separately for IPv4 and IPv6.
While at it, change them to 'bool': we had them as 'int' because
getopt_long() used to set them directly, but it hasn't been the case
for a while already.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).
But if passt is running in the foreground, we currently stop reporting
any messages, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in the foreground or not.
Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.
Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb48 ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If the "from" (input) side for a given transfer is 0, and we can't
complete the write right away, what we need to wait for is output
readiness on side 1, not 0, and the other way around as well.
This causes random transfer failures for local TCP connections,
depending on whether we ever need to wait for output readiness.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/23517
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
The "correct" type for the length of an IOV is unclear: writev() and
readv() use an int, but sendmsg() and recvmsg() use a size_t. Using the
unsigned size_t has some advantages, though, and it makes more sense for
the case of write_remainder. Using size_t throughout here means we don't
have a signed vs. unsigned comparison, and we don't have to deal with
the case of iov_skip_bytes() returning a value which becomes negative
when assigned to an integer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
No code change.
They need to be exported to be available to the vhost-user version of
passt.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To be used with the vhost-user version of udp.c, we need to export the
udp_flow functions. To avoid also exporting udp_meta_t, which is
specific to the socket version of udp.c, don't pass udp_meta_t to it,
but only the needed field, s_in.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
logfile_write() is not used outside log.c, nor should it be. It should
only be used externally via the general logging functions. Make it static
in log.c. To avoid forward declarations this requires moving a bunch of
functions earlier in the file.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Ed reported this:
# Error: pasta failed with exit code 1:
# Couldn't drop cap 3 from bounding set
# : No child processes
in a Podman CI run with tests being run in parallel. The error message
itself, by the way, is fixed by commit 1cd773081f12 ("log: Drop
newlines in the middle of the perror()-like messages"), but how can we
possibly get ECHILD as failure code for prctl()?
Well, we don't, but if we exit early enough, pasta_child_handler()
might run before we're even done with isolation steps, and it calls
waitid(), which sets errno. We need to restore it before returning
from the signal handler (if we return after calling functions that
might set it), as signal-safety(7) also implies:
Fetching and setting the value of errno is async-signal-safe
provided that the signal handler saves errno on entry and
restores its value before returning.
Eventually, we'll probably need to switch to signalfd(2) the day we
want to implement multithreading, but this will do for the moment.
Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/23478
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
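The save/restore pattern from signal-safety(7) looks like this in a
handler that reaps children (an illustrative shape, not
pasta_child_handler() verbatim):

#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

static void sigchld_handler(int signum)
{
	int errno_save = errno;		/* waitid() below may clobber errno */

	(void)signum;

	for (;;) {
		siginfo_t info = { 0 };

		if (waitid(P_ALL, 0, &info, WEXITED | WNOHANG) || !info.si_pid)
			break;

		/* handle info.si_pid / info.si_status here */
	}

	errno = errno_save;		/* restore before returning */
}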
When invoking pasta without any arguments, it's difficult
to tell whether we are in the new namespace or not, leaving
users a bit confused. This change prefixes the hostname with
"pasta-" to make it a bit more obvious.
Signed-off-by: Danish Prakash <contact@danishpraka.sh>
[sbrivio: coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
- remove duplicated 'the' in the 'Services' section
Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...instead of the latest author for contrib/fedora.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie. Typically they
will lie on frame boundaries, because that's how qemu will send them, but
we can't rely on it.
Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket. This handles a number of (unlikely) cases
which previously would not be dealt with correctly:
* If qemu sent a partial frame then waited some time before sending the
remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
doing the partial frame handling, which would put us out of sync with
the stream from qemu
* If the blocking recv() only received some of the remainder of the
frame, not all of it, we'd return leaving us out of sync with the
stream again
Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This
is probably acceptable because it's an unlikely case in practice. If
necessary we could mitigate this by using a true ring buffer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself. As far as I can tell,
frames can be any length, with no particular alignment requirement. This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.
Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading. Some platforms will generate a fatal trap on
such an unaligned load. Even if they don't, casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.
Introduce a new helper to safely load a possibly unaligned value here. We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads. If that turns out not
to be the case, we can look at improvements then.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
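One common shape for such a helper, using memcpy() so no alignment is
assumed (the name here is illustrative):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Read the 32-bit frame length at an arbitrary offset in the receive
 * buffer: memcpy() into a local avoids the unaligned (and undefined)
 * direct load, and compilers turn it into a single load where that's
 * cheap.
 */
static uint32_t frame_len(const char *p)
{
	uint32_t l2len_be;

	memcpy(&l2len_be, p, sizeof(l2len_be));
	return ntohl(l2len_be);		/* length is in network order */
}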
Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket. It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling. That consideration
doesn't apply to the way we handle the qemu socket however.
Furthermore, using EPOLLET causes additional complications:
1) We don't set EPOLLET when opening /dev/net/tun for pasta mode; however,
we *do* set it when using pasta mode with --fd. This inconsistency
doesn't seem to have broken anything, but it's odd.
2) EPOLLET requires that tap_handler_passt() loop until all data available
is read (otherwise we may have data in the buffer but never get an event
causing us to read it). We do that with a rather ugly goto.
Worse, our condition for that goto appears to be incorrect. We'll only
loop if rem is non-zero, which will only happen if we perform a blocking
recv() for a partially received frame. We'll only perform that second
recv() if the original recv() resulted in a partially read frame. As
far as I can tell the original recv() could end on a frame boundary
(never triggering the second recv()) even if there is additional data in
the socket buffer. In that circumstance we wouldn't goto redo and could
leave unprocessed frames in the qemu socket buffer indefinitely.
This doesn't seem to have caused any problems in practice, but since
there's no obvious reason to use EPOLLET here anyway, we might as well
get rid of it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we receive a too-short or too-long frame from the QEMU socket, currently
we try to skip it and carry on. That sounds sensible on first blush, but
probably isn't wise in practice. If this happens, either (a) qemu has done
something seriously unexpected, or (b) we've received corrupt data over a
Unix socket. Or more likely (c), we have a bug elsewhere which has put us
out of sync with the stream, so we're trying to read something that's not a
frame length as a frame length.
Neither (b) nor (c) is really salvageable with the same stream. Case (a)
might be ok, but we can no longer be confident qemu won't do something else
we can't cope with.
So, instead of just skipping the frame and trying to carry on, log an error
and close the socket. As a bonus, establishing firm bounds on l2len early
will allow simplifications to how we deal with the case where a partial
frame is recv()ed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we get an error on recv() from the QEMU socket, we currently don't
print any kind of error. Although this can happen in a non-fatal situation
such as a guest restarting, it's unusual enough that we really should
report something for debuggability.
Add an error message in this case. Also always report when the qemu
connection closes for any reason, not just when it will cause us to exit
(--one-off).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We report relative timestamps in logs, so we want to avoid jumps in
the system time.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...not just for debug messages. Otherwise, timestamps in the log file
are consistent but the starting point is not zero.
Do this right away as we enter main(), so that the resulting
timestamps are relative to a point as close as possible to when we start.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For some reason, in commit 01efc71ddd25 ("log, conf: Add support for
logging to file"), I added calculations for relative logging
timestamps using the difference for the seconds part only, without
accounting for the fractional part.
Fix that by storing the initial timestamp, log_start, as a timespec
struct, and by calculating the difference from the starting time. Do
this in a macro as we need the same format in a few places.
To calculate the difference, turn the existing timespec_diff_ms() to
microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to
use that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
systemd-resolved has the rather strange behaviour of listening on the
non-standard loopback address 127.0.0.53. Various changes we've made in
passt mean that we now usually work fine on a host using systemd-resolved.
However our tests still fail in this case. We have a special case for when
the guest's resolv.conf needs to differ from the host's because the
resolver is on a host loopback address. However, we only consider the case
where the host resolver is on 127.0.0.1, not other loopback addresses.
Correct this with a different test condition.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
passt/pasta has options to redirect DNS requests from the guest to a
different server address on the host side. Currently, however, only UDP
packets to port 53 are considered "DNS requests". This ignores DNS
requests over TCP - less common, but certainly possible. It also ignores
encrypted DNS requests on port 853.
Extend the DNS forwarding logic to handle both of those cases.
Link: https://github.com/containers/podman/issues/23239
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
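Conceptually, the match becomes something like the predicate below (a
sketch; the real code also checks that the destination is the configured
resolver address):

#include <stdbool.h>
#include <stdint.h>
#include <netinet/in.h>

/* Does this flow look like a DNS request we should redirect?
 * Port 53 now matches over both UDP and TCP, and 853 covers DNS over
 * TLS.  dport is in host byte order here.
 */
static bool is_dns_port(uint8_t proto, in_port_t dport)
{
	if (proto != IPPROTO_UDP && proto != IPPROTO_TCP)
		return false;

	return dport == 53 || dport == 853;
}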
Currently, we start by handling the common case, where we don't translate
the destination address, then we modify the tgt side for the special cases.
In the process we do comparisons on the tentatively set fields in tgt,
which obscures the fact that tgt should be an essentially pure function of
ini, and risks people examining fields of tgt that are not yet initialized.
To make this clearer, do all our tests on 'ini', constructing tgt from
scratch on that basis.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Even though we don't use : as delimiter for the port, making square
brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6
literals. We want IPv6 addresses there, but some users might still
specify them out of habit.
Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for
IPv4 literals, but I had reports of users actually trying to use them
(they're accepted by many tools).
Allow square brackets for both IPv4 and IPv6 addresses; correct or
not, they're harmless anyway.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
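The tolerant parsing amounts to stripping one pair of enclosing brackets
before the usual address parsing, along these lines (illustrative helper,
not the actual conf.c code):

#include <string.h>

/* Strip one pair of enclosing square brackets, if present, so both
 * "[2001:db8::1]" and "[192.0.2.1]" are accepted like their bare forms.
 */
static void strip_brackets(char *addr)
{
	size_t len = strlen(addr);

	if (len >= 2 && addr[0] == '[' && addr[len - 1] == ']') {
		memmove(addr, addr + 1, len - 2);
		addr[len - 2] = '\0';
	}
}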
In tap_sock_unix_open(), if we have a given path for the socket from
configuration, we don't need to loop over possible paths, so we exit
the loop on the first iteration, unconditionally.
But if we failed to bind() the socket to that explicit path, we should
exit, instead of continuing. Otherwise we'll pretend we're up and
running, but nobody can contact us, and this might be mildly confusing
for users.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Starting from iperf3 version 3.16, -P / --parallel spawns multiple
clients as separate threads, instead of multiple streams serviced by
the same thread.
So we can drop our lib/test implementation that spawns several iperf3
client and server processes, and finally simplify things quite a bit.
Adjust the number of threads and UDP sending bandwidth to values that seem
to be more or less matching previous throughput tests on my setup.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Differences in allocated Acpi-Parse entries are gone (at least) since
the 6.1 Linux kernel series. I should run this on a 6.10 kernel,
eventually, and adjust things further, as needed.
Userspace symbols are also fairly different now: show whatever is more
than 1 MiB at the moment.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>