Recently, commit 4ddbcb9c0c55 ("tcp: Disable optimisations
for tcp_hash()") worked around yet another issue we hit with gcc 12
and '-flto -O2': some stores affecting the input data to siphash_20b()
were omitted altogether, and tcp_hash() wouldn't get the correct hash
for incoming connections.
Digging further into this revealed that, at least according to gcc's
interpretation of C99 aliasing rules, passing pointers to functions
with types different from the effective type of the object (for
example, a uint8_t pointer to an anonymous struct, as happens in
tcp_hash()) doesn't guarantee that stores are not reordered across
the function call.
This means that, in general, our checksum and hash functions might
not see parts of the input data that callers intended to provide.
Not even switching from uint8_t to character types, which should be
appropriate here, according to C99 (ISO/IEC 9899, TC3, draft N1256),
section 6.5, "Expressions", paragraph 7:
An object shall have its stored value accessed only by an lvalue
expression that has one of the following types:
[...]
— a character type.
does the trick. I guess this is also subject to interpretation:
casting passed pointers to character types, and then using those as
different types, might still violate (dubious) aliasing rules.
Disable gcc strict aliasing rules for potentially affected functions,
which, in turn, disables gcc's Type-Based Alias Analysis (TBAA)
optimisations based on those function arguments.
Drop the existing workarounds. Also, the (seemingly?) bogus
'maybe-uninitialized' warning on the tcp_tap_handler() > tcp_hash() >
siphash_20b() path goes away with -fno-strict-aliasing on
siphash_20b().
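
For illustration, a minimal sketch of the gcc mechanism this relies on
(the function below is a made-up example, not the actual checksum or
hash code):

  #include <stddef.h>
  #include <stdint.h>

  /* Sketch only: per-function opt-out from strict-aliasing (TBAA)
   * optimisations, leaving the global build flags untouched.
   */
  __attribute__((optimize("-fno-strict-aliasing")))
  static uint32_t sum_bytes(const void *buf, size_t len)
  {
          const uint8_t *p = buf;
          uint32_t sum = 0;
          size_t i;

          for (i = 0; i < len; i++)
                  sum += p[i];

          return sum;
  }
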
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
cppcheck 2.10 reports:
tcp.c:1815:12: style: Condition 'wnd>prev_scaled' is always false [knownConditionTrueFalse]
if ((wnd > prev_scaled && wnd * 99 / 100 < prev_scaled) ||
^
tcp.c:1808:8: note: Assignment 'wnd=((1<<(16+8))<(wnd))?(1<<(16+8)):(wnd)', assigned value is less than 1
wnd = MIN(MAX_WINDOW, wnd);
^
tcp.c:1811:19: note: Assuming condition is false
if (prev_scaled == wnd)
^
tcp.c:1815:12: note: Condition 'wnd>prev_scaled' is always false
if ((wnd > prev_scaled && wnd * 99 / 100 < prev_scaled) ||
^
but this is not actually the case: wnd is typically greater than 1,
and might very well be greater than prev_scaled as well.
I bisected this down to cppcheck commit b4d455df487c ("Fix 11349: FP
negativeIndex for clamped array index (#4627)") and reported findings
at https://github.com/danmar/cppcheck/pull/4627.
Suppress the warning for the time being.
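
For reference, a hedged sketch of cppcheck's inline suppression form,
which needs the --inline-suppr option (the tree might use a
command-line or file-based suppression instead, and the condition
below is a simplified stand-in for the one in tcp.c):

  /* Simplified stand-in for the reported code, for illustration only */
  static unsigned int window_update(unsigned int wnd,
                                    unsigned int prev_scaled)
  {
          /* cppcheck-suppress knownConditionTrueFalse */
          if (wnd > prev_scaled && wnd * 99 / 100 < prev_scaled)
                  return prev_scaled;

          return wnd;
  }
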
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
openlog() can be used to set "ident" and have all the log messages
prefixed by it, but only if we call syslog() -- the prefixing is
implemented by the C library.
We don't log messages with syslog(), though, as we have a custom
implementation to ensure we don't need dynamic memory allocation.
This means that calling openlog() is perfectly useless, and that we
have to prefix every message we log with the identifier on our own.
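
A rough sketch of what that amounts to, with illustrative names
(log_ident, log_line) and output path, not the actual logging code:

  #include <stdarg.h>
  #include <stdio.h>

  static const char *log_ident = "passt";   /* hypothetical variable */

  /* Illustrative only: prepend the identifier ourselves, as our
   * syslog() replacement won't do it for us.
   */
  static void log_line(const char *fmt, ...)
  {
          char buf[4096];
          va_list ap;
          int n;

          n = snprintf(buf, sizeof(buf), "%s: ", log_ident);
          va_start(ap, fmt);
          vsnprintf(buf + n, sizeof(buf) - n, fmt, ap);
          va_end(ap);

          fprintf(stderr, "%s\n", buf);
  }
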
Reported-by: Andrea Bolognani <abologna@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Andrea reports that with a Fedora 37 guest running on a Fedora 37
host, both using systemd-resolved, with passt connecting them,
running with default options, DNS queries don't work.
systemd-resolved on the host is reachable only at the loopback
address 127.0.0.53.
We advertise the default gateway address to the guest as resolver,
because our local address is of course unreachable from there, which
means we see DNS queries directed to the default gateway, and we
redirect them to 127.0.0.1. However, systemd-resolved doesn't answer
on 127.0.0.1.
To fix this, set @dns_match to the address of the default gateway,
unless a different resolver address is explicitly configured, so that
we know we have to explicitly map DNS queries, in this case, to the
address of the local resolver.
This means that in udp_tap_handler() we need to check, first, if
the destination address of packets matches @dns_match: even if it's
the address of the local gateway, we want to map that to a specific
address, which isn't necessarily 127.0.0.1.
Do the same for IPv6 for consistency, even though IPv6 defines a
single loopback address.
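
A rough sketch of the resulting check order, with hypothetical
structure and field names (dns_match, dns_host, gw) standing in for
the actual ones:

  #include <arpa/inet.h>
  #include <netinet/in.h>

  /* Hypothetical context layout, for illustration only */
  struct ip4_fwd {
          struct in_addr dns_match;  /* guest-visible resolver address */
          struct in_addr dns_host;   /* resolver we actually talk to */
          struct in_addr gw;         /* default gateway */
  };

  static struct in_addr udp_map_dst4(const struct ip4_fwd *f,
                                     struct in_addr dst)
  {
          /* Check @dns_match first: even if it equals the gateway
           * address, DNS traffic must reach the real local resolver
           * (e.g. 127.0.0.53), not just 127.0.0.1.
           */
          if (f->dns_match.s_addr && dst.s_addr == f->dns_match.s_addr)
                  return f->dns_host;

          if (dst.s_addr == f->gw.s_addr)
                  dst.s_addr = htonl(INADDR_LOOPBACK);

          return dst;
  }
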
Reported-by: Andrea Bolognani <abologna@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The logic handling which resolvers we add, and whether to add them,
is getting rather cramped in get_dns(): split it into separate
functions.
No functional changes intended.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Instead of the address of the first resolver we advertise to
the guest or namespace.
This was one of the intentions behind commit 3a2afde87dd1 ("conf,
udp: Drop mostly duplicated dns_send arrays, rename related fields"),
but I forgot to implement this part. In practice, they are usually
the same thing, unless /etc/resolv.conf points to a loopback address.
Fixes: 3a2afde87dd1 ("conf, udp: Drop mostly duplicated dns_send arrays, rename related fields")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I'm not sure if we're breaking some aliasing rule here, but with gcc
12.2.1 on x86_64 and -flto, the siphash_20b() call in tcp_hash()
doesn't see the connection address -- it gets all zeroes instead.
Fix this temporarily by disabling optimisations for tcp_hash().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is needed by the new functions in isolate.c: add the
corresponding rule.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Instead of restricting PID files to /var/run/passt.pid, which is a
single file and unlikely to be used, use the user_tmp_t type, which
should cover any reasonable need.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Services running passt will commonly need to transition to its
domain, terminate it, connect and write to its socket.
The init_daemon_domain() macro now defines the default transition to
the passt_t domain, using the passt_exec_t type.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is an example interface, currently unused, so it went undetected:
m4 macros need a backtick at the beginning of a block instead of a
single quote.
Fixes: 1f4b7fa0d75d ("passt, pasta: Add examples of SELinux policy modules")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Ouch, I accidentally pushed the previous change without running the
tests:
- we need to check, in conf_ports(), that udp_sock_init()
managed to bind at least one port, not the opposite
- for -T and -U, we have no way to know if we'll manage to bind
the port later, so never report an error for those
Fixes: 3d0de2c1d727 ("conf, tcp, udp: Exit if we fail to bind sockets for all given ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The newly introduced die() calls exit(), but cppcheck doesn't see it
and warns about possibly invalid arguments used after the check which
triggers die(). Add return statements to silence the warnings.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This goto exists purely to move this exception case out of line. Although
that does make the "normal" path a little clearer, it comes at the cost of
not knowing where control will flow after jumping to the zero_len
label. The exceptional case isn't that long, so just put it inline.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This goto can be handled just as simply and more clearly with a do while.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The goto here really doesn't improve clarity or brevity at all. Use a
clearer construct.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tap_send_frames() we send a number of frames to the tap device, then
also write them to the pcap capture file (if configured). However the tap
send can partially fail (short write()s or similar), meaning that some
of the requested frames weren't actually sent, but we still write those
frames to the capture file.
We do print a debug message in this case, but it's misleading to add
frames to the capture file that we know weren't sent. Rework to avoid this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Based on a patch from Laine, and reports from Laine and Yalan: fix
the "22-80:32-90" example, and improve wording for the other ones:
instead of using "to" to denote the end of a range, use "between ...
and", so that it's clear we're *not* referring to target ports.
Reported-by: Laine Stump <laine@redhat.com>
Reported-by: Yalan Zhang <yalzhang@redhat.com>
Fixes: da20f57f19dc ("passt, qrap: Add man pages")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Similar to the conf_print() fix from commit 4129764ecaeb ("conf: Fix
mask calculation from prefix_len in conf_print()"): to calculate an
IPv4 netmask from the prefix length, we need to left-shift a 32-bit
all-ones value by 32 minus the prefix length -- not by the prefix
length itself.
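
As a worked illustration (a sketch with generic names, not the actual
conf.c code):

  #include <arpa/inet.h>
  #include <netinet/in.h>

  /* Shifting a 32-bit value by 32 is undefined, so treat a zero prefix
   * length separately.
   */
  static struct in_addr mask_from_prefix(unsigned int prefix_len)
  {
          struct in_addr mask;

          mask.s_addr = prefix_len ? htonl(~0U << (32 - prefix_len)) : 0;

          return mask;
  }

  /* Example: prefix_len 24 gives ~0U << 8, that is 255.255.255.0, while
   * shifting by the prefix length itself (~0U << 24) would wrongly give
   * 255.0.0.0.
   */
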
Reported-by: Yalan Zhang <yalzhang@redhat.com>
Fixes: dd09cceaee21 ("Minor improvements to IPv4 netmask handling")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
David points out that using multiple counters to go over the iov
array, namely 'i' and 'iov', makes mistakes easier. We can't just use
'iov', unless we reserve an element with zero iov_len at the end,
which isn't really justified.
Simply use 'i' to iterate over the array.
Link: https://archives.passt.top/passt-dev/Y+mfenvLn3VJ7Dg5@yekko/
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
passt supports ranges of forwarded ports as well as 'all' for TCP and
UDP, so it might be convenient to proceed if we fail to bind only
some of the desired ports.
But if we can't bind even a single port for a given specification,
we're clearly, and unexpectedly, conflicting with another network
service. In that case, report failure and exit.
Reported-by: Yalan Zhang <yalzhang@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Now that logging functions force printing messages to stderr before
passt forks to background, we'll have duplicate messages when running
from an interactive terminal, or if --stderr is passed, because at
some point we set LOG_PERROR in our __openlog() wrapper.
We could defer setting LOG_PERROR, but that would change option
semantics in other, unexpected ways. We could force calling
passt_vsyslog() as long as the mask is set to LOG_EMERG, but that
complicates the logic in logging functions even further.
Go the easy way for now: don't force printing to stderr with
LOG_EMERG if LOG_PERROR is already set. We should seriously consider a
rework of those logging functions at this point.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
This actually leaves us with 0 uses of err(), but someone could want
to use it in the future, so we may as well leave it around.
Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...and return void to simplify the caller.
Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Again, it can then be made to return void, simplifying the caller.
Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As with conf_ports, this allows us to make the function return void.
Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Rather than having conf_ports() (possibly) log an error, and then
letting the caller log the entire usage() message and exit, just log
the errors and exit immediately (using die()).
For some errors, conf_ports would previously not log any specific
message, leaving it up to the user to determine the problem by
guessing. We replace all of those silent returns with die()
(logging a specific error), thus permitting us to make conf_ports()
return void, which simplifies the caller.
While modifying the two callers to conf_ports() to not check for a
return value, we can further simplify the code by removing the check
for a non-null optarg, as that is guaranteed to never happen (due to
prior calls to getopt_long() with "argument required" for all relevant
options - getopt_long() would have already caught this error).
Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Nearly all of the calls to usage() in conf() occur immediately after
logging a more detailed error message, and the fact that these errors
are occurring indicates that the user has already seen the passt usage
message (or read the manpage). Spamming the logfile with the complete
contents of the usage message serves only to obscure the more detailed
error message. The only time when the full usage message should be output
is if the user explicitly asks for it with -h (or its synonyms).
As a start to eliminating the excessive calls to usage(), this patch
replaces most calls to err() followed by usage() with a call to die()
instead. A few other usage() calls remain, but their removal involves a
bit more nuance that should be properly explained in separate commit
messages.
Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Almost all occurrences of err() are either immediately followed by
exit(EXIT_FAILURE) or by usage(argv[0]) (which itself then calls
exit(EXIT_FAILURE)), or that is what's done immediately after returning
from the function that calls err(). Modify the errfn macro so that its
instantiations can include exit(EXIT_FAILURE) at the end, and use that
to create a new function die() that will log an error and then
exit.
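
A hedged sketch of the shape this takes; the macro body below only
prints to stderr, to show where the added exit fits, while the real
errfn also deals with syslog and log files:

  #include <stdarg.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define errfn(name, doexit)                              \
  static void name(const char *format, ...) {              \
          va_list ap;                                       \
          va_start(ap, format);                             \
          vfprintf(stderr, format, ap);                     \
          fputc('\n', stderr);                              \
          va_end(ap);                                       \
          if (doexit)                                       \
                  exit(EXIT_FAILURE);                       \
  }

  errfn(err, 0)
  errfn(die, 1)
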
Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Once a log file (specified on the commandline) is opened, the logging
functions will stop sending error logs to stderr, which is annoying if
passt has been started by another process that wants to collect error
messages from stderr so it can report why passt failed to start. It
would be much nicer if passt continued sending all log messages to
stderr until it daemonizes itself (at which point the process that
started passt can assume that it was started successfully).
The system log mask is set to LOG_EMERG when the process starts, and
we're already using that to do "special" logging during the period
from process start until the log level requested on the commandline is
processed (setting the log mask to something else). This period
*almost* matches with "the time before the process is daemonized"; if
we just delay setting the log mask a tiny bit, then it will match
exactly, and we can use it to determine if we need to send log
messages to stderr even when a log file has been specified and opened.
This patch delays the setting of the log mask until immediately before
the call to __daemon(). It also modifies logfn() slightly, so that it
will log to stderr any time the log mask is LOG_EMERG, even if a log file
has been opened.
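
A hedged sketch of the resulting check (setlogmask(0) is the standard
way to read the current mask without changing it; the real logfn()
macro is structured differently):

  #include <stdbool.h>
  #include <syslog.h>

  /* Illustrative only: before the requested log level is applied, the
   * mask is still LOG_MASK(LOG_EMERG), so also copy messages to stderr
   * even when a log file is already open.
   */
  static bool log_stderr_too(void)
  {
          return setlogmask(0) == LOG_MASK(LOG_EMERG);
  }
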
Signed-off-by: Laine Stump <laine@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Fedora 32-35 are now old enough that they're not on all mirrors. Fetch
them from the archive server instead.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The current debian cloud images no longer include ppc64. Change to using
the latest snapshot which does include ppc64.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When creating a new spliced connection, we need to get a socket in the
other ns from the originating one. To avoid excessive ns switches we
usually get these from a pool refilled on a timer. However, if the pool
runs out we need a fallback. Currently that's done by passing -1 as the
socket to tcp_splice_connect() and running it in the target ns.
This means that tcp_splice_connect() itself needs to have different cases
depending on whether it's given an existing socket or not, which is
a separate concern from what it's mostly doing. We change it to require
a suitable open socket to be passed in, and ensuring in the caller that we
have one.
This requires adding the fallback paths to the caller, tcp_splice_new().
We use slightly different approaches for a socket in the init ns versus the
guest ns.
This also means that we no longer need to run tcp_splice_connect() itself
in the guest ns, which allows us to remove a bunch of boilerplate code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_conn_new_sock() first looks for a socket in a pre-opened pool, then if
that's empty creates a new socket in the init namespace. Both parts of
this are duplicated in other places: the pool lookup logic is duplicated in
tcp_splice_new(), and the socket opening logic is duplicated in
tcp_sock_refill_pool().
Split the function into separate parts so we can remove both these
duplications.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_splice.c has some explicit extern declarations to access the
socket pools. This is pretty dangerous - if we changed the type of
these variables in tcp.c, we'd have tcp.c and tcp_splice.c using the
same memory in different ways with no compiler error. So, move the
extern declarations to tcp_conn.h so they're visible to both tcp.c and
tcp_splice.c, but not the rest of pasta.
In fact the pools for the guest namespace are necessarily only used by
tcp_splice.c - we have no sockets on the guest side if we're not
splicing. So move those declarations and the functions that deal
exclusively with them to tcp_splice.c.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With the creation of the tcp_sock_refill_pool() helper, the ns==true and
ns==false paths for tcp_sock_refill() now have almost nothing in common.
Split the two versions into tcp_sock_refill_init() and tcp_sock_refill_ns()
functions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_sock_refill() contains two near-identical loops to refill the IPv4
and IPv6 socket pools. In addition, if we get an error on the IPv4
pool we exit early and won't attempt to refill the IPv6 pool. At
least theoretically, these are independent of each other and there's
value to filling up either pool without the other. So, there's no
strong reason to give up on one because the other failed.
Address both of these with a helper function 'tcp_sock_refill_pool()' to
refill a single given pool.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Future compilers will not support implicit ints by default, causing
the probe to always fail.
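
For illustration, a hedged sketch of the implicit-int issue itself (the
actual probe lives in the Makefile and may look different):

  /* An old-style probe such as:
   *
   *     main(){}
   *
   * relies on implicit int and is rejected once compilers default to
   * C99 or later. A conforming equivalent:
   */
  int main(void)
  {
          return 0;
  }
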
Link: https://bugs.passt.top/show_bug.cgi?id=42
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
On shell 'exit' commands, running shells from pasta, we might get:
Cannot set tty process group (No such process)
as some TTY devices might be inaccessible. This is harmless, but
after commit "pasta: propagate exit code from child command", we'll
get test failures there, at least with dash.
Ignore those explicitly with an ugly workaround: we can't simply do
something like:
exit || :
because the failure is reported by the shell itself once it exits,
regardless of the command evaluation.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Exit codes are very useful for scripts: when the pasta child execvp()
call fails with ENOENT, the parent should also exit with a code > 0. In
short, the parent should always exit with the code from the child to
make it useful in scripts.
It is easy to test with: `pasta -- bash -c "exit 3"; echo $?`
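
A hedged sketch of the idea (names are illustrative, and the
128 + signal convention is an assumption, not necessarily what pasta
does):

  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  /* Illustrative only: mirror the child's exit status in the parent */
  static void exit_with_child_status(pid_t pid)
  {
          int status;

          if (waitpid(pid, &status, 0) < 0)
                  exit(EXIT_FAILURE);

          if (WIFEXITED(status))
                  exit(WEXITSTATUS(status));

          exit(WIFSIGNALED(status) ? 128 + WTERMSIG(status) : EXIT_FAILURE);
  }
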
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
By default, clone() will create a child that does not send SIGCHLD when
the child exits. The caller has to specify the signal it should get in
the flags bitmask.
See clone(2) under "The child termination signal".
This fixes the problem where pasta would not exit when the execvp()
call failed, i.e. when the command does not exist.
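
A hedged sketch of the detail in question (stack handling and the
namespace flags pasta actually passes are omitted; names are
illustrative):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <signal.h>
  #include <sys/types.h>

  /* Illustrative only: OR the termination signal (SIGCHLD) into the
   * clone() flags, otherwise the parent is never notified when the
   * child exits.
   */
  static pid_t spawn_child(int (*fn)(void *), void *stack_top,
                           int ns_flags, void *arg)
  {
          return clone(fn, stack_top, ns_flags | SIGCHLD, arg);
  }
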
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When spawning a child command with 'pasta command...', pasta should not
leak fds that it opened. Only the fds that were already open should be
given to the child.
Run `pasta --config-net -- ls -l /proc/self/fd` from a terminal where
only stdin/out/err are open. Fd 3 was opened by ls to read the
/proc/self/fd dir, but fd 5 is the netlink socket that was opened in
pasta. To prevent such a leak, we now open the socket with SOCK_CLOEXEC.
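
A minimal sketch of the change (a hedged illustration; the real netlink
setup also binds the socket and differs in detail):

  #include <sys/socket.h>
  #include <linux/netlink.h>

  /* Illustrative only: with SOCK_CLOEXEC, the netlink fd is closed on
   * execvp() of the child command instead of leaking into it.
   */
  static int nl_sock_open(void)
  {
          return socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
  }
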
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are some places in passt/pasta which #include <assert.h> and make
various assertions. If we hit these, something has already gone wrong, but
they're there so that we get a useful message instead of cryptic
misbehaviour if assumptions we thought were correct turn out not to be.
Except... the glibc implementation of assert() uses syscalls that aren't in
our seccomp filter, so we'll get a SIGSYS before it actually prints the
message. Work around this by adding our own ASSERT() implementation using
our existing err() function to log the message, and an abort(). The
abort() probably also won't work exactly right with seccomp, but once we've
printed the message, dying with a SIGSYS works just as well as dying with
a SIGABRT.
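
A hedged sketch of such a macro (the message format and the exact
logging call are assumptions; only the log-then-abort() structure comes
from the description above):

  #include <stdlib.h>

  void err(const char *format, ...);   /* the existing logging function */

  /* Illustrative only */
  #define ASSERT(expr)                                              \
          do {                                                      \
                  if (!(expr)) {                                    \
                          err("ASSERTION FAILED in %s (%s:%d): %s", \
                              __func__, __FILE__, __LINE__, #expr); \
                          abort();                                  \
                  }                                                 \
          } while (0)
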
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
David reports that TCP transfers might stall, especially with smaller
socket buffer sizes, because we reset the ACK_FROM_TAP_DUE flag, in
tcp_tap_handler(), whenever we receive an ACK segment, regardless of
its sequence number and the fact that we might still be waiting for
one. This way, we might fail to re-transmit frames on ACK timeouts.
We need, instead, to:
- indicate with the @retrans field only re-transmissions for the same
data sequences. If we make progress, it should be reset, given that
it's used to abort a connection when we exceed a given number of
re-transmissions for the same data
- unset the ACK_FROM_TAP_DUE flag if and only if the acknowledged
sequence is the same as the last one we sent, as suggested by David
- keep it set otherwise, if progress was made but not all the data we
sent was acknowledged, and update the expiration of the ACK timeout
Add a new helper for these purposes, tcp_update_seqack_from_tap().
To extend the ACK timeout, the new helper sets the ACK_FROM_TAP_DUE
flag, even if it was already set, and conn_flag_do() triggers a timer
update. This part should be revisited at a later time, because,
strictly speaking, ACK_FROM_TAP_DUE isn't a flag anymore. One
possibility might be to introduce another connection attribute for
events affecting timer deadlines.
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Link: https://bugs.passt.top/show_bug.cgi?id=41
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: be5bbb9b0681 ("tcp: Rework timers to use timerfd instead of periodic bitmap scan")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...instead of repeatedly sending out the first one in iov.
Fixes: e21ee41ac35a ("tcp: Combine two parts of pasta tap send path together")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Adapted from a patch by Paul Holzinger: when pasta spawns a command,
operating without a pre-existing user and network namespace, it needs
to wait for the tap device to be configured and its handler ready,
before the command is actually executed.
Otherwise, something like:
pasta --config-net nslookup passt.top
usually fails as the nslookup command is issued before the network
interface is ready.
We can't adopt a simpler approach based on SIGSTOP and SIGCONT here:
the child runs in a separate PID namespace, so it can't send SIGSTOP
to itself, as the kernel sees the child as the init process and blocks
the delivery of the signal.
We could send SIGSTOP from the parent, but this wouldn't avoid the
possible condition where the child isn't ready to wait for it when
the parent sends it, also raised by Paul -- and SIGSTOP can't be
blocked, so it can never be pending.
Use SIGUSR1 instead: mask it before clone(), so that the child starts
with it blocked, and can safely wait for it. Once the parent is
ready, it sends SIGUSR1 to the child. If SIGUSR1 is sent before the
child is waiting for it, the kernel will queue it for us, because
it's blocked.
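
A hedged sketch of the signal handling described above (clone() and the
tap setup are omitted; function names are illustrative):

  #include <signal.h>

  /* Illustrative only: block SIGUSR1 in the parent before clone(), so
   * the child inherits it blocked and an early SIGUSR1 from the parent
   * stays queued until the child waits for it.
   */
  static void block_go_signal(sigset_t *set)
  {
          sigemptyset(set);
          sigaddset(set, SIGUSR1);
          sigprocmask(SIG_BLOCK, set, NULL);
  }

  /* In the child, right before executing the command: */
  static void wait_for_go(const sigset_t *set)
  {
          int sig;

          sigwait(set, &sig);
  }
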
Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 1392bc5ca002 ("Allow pasta to take a command to execute")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To send frames on the tap interface, the UDP code uses fairly complicated
two-level batching. First, multiple frames are gathered into a single
"message" for the qemu stream socket, then multiple messages are sent with
sendmmsg(). We now have tap_send_frames() which already deals with sending
a number of frames, including batching and handling partial sends. Use
that to considerably simplify things.
This does make a couple of behavioural changes:
* We used to split messages to keep them under 32kiB (except when a
single frame was longer than that). The comments claim this is
needed to stop qemu from closing the connection, but we don't have any
equivalent logic for TCP. I wasn't able to reproduce the problem with
this series, although it was apparently easy to reproduce earlier.
My suspicion is that there was never an inherent need to keep messages
small; however, with larger messages (and default kernel buffer sizes)
the chances of needing more than one resend for partial send()s are
greatly increased. We previously didn't handle that case of multiple
resends correctly, but now we do.
* Previously when we got a partial send on UDP, we would resend the
remainder of the entire "message", including multiple frames. The
common code now only resends the remainder of a single frame, simply
dropping any frames which weren't even partially sent. This is what
TCP always did and is probably a better idea for UDP too.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>