passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	tcp: Don't reset ACK_TO_TAP_DUE on any ACK, reschedule timer as needed2023_03_21.1ee2f7c	Stefano Brivio	2023-03-21	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is mostly symmetric with commit cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer"): we shouldn't reset the ACK_TO_TAP_DUE flag on any inbound ACK segment, but only once we acknowledge everything we received from the guest or the container. If we don't, a client might unnecessarily hold off further data, especially during slow start, and in general we won't converge to the usable bandwidth. This is very visible especially with traffic tests on links with non-negligible latency, such as in the reported issue. There, a public iperf3 server sometimes aborts the test due do what appears to be a low iperf3's --rcv-timeout (probably less than a second). Even if this doesn't happen, the throughput will converge to a fraction of the usable bandwidth. Clear ACK_TO_TAP_DUE if we acknowledged everything, set it if we didn't, and reschedule the timer in case the flag is still set as the timer expires. While at it, decrease the ACK timer interval to 10ms. A 50ms interval is short enough for any bandwidth-delay product I had in mind (local connections, or non-local connections with limited bandwidth), but here I am, testing 1gbps transfers to a peer with 100ms RTT. Indeed, we could eventually make the timer interval dependent on the current window and estimated bandwidth-delay product, but at least for the moment being, 10ms should be long enough to avoid any measurable syscall overhead, yet usable for any real-world application. Reported-by: Lukas Mrtvy <lukas.mrtvy@gmail.com> Link: https://bugs.passt.top/show_bug.cgi?id=44 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: When a connection flag it set, don't negate it for debug print	Stefano Brivio	2023-03-21	1	-1/+1
\| \| \| \| \| \| \| \| \|	Fix a copy and paste typo I added in commit 5474bc5485d8 ("tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls()") and --debug altogether. Fixes: 5474bc5485d8 ("tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls()") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Fix false positive if cppcheck doesn't give a false positive	David Gibson	2023-03-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	da46fdac "tcp: Suppress knownConditionTrueFalse cppcheck false positive" introduced a suppression to work around a cppcheck bug causing a false positive warning. However, the suppression will itself cause a spurious unmatchedSuppression warning if used with a version of cppcheck from before the bug was introduced. That includes the packaged version of cppcheck in Fedora. Suppress the unmatchedSuppression as well. Fixes: da46fdac3605 ("tcp: Suppress knownConditionTrueFalse cppcheck false positive") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Work around weird false positives with cppcheck-2.9.1	David Gibson	2023-03-21	6	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 89e38f55 "treewide: Fix header includes to build with musl" added extra #includes to work with musl. Unfortunately with the cppcheck version I'm using (cppcheck-2.9-1.fc37.x86_64 in Fedora 37) this causes weird false positives: specifically cppcheck seems to hit a #error in <bits/unistd.h> complaining about including it directly instead of via <unistd.h> (which is not something we're doing). I have no idea why that would be happening; but I'm guessing it has to be a bug in the cpp implementation in that cppcheck version. In any case, it's possible to work around this by moving the include of <unistd.h> before the include of <signal.h>. So, do that. Fixes: 89e38f55405d ("treewide: Fix header includes to build with musl") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Actually bind detected namespace ports in init namespace	Stefano Brivio	2023-03-21	1	-1/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When I reworked udp_init() to move most of the port binding logic to conf_ports, I accidentally dropped this bit of automatic port detection (and binding) at start-up. On -U auto, in pasta mode, udp_sock_init_ns() binds ports in the namespace that correspond to ports bound in the init namespace, but on -u auto, nothing actually happens after port detection. Add udp_sock_init_init() to deal with this, and while at it fix the comment to udp_sock_init_ns(): the latter takes care of outbound "connections". This is currently not covered by tests, and the UDP port needs to be already bound in the namespace when pasta starts (periodic detection for UDP is a missing feature at the moment). It can be checked like this: $ unshare -rUn # echo $$ 590092 # socat -u UDP-LISTEN:5555 STDOUT $ pasta -q -u auto 590092 $ echo "test" \| socat -u STDIN UDP:localhost:5555 Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 3c6ae625101a ("conf, tcp, udp: Allow address specification for forwarded ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: fix tcp port forwarding in auto mode	Paul Holzinger	2023-03-21	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The logic in tcp_timer() was inverted. fwd_out should expose the host ports in the ns. Therfore it must read the ports on the host and then bind them in the netns. The same for fwd_in which checks ports in the ns and then exposes them on the host. Note that this only fixes tcp ports, udp does not seems to work at all right now with the auto mode. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Fixes: 1128fa03fe73 ("Improve types and names for port forwarding configuration") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fedora: Refresh SELinux labels in scriptlets, require -selinux package2023_03_17.dd23496	Stefano Brivio	2023-03-17	1	-5/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of: https://fedoraproject.org/wiki/SELinux_Policy_Modules_Packaging_Draft follow this: https://fedoraproject.org/wiki/PackagingDrafts/SELinux_Independent_Policy which seems to make more sense and fixes the issue that, on a fresh install, without a reboot, the file contexts for the binaries are not actually updated. In detail: - labels are refreshed using the selinux_relabel_pre and selinux_relabel_post on install, upgrade, and uninstall - use the selinux_modules_install and selinux_modules_uninstall macros, instead of calling 'semodule' directly (no functional changes in our case) - require the -selinux package on SELinux-enabled environments and if the current system policy is "targeted" Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Makefile: Enable external override for TARGET	Stefano Brivio	2023-03-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A cross-architecture build might pass a target-specific CC on 'make', and not on 'make install', and this is what happens in Debian cross-qa tests. Given that we select binaries to be installed depending on the target architecture, this means we would build AVX2 binaries in any case on a x86_64 build machine. By overriding TARGET in package build rules, we can tell the Makefile about the target architecture, also for the 'install' (Makefile) target. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt.1: Fix description of --mtu option	Stefano Brivio	2023-03-17	1	-2/+4
\| \| \| \| \| \| \| \|	By default, 65520 bytes are advertised, and zero disables DHCP and NDP options. Fixes: ec2b58ea4dc4 ("conf, dhcp, ndp: Fix message about default MTU, make NDP consistent") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Avoid time_t/__syscall_slong_t format mismatch with long int on X32 ABI	Stefano Brivio	2023-03-17	1	-10/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On X32 (ILP32 using AMD64 system call ABI) and glibc, struct timespec::tv_nsec is __syscall_slong_t and not a long int, see also https://sourceware.org/bugzilla/show_bug.cgi?id=16437 and timespec(3type). Fine, we could cast that down to long and be done with it. But it turns out that also time_t (not guaranteed to be equivalent to any type) is a long long int, and there we can't downcast. To keep it simple, cast both to long long int, and change formats to %lli, to avoid format warnings from gcc. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	fedora: Install SELinux interface files to shared include directory2023_03_10.70c0765	Stefano Brivio	2023-03-10	1	-0/+4
\| \| \| \| \|	Link: https://github.com/fedora-selinux/selinux-policy/pull/1613 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	contrib/selinux: Split interfaces into smaller bits	Stefano Brivio	2023-03-10	1	-10/+61
\| \| \| \| \| \| \|	...to fit accepted Fedora practices. Link: https://github.com/fedora-selinux/selinux-policy/pull/1613 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	contrib/selinux: Drop unused passt_read_data() interface	Stefano Brivio	2023-03-10	1	-8/+0
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	contrib/selinux: Drop "example" from headers: this is the actual policy	Stefano Brivio	2023-03-10	6	-6/+6
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Update Features section, plus minor improvements2023_03_09.7c7625d	Stefano Brivio	2023-03-09	1	-47/+28
\| \| \| \| \| \| \|	...it's been a while. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	contrib: Drop libvirt out-of-tree patch, integration mostly works in 9.1.0	Stefano Brivio	2023-03-09	1	-427/+0
\| \| \| \| \| \| \| \|	...and in any case, this patch doesn't offer any advantage over the current upstream integration. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	contrib: Drop QEMU out-of-tree patches	Stefano Brivio	2023-03-09	2	-208/+0
\| \| \| \| \| \| \|	Native support was introduced with commit 13c6be96618c, QEMU 7.2. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	contrib: Drop Podman out-of-tree patch, integration is upstream now	Stefano Brivio	2023-03-09	1	-605/+0
\| \| \| \| \| \| \| \|	See https://github.com/containers/podman/pull/16141, shipped in Podman 4.4. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Clamp MSS value when queueing data to tap, also for pasta	Stefano Brivio	2023-03-09	1	-14/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Tom reports that a pattern of repated ~1 MiB chunks downloads over NNTP over TLS, on Podman 4.4 using pasta as network back-end, results in pasta taking one full CPU thread after a while, and the download never succeeds. On that setup, we end up re-sending the same frame over and over, with a consistent 65 534 bytes size, and never get an acknowledgement from the tap-side client. This only happens for the default MTU value (65 520 bytes) or for values that are slightly smaller than that (down to 64 499 bytes). We hit this condition because the MSS value we use in tcp_data_from_sock(), only in pasta mode, is simply clamped to USHRT_MAX, and not to the actual size of the buffers we pre-cooked for sending, which is a bit less than that. It looks like we got away with it until commit 0fb7b2b9080a ("tap: Use different io vector bases depending on tap type") fixed the setting of iov_len. Luckily, since it's pasta, we're queueing up to two frames at a time, so the worst that can happen is a badly segmented TCP stream: we always have some space at the tail of the buffer. Clamp the MSS value to the appropriate maximum given by struct tcp{4,6}_buf_data_t, no matter if we're running in pasta or passt mode. While at it, fix the comments to those structs to reflect the current struct size. This is not really relevant for any further calculation or consideration, but it's convenient to know while debugging this kind of issues. Thanks to Tom for reporting the issue in a very detailed way and for providing a test setup. Reported-by: Tom Mombourquette <tom@devnode.com> Link: https://github.com/containers/podman/issues/17703 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Terminate on EMFILE or ENFILE on sockets for port mapping	Stefano Brivio	2023-03-09	1	-7/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In general, we don't terminate or report failures if we fail to bind to some ports out of a given port range specifier, to allow users to conveniently specify big port ranges (or "all") without having to care about ports that might already be in use. However, running out of the open file descriptors quota is a different story: we can't do what the user requested in a very substantial way. For example, if the user specifies '-t all' and we can only bind 1024 sockets, the behaviour is rather unexpected. Fail whenever socket creation returns -ENFILE or -EMFILE. Link: https://bugs.passt.top/show_bug.cgi?id=27 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, udp: Fix partial success return codes in {tcp,udp}_sock_init()	Stefano Brivio	2023-03-09	2	-28/+23
\| \| \| \| \| \| \| \| \| \| \| \|	The comments say we should return 0 on partial success, and an error code on complete failure. Rationale: if the user configures a port forwarding, and we succeed to bind that port for IPv4 or IPv6 only, that might actually be what the user intended. Adjust the two functions to reflect the comments. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, udp, util: Pass socket creation errors all the way up	Stefano Brivio	2023-03-09	3	-32/+39
\| \| \| \| \| \| \| \| \|	...starting from sock_l4(), pass negative error (errno) codes instead of -1. They will only be used in two commits from now, no functional changes intended here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Carry own definition of __bswap_constant{16,32}	Stefano Brivio	2023-03-09	1	-0/+11
\| \| \| \| \| \| \| \| \| \|	musl doesn't define those, use our own definition there. This is a trivial implementation, similar to what's shipped by e.g. uClibc, glibc, libiio. Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Fix header includes to build with musl	Chris Kuhn	2023-03-09	9	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Roughly inspired from a patch by Chris Kuhn: fix up includes so that we can build against musl: glibc is more lenient as headers generally include a larger amount of other headers. Compared to the original patch, I only included what was needed directly in C files, instead of adding blanket includes in local header files. It's a bit more involved, but more consistent with the current (not ideal) situation. Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, passt: Rename stderr to force_stderr	Chris Kuhn	2023-03-09	3	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \|	While building against musl, gcc informs us that 'stderr' is a protected keyword. This probably comes from a #define stderr (stderr) in musl's stdio.h, to avoid a clash with extern FILE *const stderr, but I didn't really track it down. Just rename it to force_stderr, it makes more sense. [sbrivio: Added commit message] Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Use 8 KiB * netlink message header size as response buffer	Stefano Brivio	2023-03-09	1	-6/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...instead of BUFSIZ. On musl, BUFSIZ is 1024, so we'll typically truncate the response to the request we send in nl_link(). It's usually 8192 or more with glibc. There doesn't seem to be any macro defining the rtnetlink maximum message size, and iproute2 just hardcodes 1024 * 1024 for the receive buffer, but the example in netlink(7) makes somewhat sense, looking at the kernel implementation. It's not very clean, but we're very unlikely to hit that limit, and if we do, we'll find out painlessly, because NLA_OK() will tell us right away. Reported-by: Chris Kuhn <kuhnchris+passt@kuhnchris.eu> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, icmp, tcp, udp: Add options to bind to outbound address and interface	Stefano Brivio	2023-03-09	6	-35/+243
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I didn't notice earlier: libslirp (and slirp4netns) supports binding outbound sockets to specific IPv4 and IPv6 addresses, to force the source addresse selection. If we want to claim feature parity, we should implement that as well. Further, Podman supports specifying outbound interfaces as well, but this is simply done by resolving the primary address for an interface when the network back-end is started. However, since kernel version 5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for non-root users"), we can actually bind to a specific interface name, which doesn't need to be validated in advance. Implement -o / --outbound ADDR to bind to IPv4 and IPv6 addresses, and --outbound-if4 and --outbound-if6 to bind IPv4 and IPv6 sockets to given interfaces. Given that it probably makes little sense to select addresses and routes from interfaces different than the ones given for outbound sockets, also assign those as "template" interfaces, by default, unless explicitly overridden by '-i'. For ICMP and UDP, we call sock_l4() to open outbound sockets, as we already needed to bind to given ports or echo identifiers, and we can bind() a socket only once: there, pass address (if any) and interface (if any) for the existing bind() and setsockopt() calls. For TCP, in general, we wouldn't otherwise bind sockets. Add a specific helper to do that. For UDP outbound sockets, we need to know if the final destination of the socket is a loopback address, before we decide whether it makes sense to bind the socket at all: move the block mangling the address destination before the creation of the socket in the IPv4 path. This was already the case for the IPv6 path. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, passt.h: Rename "outbound" interface to "template" interface	Stefano Brivio	2023-03-09	2	-6/+9
\| \| \| \| \| \| \| \| \| \| \| \|	In preparation for the next patch, make it clear that the first routable interface fetched via netlink, or the one configured via -i/--interface, is simply used as template to copy addresses and routes, not an interface we actually use to derive the source address (which will be _bound to_) for outgoing packets. The man page and usage message appear to be already clear enough. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	contrib/selinux: Let interface users set paths for log, PID, socket files	Stefano Brivio	2023-03-09	1	-1/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Even libvirt itself will configure passt to write log, PID and socket files to different locations depending on whether the domain is started as root (/var/log/libvirt/...) or as a regular user (/var/log/<PID>/libvirt/...), and user_tmp_t would only cover the latter. Create interfaces for log and PID files, so that callers can specify different file contexts for those, and modify the interface for the UNIX socket file to allow different paths as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Laine Stump <laine@redhat.com> Reviewed-by: Laine Stump <laine@redhat.com>
*	contrib/selinux: Allow binding and connecting to all UDP and TCP ports	Stefano Brivio	2023-03-09	1	-12/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Laine reports that with a simple: <portForward proto='tcp'> <range start='2022' to='22'/> </portForward> in libvirt's domain XML, passt won't start as it fails to bind arbitrary ports. That was actually the intention behind passt_port_t: the user or system administrator should have explicitly configured allowed ports on a given machine. But it's probably not realistic, so just allow any port to be bound and forwarded. Also fix up some missing operations on sockets. Reported-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Laine Stump <laine@redhat.com> Reviewed-by: Laine Stump <laine@redhat.com>
*	contrib/selinux: Let passt write to stdout and stderr when it starts	Stefano Brivio	2023-03-09	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	Otherwise, it's unusable as stand-alone tool, or in foreground mode, and it's also impossible to get output from --help or --version, because for SELinux it's just a daemon. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Laine Stump <laine@redhat.com> Reviewed-by: Laine Stump <laine@redhat.com>
*	contrib/selinux: Drop duplicate init_daemon_domain() rule	Stefano Brivio	2023-03-09	1	-1/+0
\| \| \| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Laine Stump <laine@redhat.com> Reviewed-by: Laine Stump <laine@redhat.com>
*	udp: Fix signedness warning on 32-bits architectures	Stefano Brivio	2023-03-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a ssize_t is an int: udp.c: In function ‘udp_sock_handler’: udp.c:774:23: warning: comparison of integer expressions of different signedness: ‘unsigned int’ and ‘ssize_t’ {aka ‘int’} [-Wsign-compare] 774 \| for (i = 0; i < n; i += m) { \| ^ udp.c:781:43: warning: comparison of integer expressions of different signedness: ‘unsigned int’ and ‘ssize_t’ {aka ‘int’} [-Wsign-compare] 781 \| for (m = 1; i + m < n; m++) { \| Change 'i' and 'm' counters in udp_sock_handler() to signed versions, to match ssize_t n. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Makefile: Fix SuperH 4 builds: it's AUDIT_ARCH_SH, not AUDIT_ARCH_SH4	Stefano Brivio	2023-03-09	1	-0/+1
\| \| \| \| \| \| \|	sh4 builds currently fail because of this: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=sh4&ver=0.0%7Egit20230216.4663ccc-1&stamp=1677091350&raw=0 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Makefile, seccomp.sh: Fix cross-builds, adjust syscalls list to compiler	Stefano Brivio	2023-03-09	2	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \|	Debian cross-building automatic checks: http://crossqa.debian.net/src/passt currently fail because we don't use the right target architecture and compiler while building the system call lists and resolving their numbers in seccomp.sh. Pass ARCH and CC to seccomp.sh and use them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Add own prototype for __clone2() on ia642023_02_27.c538ee8	Stefano Brivio	2023-02-27	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	ia64 needs to use __clone2() as clone() is not available, but glibc doesn't export the prototype. Take it from clone(2) to avoid an implicit declaration: util.c: In function ‘do_clone’: util.c:512:16: warning: implicit declaration of function ‘__clone2’ [-Wimplicit-function-declaration] 512 \| return __clone2(fn, stack_area + stack_size / 2, stack_size / 2, \| ^~~~~~~~ Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	contrib/apparmor: Split profile into abstractions, use them	Stefano Brivio	2023-02-27	3	-60/+89
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	One day, libvirt might actually support running passt to provide guest connectivity. Should libvirtd (or virtqemud) start passt, it will need to access socket and PID files in specific locations, and passt needs to accept SIGTERM in case QEMU fails to start after passt is already started. To make this more convenient, split the current profile into two abstractions, for passt and for pasta, so that external programmes can include the bits they need (and especially not include the pasta abstraction if they only need to start passt), plus whatever specific adaptation is needed. For stand-alone usage of passt and pasta, the 'passt' profile simply includes both abstractions, plus rules to create and access PID and capture files in default or reasonable ($HOME) locations. Tested on Debian with libvirt 9.0.0 together with a local fix to start passt as intended, namely libvirt commit c0efdbdb9f66 ("qemu_passt: Avoid double daemonizing passt"). This is an example of how the libvirtd profile (or virtqemud abstraction, or virtqemud profile) can use this: # support for passt network back-end /usr/bin/passt Cx -> passt, profile passt { /usr/bin/passt r, owner @{run}/user/[0-9]/libvirt/qemu/run/passt/ rw, signal (receive) set=("term") peer=/usr/sbin/libvirtd, signal (receive) set=("term") peer=libvirtd, include if exists <abstractions/passt> } translated: - when executing /usr/bin/passt, switch to the subprofile "passt" (not the "discrete", i.e. stand-alone profile), described below. Scrub the environment (e.g. LD_PRELOAD is dropped) - in the "passt" subprofile: - allow reading the binary - allow read and write access to PID and socket files - make passt accept SIGTERM from /usr/sbin/libvirtd, and libvirtd peer names - include anything else that's needed by passt itself Suggested-by: Andrea Bolognani <abologna@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	qrap: Generate -netdev as JSON	Andrea Bolognani	2023-02-27	1	-1/+5
\| \| \| \| \| \| \| \| \| \|	While generating -device as JSON when JSON is in use is mandatory, because not doing so can often prevent the VM from starting up, using JSON for -netdev simply makes things a bit nicer. No reason not to do it, though. Signed-off-by: Andrea Bolognani <abologna@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	qrap: Introduce machine-specific PCI address base	Andrea Bolognani	2023-02-27	1	-10/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For pc machines, devices are placed directly on pci.0 with addresses like bus=pci.0,addr=0xa and in this case the existing code works correctly. For q35 machines, however, a separate PCI bus is created for each devices using a pcie-root-port, and the resulting addresses look like bus=pci.9,addr=0x0 In this case, we need to treat PCI addresses as decimal, not hexadecimal, both when parsing and generating them. This issue has gone unnoticed for a long time because it only shows up when enough PCI devices are present: for small numbers, decimal and hexadecimal overlap, masking the issue. Reported-by: Alona Paz <alkaplan@redhat.com> Fixes: 5307faa05997 ("qrap: Strip network devices from command line, set them up according to machine") Signed-off-by: Andrea Bolognani <abologna@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	qrap: Drop args in JSON format	Andrea Bolognani	2023-02-27	1	-0/+5
\| \| \| \| \| \| \| \| \|	When JSON support was introduced, the drop_args array has not been updated accordingly. Fixes: b944ca185587 ("qrap: Support JSON syntax for -device") Signed-off-by: Andrea Bolognani <abologna@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	qrap: Fix support for pc machines	Andrea Bolognani	2023-02-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	The JSON version of the template is incorrect, making qrap completely unusable when JSON arguments are used together with the pc machine type. Fixes: b944ca185587 ("qrap: Support JSON syntax for -device") Signed-off-by: Andrea Bolognani <abologna@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	qrap: Fix limits for PCI addresses	Andrea Bolognani	2023-02-27	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	The pci.0 bus on a pc machine has 32 slots. For q35 machines, we don't expect devices to be plugged into pcie.0 directly, so technically we could have a very large number of slots by adding many pcie-root-ports, but even in this scenario 32 is a reasonable number. Signed-off-by: Andrea Bolognani <abologna@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	log, conf, tap: Define die() as err() plus exit(), drop cppcheck workarounds	Stefano Brivio	2023-02-27	4	-18/+12
\| \| \| \| \| \| \| \| \| \| \|	If we define die() as a variadic macro, passing __VA_ARGS__ to err(), and calling exit() outside err() itself, we can drop the workarounds introduced in commit 36f0199f6ef4 ("conf, tap: Silence two false positive invalidFunctionArg from cppcheck"). Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	doc/demo: Fix and suppress ShellCheck warnings	Stefano Brivio	2023-02-27	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \|	ShellCheck reports (SC2034) that __qemu_arch is not used. Use it, and silence the resulting SC2086 warning as we want word splitting on options we pass with it. While at it, silence SC2317 warnings for commands in cleanup() that appear to be unreachable: cleanup() is only called as trap. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Fix definitions of SOCKET_MAX, TCP_MAX_CONNS	Stefano Brivio	2023-02-27	4	-5/+7
\| \| \| \| \| \| \| \|	...and, given that I keep getting this wrong, add a convenience macro, MAX_FROM_BITS(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Avoid (theoretical) resource leak (CWE-772) Coverity warning	Stefano Brivio	2023-02-27	1	-0/+3
\| \| \| \| \| \| \| \| \| \|	If tcp_timer_ctl() gets a socket number greater than SOCKET_MAX (2 ^ 24), we return error but we don't close the socket. This is a rather formal issue given that, at least on Linux, socket numbers are monotonic and we're in general not allowed to open so many sockets. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Avoid false (but convoluted) positive Coverity CWE-476 warning	Stefano Brivio	2023-02-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If there are no TCP options in the header, tcp_tap_handler() will pass the corresponding pointer, fetched via packet_get(), as NULL to tcp_conn_from_sock_finish(), which in turn indirectly calls tcp_opt_get(). If there are no options, tcp_opt_get() will stop right away because the option length is indicated as zero. However, if the logic is complicated enough to follow for static checkers, adding an explicit check against NULL in tcp_opt_get() is probably a good idea. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls()	Stefano Brivio	2023-02-27	2	-12/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We use the return value of fls() as array index for debug strings. While fls() can return -1 (if no bit is set), Coverity Scan doesn't see that we're first checking the return value of another fls() call with the same bitmask, before using it. Call fls() once, store its return value, check it, and use the stored value as array index. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Disable gcc strict aliasing rules as needed, drop workarounds	Stefano Brivio	2023-02-27	4	-33/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recently, commit 4ddbcb9c0c55 ("tcp: Disable optimisations for tcp_hash()") worked around yet another issue we hit with gcc 12 and '-flto -O2': some stores affecting the input data to siphash_20b() were omitted altogether, and tcp_hash() wouldn't get the correct hash for incoming connections. Digging further into this revealed that, at least according to gcc's interpretation of C99 aliasing rules, passing pointers to functions with different types compared to the effective type of the object (for example, a uint8_t pointer to an anonymous struct, as it happens in tcp_hash()), doesn't guarantee that stores are not reordered across the function call. This means that, in general, our checksum and hash functions might not see parts of input data that was intended to be provided by callers. Not even switching from uint8_t to character types, which should be appropriate here, according to C99 (ISO/IEC 9899, TC3, draft N1256), section 6.5, "Expressions", paragraph 7: An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [...] — a character type. does the trick. I guess this is also subject to interpretation: casting passed pointers to character types, and then using those as different types, might still violate (dubious) aliasing rules. Disable gcc strict aliasing rules for potentially affected functions, which, in turn, disables gcc's Type-Based Alias Analysis (TBAA) optimisations based on those function arguments. Drop the existing workarounds. Also the (seemingly?) bogus 'maybe-uninitialized' warning on the tcp_tap_handler() > tcp_hash() > siphash_20b() path goes away with -fno-strict-aliasing on siphash_20b(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Suppress knownConditionTrueFalse cppcheck false positive	Stefano Brivio	2023-02-27	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cppcheck 2.10 reports: tcp.c:1815:12: style: Condition 'wnd>prev_scaled' is always false [knownConditionTrueFalse] if ((wnd > prev_scaled && wnd * 99 / 100 < prev_scaled) \|\| ^ tcp.c:1808:8: note: Assignment 'wnd=((1<<(16+8))<(wnd))?(1<<(16+8)):(wnd)', assigned value is less than 1 wnd = MIN(MAX_WINDOW, wnd); ^ tcp.c:1811:19: note: Assuming condition is false if (prev_scaled == wnd) ^ tcp.c:1815:12: note: Condition 'wnd>prev_scaled' is always false if ((wnd > prev_scaled && wnd * 99 / 100 < prev_scaled) \|\| ^ but this is not actually the case: wnd is typically greater than 1, and might very well be greater than prev_scaled as well. I bisected this down to cppcheck commit b4d455df487c ("Fix 11349: FP negativeIndex for clamped array index (#4627)") and reported findings at https://github.com/danmar/cppcheck/pull/4627. Suppress the warning for the moment being. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>