passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	test: Select first reported IPv6 address for guest/host comparison	Stefano Brivio	2023-12-27	5	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we run passt nested (a guest connected via passt to a guest connected via passt to the host), the first guest (L1) typically has two IPv6 addresses on the same interface: one formed from the prefix assigned via SLAAC, and another one assigned via DHCPv6 (to match the address on the host). When we select addresses for comparison, in this case, we have multiple global unicast addresses -- again, on the same interface. Selecting the first reported one on both host and guest is not entirely correct (in theory, the order might differ), but works reasonably well. Use the trick from 5beef085978e ("test: Only select a single interface or gateway in tests") to ask jq(1) for the first address returned by the query. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ndp: Extend lifetime of prefix, router, RDNSS and search list	Stefano Brivio	2023-12-27	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, we have no mechanism to dynamically update IPv6 addressing, routing or DNS information (which should eventually be implemented via netlink monitor), so it makes no sense to limit lifetimes of NDP information to any particular value. If we do, with common configurations of systemd-networkd in a guest, we can end up in a situation where we have a /128 address assigned via DHCPv6, the NDP-assigned prefix expires, and the default route also expires. However, as there's a valid address, the prefix is not renewed. As a result, the default route becomes invalid and we lose it altogether, which implies that the guest loses IPv6 connectivity except for link-local communication. Set the router lifetime to the maximum allowed by RFC 8319, that is, 65535 seconds (about 18 hours). RFC 4861 limited this value to 9000 seconds, but RFC 8319 later updated this limit. Set prefix and DNS information lifetime to infinity. This is allowed by RFC 4861 and RFC 8319. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Make handling of shell prompts with escapes a little more reliable	David Gibson	2023-12-07	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When using the old-style "pane" methods of executing commands during the tests, we need to scan the shell output for prompts in order to tell when commands have finished. This is inherently unreliable because commands could output things that look like prompts, and prompts might not look like we expect them to. The only way to really fix this is to use a better way of dispatching commands, like the newer "context" system. However, it's awkward to convert everything to "context" right at the moment, so we're still relying on some tests that do work most of the time. It is, however, particularly sensitive to fancy coloured prompts using escape sequences. Currently we try to handle this by stripping actual ESC characters with tr, then looking for some common variants. We can do a bit better: instead strip all escape sequences using sed before looking for our prompt. Or, at least, any one using [a-zA-Z] as the terminating character. Strictly speaking ANSI escapes can be terminated by any character in 0x40..0x7e, which isn't easily expressed in a regexp. This should capture all common ones, though. With this transformation we can simplify the list of patterns we then look for as a prompt, removing some redundant variants. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't defer hash table removal2023_12_04.b86afe3	David Gibson	2023-12-04	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a TCP connection is closed, we mark it by setting events to CLOSED, then some time later we do final cleanups: closing sockets, removing from the hash table and so forth. This does mean that when making a hash lookup we need to exclude any apparent matches that are CLOSED, since they represent a stale connection. This can happen in practice if one connection closes and a new one with the same endpoints is started shortly afterward. Checking for CLOSED is quite specific to TCP however, and won't work when we extend the hash table to more general flows. So, alter the code to immediately remove the connection from the hash table when CLOSED, although we still defer closing sockets and other cleanup. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: "TCP" hash secret doesn't need to be TCP specific	David Gibson	2023-12-04	4	-35/+44
\| \| \| \| \| \| \| \| \| \| \| \| \|	The TCP state structure includes a 128-bit hash_secret which we use for SipHash calculations to mitigate attacks on the TCP hash table and initial sequence number. We have plans to use SipHash in places that aren't TCP related, and there's no particular reason they'd need their own secret. So move the hash_secret to the general context structure. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Add helpers to get the name of a pif	David Gibson	2023-12-04	3	-1/+42
\| \| \| \| \| \| \| \| \| \| \|	Future debugging will want to identify a specific passt interface. We make a distinction in these helpers between the name of the type of pif, and name of the pif itself. For the time being these are always the same thing, since we have at most instance of each type of pif. However, that might change in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Avoid hitting guestfish command length limits	David Gibson	2023-12-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In test/prepare-distro-img.sh we use guestfish to tweak our distro guest images to be suitable. Part of this is using a 'copy-in' directive to copy in the source files for passt itself. Currently we copy in all the files with a single 'copy-in', since it allows listing multiple files. However it turns out that the number of arguments it can accept is fairly limited and our current list of files is already very close to that limit. Instead, expand our list of files to one copy-in per file, avoiding that limitation. This isn't much slower, because all the commands still run in a single invocation of guestfish itself. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow,tcp: Use epoll_ref type including flow and side	David Gibson	2023-12-04	5	-30/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently TCP uses the 'flow' epoll_ref field for both connected sockets and timers, which consists of just the index of the relevant flow (connection). This is just fine for timers, for while it obviously works, it's subtly incomplete for sockets on spliced connections. In that case we want to know which side of the connection the event is occurring on as well as which connection. At present, we deduce that information by looking at the actual fd, and comparing it to the fds of the sockets on each side. When we use the flow table for more things, we expect more cases where something will need to know a specific side of a specific flow for an event, but nothing more. Therefore add a new 'flowside' epoll_ref field, with exactly that information. We use it for TCP connected sockets. This allows us to directly know the side for spliced connections. For "tap" connections, it's pretty meaningless, since the side is always the socket side. It still makes logical sense though, and it may become important for future flow table work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Use unsigned to represent side	David Gibson	2023-12-04	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, we use 'int' values to represent the "side" of a connection, which must always be 0 or 1. This turns out to be dangerous. In some cases we're going to want to put the side into a 1-bit bitfield. However, if that bitfield has type 'int', when we copy it out to a regular 'int' variable, it will be sign-extended and so have values 0 and -1, instead of 0 and 1. To avoid this, always use unsigned variables for the side. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow,tcp: Generalise TCP epoll_ref to generic flows	David Gibson	2023-12-04	3	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP uses three different epoll object types: one for connected sockets, one for timers and one for listening sockets. Listening sockets really need information that's specific to TCP, so need their own epoll_ref field. Timers and connected sockets, however, only need the connection (flow) they're associated with. As we expand the use of the flow table, we expect that to be true for more epoll fds. So, rename the "TCP" epoll_ref field to be a "flow" epoll_ref field that can be used both for TCP and for other future cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove unneccessary bounds check in tcp_timer_handler()	David Gibson	2023-12-04	1	-2/+2
\| \| \| \| \| \| \| \| \|	In tcp_timer_handler() we use conn_at_idx() to interpret the flow index from the epoll reference. However, this will never be NULL - we always put a valid index into the epoll_ref. Simplify slightly based on this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Introduce 'sidx' type to represent one side of one flow	David Gibson	2023-12-04	2	-0/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a number of places, we use indices into the flow table to identify a specific flow. We also have cases where we need to identify a particular side of a particular flow, and we expect those to become more common as we generalise the flow table to cover more things. To assist with that, introduces flow_sidx_t, an index type which identifies a specific side of a specific flow in the table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Suppress false cppcheck positive in flow_sidx()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add logging helpers for connection related messages	David Gibson	2023-12-04	4	-79/+96
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Most of the messages logged by the TCP code (be they errors, debug or trace messages) are related to a specific connection / flow. We're fairly consistent about prefixing these with the type of connection and the connection / flow index. However there are a few places where we put the index later in the message or omit it entirely. The template with the prefix is also a little bulky to carry around for every message, particularly for spliced connections. To help keep this consistent, introduce some helpers to log messages linked to a specific flow. It takes the flow as a parameter and adds a uniform prefix to each message. This makes things slightly neater now, but more importantly will help keep formatting consistent as we add more things to the flow table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make unified version of flow table compaction	David Gibson	2023-12-04	5	-44/+48
\| \| \| \| \| \| \| \| \| \| \|	tcp_table_compact() will move entries in the connection/flow table to keep it compact when other entries are removed. The moved entries need not have the same type as the flow removed, so it needs to be able to handle moving any type of flow. Therefore, move it to flow.c rather than being purportedly TCP specific. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: MAX_FROM_BITS() should be unsigned	David Gibson	2023-12-04	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MAX_FROM_BITS() computes the maximum value representable in a number of bits. The expression for that is an unsigned value, but we explicitly cast it to a signed int. It looks like this is because one of the main users is for FD_REF_MAX, which is used to bound fd values, typically stored as a signed int. The value MAX_FROM_BITS() is calculating is naturally non-negative, though, so it makes more sense for it to be unsigned, and to move the case to the definition of FD_REF_MAX. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Consolidate flow pointer<->index helpers	David Gibson	2023-12-04	4	-46/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index of their connection structures in the connection table, now become the unified flow table. We can easily combine these into a common helper. While we're there, add some trickery for some additional type safety. They also define their own CONN() versions, which aren't so easily combined since they need to return different types, but we can have them use a common helper. In the process, we standardise on always using an unsigned type to store the connection / flow index, which makes more sense. tcp.c's conn_at_idx() remains for now, but we change its parameter to unsigned to match. That in turn means we can remove a check for negative values from it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Move TCP connection table to unified flow table	David Gibson	2023-12-04	9	-82/+107
\| \| \| \| \| \| \| \| \| \| \| \| \|	We want to generalise "connection" tracking to things other than true TCP connections. Continue implenenting this by renaming the TCP connection table to the "flow table" and moving it to flow.c. The definitions are split between flow.h and flow_table.h - we need this separation to avoid circular dependencies: the definitions in flow.h will be needed by many headers using the flow mechanism, but flow_table.h needs all those protocol specific headers in order to define the full flow table entry. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Generalise connection types	David Gibson	2023-12-04	6	-40/+112
\| \| \| \| \| \| \| \| \| \| \| \|	Currently TCP connections use a 1-bit selector, 'spliced', to determine the rest of the contents of the structure. We want to generalise the TCP connection table to other types of flows in other protocols. Make a start on this by replacing the tcp_conn_common structure with a new flow_common structure with an enum rather than a simple boolean indicating the type of flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Add messages to static_assert() calls	David Gibson	2023-12-04	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A while ago, we updated passt to require C11, for several reasons, but one was to be able to use static_assert() for build time checks. The C11 version of static_assert() requires a message to print in case of failure as well as the test condition it self. It was extended in C23 to make the message optional, but we don't want to require C23 at this time. Unfortunately we missed that in some of the static_assert()s we already added which use the C23 form without a message. clang-tidy has a warning for this, but for some reason it's not seeming to trigger in the current cases (but did for some I'm working on adding). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: remove useless assignment	Laurent Vivier	2023-12-04	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tcp_send_flag(), a4826ee04b76 has replaced: th->doff = sizeof(th) / 4; th->doff += OPT_MSS_LEN / 4; th->doff += (1 + OPT_WS_LEN) / 4; by optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN; th->doff = (sizeof(th) + optlen) / 4; but forgot to remove the useless "th->doff += (1 + OPT_WS_LEN) / 4;" Fixes: a4826ee04b76 ("tcp: Defer and coalesce all segments with no data (flags) to handler") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd, util: Include additional headers to fix build with musl	Stefano Brivio	2023-12-02	2	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	lseek() is declared in unistd.h, and stdio.h provides sscanf(). Include these two headers in port_fwd.c. SIGCHLD, even if used exclusively for clone(), is defined in signal.h: add the include to util.h, as NS_CALL needs it. Reported-by: lemmi <lemmi@nerd2nerd.org> Link: https://github.com/void-linux/void-packages/actions/runs/6999782606/job/19039526604#step:7:57 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	packet: Offset plus length is not always uint32_t, but it's always size_t	Stefano Brivio	2023-12-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	According to gcc, PRIu32 matches the type of the argument we're printing here on both 64 and 32-bits architectures. According to Clang, though, that's not the case, as the result of the sum is an unsigned long on 64-bit. Use the z modifier, given that we're summing uint32_t to size_t, and the result is at most promoted to size_t. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Use 'z' length modifier for size_t/ssize_t conversions	Stefano Brivio	2023-12-02	6	-21/+22
\| \| \| \| \| \| \| \|	Types size_t and ssize_t are not necessarily long, it depends on the architecture. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	port_fwd, util: Don't bind UDP ports with opposite-side bound TCP ports	Stefano Brivio	2023-11-22	5	-11/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When pasta periodically scans bound ports and binds them on the other side in order to forward traffic, we bind UDP ports for corresponding TCP port numbers, too, to support protocols and applications such as iperf3 which use UDP port numbers matching the ones used by the TCP data connection. If we scan UDP ports in order to bind UDP ports, we skip detection of the UDP ports we already bound ourselves, to avoid looping back our own ports. Same with scanning and binding TCP ports. But if we scan for TCP ports in order to bind UDP ports, we need to skip bound TCP ports too, otherwise, as David pointed out: - we find a bound TCP port on side A, and bind the corresponding TCP and UDP ports on side B - at the next periodic scan, we find that UDP port bound on side B, and we bind the corresponding UDP port on side A - at this point, we unbind that UDP port on side B: we would otherwise loop back our own port. To fix this, we need to avoid binding UDP ports that we already bound, on the other side, as a consequence of finding a corresponding bound TCP port. Reproducing this issue is straightforward: ./pasta -- iperf3 -s # Wait one second, then from another terminal: iperf3 -c ::1 -u Reported-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp> Analysed-by: David Gibson <david@gibson.dropbear.id.au> Fixes: 457ff122e33c ("udp,pasta: Periodically scan for ports to automatically forward") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	valgrind: Don't disable optimizations for valgrind builds2023_11_19.4f1709d	David Gibson	2023-11-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we plan to use valgrind, we need to build passt a bit differently: * We need debug symbols so that valgrind can match problems it finds to meaningful locations * We need to allow additional syscalls in the seccomp filter, because valgrind's wrappers need them Currently we also disable optimization (-O0). This is unfortunate, because it will make valgrind tests even slower than they'd otherwise be. Worse, it's possible that the asm behaviour without optimization might be different enough that valgrind could miss a use of uninitialized variable or other fault it would detect. I suspect this was originally done because without it inlining could mean that suppressions we use don't reliably match the places we want them to. Alas, it turns out this is true even with -O0. We've now implemented a more robust workaround for this (explicit ((noinline)) attributes when compiled with -DVALGRIND). So, we can re-enable optimization for valgrind builds, making them closer to regular builds. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	valgrind: Adjust suppression for MSG_TRUNC with NULL buffer	David Gibson	2023-11-19	3	-3/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	valgrind complains if we pass a NULL buffer to recv(), even if we use MSG_TRUNC, in which case it's actually safe. For a long time we've had a valgrind suppression for this. It singles out the recv() in tcp_sock_consume(), the only place we use MSG_TRUNC. However, tcp_sock_consume() only has a single caller, which makes it a prime candidate for inlining. If inlined, it won't appear on the stack and valgrind won't match the correct suppression. It appears that certain compiler versions (for example gcc-13.2.1 in Fedora 39) will inline this function even with the -O0 we use for valgrind builds. This breaks the suppression leading to a spurious failure in the tests. There's not really any way to adjust the suppression itself without making it overly broad (we don't want to match other recv() calls). So, as a hack explicitly prevent inlining of this function when we're making a valgrind build. To accomplish this add an explicit -DVALGRIND when making such a build. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp,pasta: Periodically scan for ports to automatically forward	David Gibson	2023-11-19	2	-2/+75
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pasta supports automatic port forwarding, where we look for listening sockets in /proc/net (in both namespace and outside) and establish port forwarding to match. For TCP we do this scan both at initial startup, then periodically thereafter. For UDP however, we currently only scan at start. So unlike TCP we won't update forwarding to handle services that start after pasta has begun. There's no particular reason for that, other than that we didn't implement it. So, remove that difference, by scanning for new UDP forwards periodically too. The logic is basically identical to that for TCP, but it needs some changes to handle the mildly different data structures in the UDP case. Link: https://bugs.passt.top/show_bug.cgi?id=45 Link: https://github.com/rootless-containers/rootlesskit/issues/383 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Simplify away tcp_port_rebind()	David Gibson	2023-11-19	1	-29/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_port_rebind() is desgined to be called from NS_CALL() and has two disjoint cases: one where it enters the namespace (outbound forwards) and one where it doesn't (inbound forwards). We only actually need the NS_CALL() framing for the outbound case, for inbound we can just call tcp_port_do_rebind() directly. So simplify tcp_port_rebind() to tcp_port_rebind_outbound(), allowing us to eliminate an awkward parameters structure. With that done we can safely rename tcp_port_do_rebind() to tcp_port_rebind() for brevity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Use common helper for rebinding inbound and outbound ports	David Gibson	2023-11-19	1	-47/+45
\| \| \| \| \| \| \| \| \|	tcp_port_rebind() has two cases with almost but not quite identical code. Simplify things a bit by factoring this out into a single parameterised helper, tcp_port_do_rebind(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	clang-tidy: Suppress silly misc-include-cleaner warnings	David Gibson	2023-11-19	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	clang-tidy from LLVM 17.0.3 (which is in Fedora 39) includes a new "misc-include-cleaner" warning that tries to make sure that headers directly provide the things that are used in the .c file. That sounds great in theory but is in practice unusable: Quite a few common things in the standard library are ultimately provided by OS-specific system headers, but for portability should be accessed via closer-to-standardised library headers. This will warn constantly about such cases: e.g. it will want you to include <linux/limits.h> instead of <limits.h> to get PATH_MAX. So, suppress this warning globally in the Makefile. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap, pasta: Handle short writes to /dev/tap2023_11_10.5ec3634	David Gibson	2023-11-10	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tap_send_frames_pasta() sends frames to the namespace by sending them to our the /dev/tap device. If that write() returns an error, we already handle it. However we don't handle the case where the write() returns short, meaning we haven't successfully transmitted the whole frame. I don't know if this can ever happen with the kernel tap device, but we should at least report the case so we don't get a cryptic failure. For the purposes of the return value for tap_send_frames_pasta() we treat this case as though it was an error (on the grounds that a partial frame is no use to the namespace). Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	tap, pasta: Handle incomplete tap sends for pasta too	David Gibson	2023-11-10	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since a469fc39 ("tcp, tap: Don't increase tap-side sequence counter for dropped frames") we've handled more gracefully the case where we get data from the socket side, but are temporarily unable to send it all to the tap side (e.g. due to full buffers). That code relies on tap_send_frames() returning the number of frames it successfully sent, which in turn gets it from tap_send_frames_passt() or tap_send_frames_pasta(). While tap_send_frames_passt() has returned that information since b62ed9ca ("tap: Don't pcap frames that didn't get sent"), tap_send_frames_pasta() always returns as though it succesfully sent every frame. However there certainly are cases where it will return early without sending all frames. Update it report that properly, so that the calling functions can handle it properly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Don't use TCP_WINDOW_CLAMP	David Gibson	2023-11-10	2	-60/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On the L2 tap side, we see TCP headers and know the TCP window that the ultimate receiver is advertising. In order to avoid unnecessary buffering within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt to advertise that window back to the original sock-side sender using TCP_WINDOW_CLAMP. However, TCP_WINDOW_CLAMP just doesn't work like this. Prior to kernel commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply had no effect on established sockets. After that commit, it does affect established sockets but doesn't behave the way we need: * It appears to be designed only to shrink the window, not to allow it to re-expand. * More importantly, that commit has a serious bug where if the setsockopt() is made when the existing kernel advertised window for the socket happens to be zero, it will now become locked at zero, stopping any further data from being received on the socket. Since this has never worked as intended, simply remove it. It might be possible to re-implement the intended behaviour by manipulating SO_RCVBUF, so we leave a comment to that effect. This kernel bug is the underlying cause of both the linked passt bug and the linked podman bug. We attempted to fix this before with passt commit d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag"). However while that commit masked the bug for some cases, it didn't really address the problem. Fixes: d3192f67c492 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag") Link: https://github.com/containers/podman/issues/20170 Link: https://bugs.passt.top/show_bug.cgi?id=74 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Rename and small cleanup to tcp_clamp_window()	David Gibson	2023-11-10	1	-11/+10
\| \| \| \| \| \| \| \| \| \| \| \|	tcp_clamp_window() is _mostly_ about using TCP_WINDOW_CLAMP to control the sock side advertised window, but it is also responsible for actually updating the conn->wnd_from_tap value. Rename to tcp_tap_window_update() to reflect that broader purpose, and pull the logic that's not TCP_WINDOW_CLAMP related out to the front. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/lib/perf_report: Fix up table highlight for pasta's local flows	Stefano Brivio	2023-11-10	1	-1/+9
\| \| \| \| \| \| \| \| \|	As commit 29269705239f ("test/perf: Small MTUs for spliced TCP aren't interesting") drops all columns for TCP test MTUs except for one, in throughput test for pasta's local flows, the first column we need to highlight in that table is now the second one. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Revert "selinux: Drop user_namespace class rules for Fedora 37"2023_11_07.56d9f6d	Stefano Brivio	2023-11-07	2	-0/+4
\| \| \| \| \| \| \| \|	This reverts commit 3fb3f0f7a59498bdea1d199eecfdbae6c608f78f: it was meant as a patch for Fedora 37 (and no later versions), not something I should have merged upstream. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Allow passt to talk over unconfined_t UNIX domain socket for --fd2023_11_07.74e6f48	Stefano Brivio	2023-11-07	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If passt is started with --fd to talk over a pre-opened UNIX domain socket, we don't really know what label might be associated to it, but at least for an unconfined_t socket, this bit of policy wouldn't belong to anywhere else: enable that here. This is rather loose, of course, but on the other hand passt will sandbox itself into an empty filesystem, so we're not really adding much to the attack surface except for what --fd is supposed to do. Reported-by: Matej Hrica <mhrica@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2247221 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Match implicit va_start() with va_end() in vlogmsg()	Stefano Brivio	2023-11-07	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	According to C99, 7.15.1: Each invocation of the va_start and va_copy macros shall be matched by a corresponding invocation of the va_end macro in the same function and the same applies to C11. I still have to come across a platform where va_end() actually does something, but thus spake the standard. This would be reported by Coverity as "Missing varargs init or cleanup" (CWE-573). Fixes: c0426ff10bc9 ("log: Add vlogmsg()") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Don't try to read bound ports from invalid file handles	Stefano Brivio	2023-11-07	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a minimal fix for what would be reported by Coverity as "Improper use of negative value" (CWE-394): port_fwd_init() doesn't guarantee that all the pre-opened file handles are actually valid. We should probably warn on failing open() and open_in_ns() in port_fwd_init(), too, but that's outside the scope of this minimal fix. Before commit 5a0485425bc9 ("port_fwd: Pre-open /proc/net/* files rather than on-demand"), we used to have a single open() call and a check after it. Fixes: 5a0485425bc9 ("port_fwd: Pre-open /proc/net/* files rather than on-demand") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Sequence numbers are actually 32 bits wide	Stefano Brivio	2023-11-07	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Harmless, as we use sequence numbers monotonically anyway, but now clang-tidy reports: /home/sbrivio/passt/netlink.c:155:7: error: format specifies type 'unsigned short' but the argument has type '__u32' (aka 'unsigned int') [clang-diagnostic-format,-warnings-as-errors] nh->nlmsg_seq, seq); ^ /home/sbrivio/passt/log.h:26:7: note: expanded from macro 'die' err(__VA_ARGS__); \ ^~~~~~~~~~~ /home/sbrivio/passt/log.h:19:34: note: expanded from macro 'err' ^~~~~~~~~~~ Suppressed 222820 warnings (222816 in non-user code, 4 NOLINT). Use -header-filter=.* to display errors from all non-system headers. Use -system-headers to display errors from system headers as well. 1 warning treated as error make: *** [Makefile:255: clang-tidy] Error 1 Fixes: 9d4ab98d538f ("netlink: Add nl_do() helper for simple operations with error checking") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Simplify calculation of "omit" time for TCP throughput	David Gibson	2023-11-07	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For the TCP throughput tests, we use iperf3's -O "omit" option which ignores results for the given time at the beginning of the test. Currently we calculate this as 1/6th of the test measurement time. The purpose of -O, however, is to skip over the TCP slow start period, which in no way depends on the overall length of the test. The slow start time is roughly speaking log_2 ( max_window_size / MSS ) * round_trip_time These factors all vary between tests and machines we're running on, but we can estimate some reasonable bounds for them: * The maximum window size is bounded by the buffer sizes at each end, which shouldn't exceed 16MiB * The mss varies with the MTU we use, but the smallest we use in tests is ~256 bytes * Round trip time will vary with the system, but with these essentially local transfers it will typically be well under 1ms (on my laptop it is closer to 0.03ms) That gives a worst case slow start time of about 16ms. Setting an omit time of 0.1s uniformly is therefore more than enough, and substantially smaller than what we calculate now for the default case (10s / 6 ~= 1.7s). This reduces total time for the standard benchmark run by around 30s. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Remove unnecessary --pacing-timer options	David Gibson	2023-11-07	2	-3/+3
\| \| \| \| \| \| \| \| \| \|	We always set --pacing-timer when invoking iperf3. However, the iperf3 man page implies this is only relevant for the -b option. We only use the -b option for the UDP tests, not TCP, so remove --pacing-timer from the TCP cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: "MTU" changes in passt_tcp host to guest aren't useful	David Gibson	2023-11-07	1	-29/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The TCP packet size used on the passt L2 link (qemu socket) makes a huge difference to passt/pasta throughput; many of passt's overheads (chiefly syscalls) are per-packet. That packet size is largely determined by the MTU on the L2 link, so we benchmark for a number of different MTUs. That works well for the guest to host transfers. For the host to guest transfers, we purport to test for different MTUs, but we're not actually adjusting anything interesting. The host to guest transfers adjust the MTU on the "host's" (actually ns) loopback interface. However, that only affects the packet size for the socket going to passt, not the packet size for the L2 link that passt manages - passt can and will repack the stream into packets of its own size. Since the depacketization on that socket is handled by the kernel it doesn't have a lot of bearing on passt's performance. We can't fix this by changing the L2 link MTU from the guest side (as we do for guest to host), because that would only change the guest's view of the MTU, passt would still think it has the large MTU. We could test this by using the --mtu option to passt, but that would require restarting passt for each run, which is awkward in the current setup. So, for now, drop all the "small MTU" tests for host to guest. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Explicitly control UDP packet length, instead of MTU	David Gibson	2023-11-07	2	-94/+75
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Packet size can make a big difference to UDP throughput, so it makes sense to measure it for a variety of different sizes. Currently we do this by adjusting the MTU on the relevant interface before running iperf3. However, the UDP packet size has no inherent connection to the MTU - it's controlled by the sender, and the MTU just affects whether the packet will make it through or be fragmented. The only reason adjusting the MTU works is because iperf3 bases its default packet size on the (path) MTU. We can test this more simply by using the -l option to the iperf3 client to directly control the packet size, instead of adjusting the MTU. As well as simplifying this lets us test different packet sizes for host to ns traffic. We couldn't do that previously because we don't have permission to change the MTU on the host. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Small MTUs for spliced TCP aren't interesting	David Gibson	2023-11-07	1	-52/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we make TCP throughput measurements for spliced connections with a number of different MTU values. However, the results from this aren't really interesting. Unlike with tap connections, spliced connections only involve the loopback interface on host and container, not a "real" external interface. lo typically has an MTU of 65535 and there is very little reason to ever change that. So, the measurements for smaller MTUs are rarely going to be relevant. In addition, the fact that we can offload all the {de,}packetization to the kernel with splice(2) means that the throughput difference between these MTUs isn't very great anyway. Remove the short MTUs and only show spliced throughput for the normal 65535 byte loopback MTU. This reduces runtime of the performance tests on my laptop by about 1 minute (out of ~24 minutes). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Start iperf3 server less often	David Gibson	2023-11-07	5	-109/+213
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we start both the iperf3 server(s) and client(s) afresh each time we want to make a bandwidth measurement. That's not really necessary as usually a whole batch of bandwidth measurements can use the same server. Split up the iperf3 directive into 3 directives: iperf3s to start the server, iperf3 to make a measurement and iperf3k to kill the server, so that we can start the server less often. This - and more importantly, the reduced number of waits for the server to be ready - reduces runtime of the performance tests on my laptop by about 4m (out of ~28minutes). For now we still restart the server between IPv4 and IPv6 tests. That's because in some cases the latency measurements we make in between use the same ports. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Get iperf3 stats from client side	David Gibson	2023-11-07	2	-19/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	iperf3 generates statistics about its run on both the client and server sides. They don't have exactly the same information, but both have the pieces we need (AFAICT the server communicates some nformation to the client over the control socket, so the most important information is in the client side output, even if measured by the server). Currently we use the server side information for our measurements. Using the client side information has several advantages though: * We can directly wait for the client to complete and we know we'll have the output we want. We don't need to sleep to give the server time to write out the results. * That in turn means we can wrap up as soon as the client is done, we don't need to wait overlong to make sure everything is finished. * The slightly different organisation of the data in the client output means that we always want the same json value, rather than requiring slightly different onces for UDP and TCP. The fact that we avoid some extra delays speeds up the overal run of the perf tests by around 7 minutes (out of around 35 minutes) on my laptop. The fact that we no longer unconditionally kill client and server after a certain time means that the client could run indefinitely if the server doesn't respond. We mitigate that by setting 1s connect timeout on the client. This isn't foolproof - if we get an initial response, but then lose connectivity this could still run indefinitely, however it does cover by far the most likely failure cases. --snd-timeout would provide more robustness, but I've hit odd failures when trying to use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Remove stale iperf3c/iperf3s directives	David Gibson	2023-11-07	2	-6/+1
\| \| \| \| \| \| \| \| \| \|	Some older revisions used separate iperf3c and iperf3s test directives to invoke the iperf3 client and server. Those were combined into a single iperf3 directive some time ago, but a couple of places still have the old syntax. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove socket from udp_{tap,splice}_map when timed out	David Gibson	2023-11-07	1	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We save sockets bound to particular ports in udp_{tap,splice}_map for reuse later. If they're not used for a time, we time them out and close them. However, when that happened, we weren't actually removing the fds from the relevant map. That meant that later interactions on the same port could get a stale fd from the map. The stale fd might be closed, leading to unexpected EBADF errors, or it could have been re-used by a completely different socket bound to a different port, which could lead to us incorrectly forwarding packets. Reported-by: Chris Kuhn <kuhnchris@kuhnchris.eu> Reported-by: Jay <bugs.passt.top@bitsbetwixt.com> Link: https://bugs.passt.top/show_bug.cgi?id=57 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Consistently use -1 to indicate un-opened sockets in maps	David Gibson	2023-11-07	3	-5/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp uses the udp_tap_map, udp_splice_ns and udp_splice_init tables to keep track of already opened sockets bound to specific ports. We need a way to indicate entries where a socket hasn't been opened, but the code isn't consistent if this is indicated by a 0 or a -1: * udp_splice_sendfrom() and udp_tap_handler() assume that 0 indicates an unopened socket * udp_sock_init() fills in -1 for a failure to open a socket * udp_timer_one() is somewhere in between, treating only strictly positive fds as valid -1 (or, at least, negative) is really the correct choice here, since 0 is a theoretically valid fd value (if very unlikely in practice). Change to use that consistently throughout. The table does need to be initialised to all -1 values before any calls to udp_sock_init() which can happen from conf_ports(). Because C doesn't make it easy to statically initialise non zero values in large tables, this does require a somewhat awkward call to initialise the table from conf(). This is the best approach I could see for the short term, with any luck it will go away at some point when those socket tables are replaced by a unified flow table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>