passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	pif: Record originating pif in listening socket refs	David Gibson	2023-11-07	6	-23/+28
\| \| \| \| \| \| \| \| \| \|	For certain socket types, we record in the epoll ref whether they're sockets in the namespace, or on the host. We now have the notion of "pif" to indicate what "place" a socket is associated with, so generalise the simple one-bit 'ns' to a pif id. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Introduce notion of passt/pasta interface	David Gibson	2023-11-07	2	-1/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several possible ways of communicating with other entities. We use sockets to communicate with the host and other network sites, but also in a different context to communicate "spliced" channels to a namespace. We also use a tuntap device or a qemu socket to communicate with the namespace or guest. For the time being these are just defined implicitly by how we structure things. However, there are other communication channels we want to use in future (e.g. virtio-user), and we want to allow more flexible forwarding between those. To accomplish that we're going to want a specific way of referring to those channels. Introduce the concept of a "passt/pasta interface" or "pif" representing a specific channel to communicate network data. Each pif is assumed to be associated with a specific network namespace in the broad sense (that is as a place where IP addresses have a consistent meaning - not the Linux specific sense). But there could be multiple pifs communicating with the same namespace (e.g. the spliced and tap interfaces in pasta). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Clean up ref initialisation in udp_sock_init()	David Gibson	2023-11-07	1	-5/+2
\| \| \| \| \| \| \| \| \|	udp_sock_init() has a number of paths that initialise uref differently. However some of the fields are initialised the same way in all of them. Move those fields into the original initialiser to save a few lines. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Simplify get_bound_ports_() to port_fwd_scan_()	David Gibson	2023-11-07	3	-38/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	get_bound_ports_() now only use their context and ns parameters to determine which forwarding maps they're operating on. Each function needs the map they're actually updating, as well as the map for the other direction, to avoid creating forwarding loops. The UDP function also requires the corresponding TCP map, to implement the behaviour where we forward UDP ports of the same number as bound TCP ports for tools like iperf3. Passing those maps directly as parameters simplifies the code without making the callers life harder, because those already know the relevant maps. IMO, invoking these functions in terms of where they're looking for updated forwarding also makes more logical sense than in terms of where they're looking for bound ports. Given that new way of looking at the functions, also rename them to port_fwd_scan_(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Move port scanning /proc fds into struct port_fwd	David Gibson	2023-11-07	3	-35/+36
\| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we store /proc/net fds used to implement automatic port forwarding in the proc_net_{tcp,udp} fields of the main context structure. However, in fact each of those is associated with a particular direction of forwarding, and we already have struct port_fwd which collects all other information related to a particular direction of port forwarding. We can simplify things a bit by moving the /proc fds into struct port_fwd. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Split TCP and UDP cases for get_bound_ports()	David Gibson	2023-11-07	3	-36/+47
\| \| \| \| \| \| \| \| \| \| \| \|	Currently get_bound_ports() takes a parameter to determine if it scans for UDP or TCP bound ports, but in fact there's almost nothing in common between those two paths. The parameter appears primarily to have been a convenience for when we needed to invoke this function via NS_CALL(). Now that we don't need that, split it into separate TCP and UDP versions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Don't NS_CALL get_bound_ports()	David Gibson	2023-11-07	2	-71/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we want to scan for bound ports in the namespace we use NS_CALL() to run get_bound_ports() in the namespace. However, the only thing it actually needed to be in the namespace for was to open the /proc/net file it was scanning. Since we now always pre-open those, we no longer need to switch to the namespace for the actual get_bound_ports() calls. That in turn means that tcp_port_detect() doesn't need to run in the ns either, and we can just replace it with inline calls to get_bound_ports(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Pre-open /proc/net/* files rather than on-demand	David Gibson	2023-11-07	1	-18/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	procfs_scan_listen() can either use an already opened fd for a /proc/net file, or it will open it. So, effectively it will open the file on the first call, then re-use the fd in subsequent calls. However, it's not possible to open the /proc/net files after we isolate our filesystem in isolate_prefork(). That means that for each /proc/net file we must call procfs_scan_listen() at least once before isolate_prefork(), or it won't work afterwards. That happens to be the case, but it's a pretty fragile requirement. To make this more robust, instead always pre-open the /proc files we will need in get_bounds_port_init() and have procfs_scan_listen() just use those. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Add open_in_ns() helper	David Gibson	2023-11-07	2	-0/+54
\| \| \| \| \| \| \| \| \| \|	Most of our helpers which need to enter the pasta network namespace are quite specialised. Add one which is rather general - it just open()s a given file in the namespace context and returns the fd back to the main namespace. This will have some future uses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Better parameterise procfs_scan_listen()	David Gibson	2023-11-07	1	-31/+24
\| \| \| \| \| \| \| \| \| \| \| \| \|	procfs_scan_listen() does some slightly clunky logic to deduce the fd it wants to use, the path it wants to open and the state it's looking for based on parameters for protocol, IP version and whether we're in the namespace. However, the caller already has to make choices based on similar parameters so it can just pass in the things that procfs_scan_listen() needs directly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Move automatic port forwarding code to port_fwd.[ch]	David Gibson	2023-11-07	8	-153/+185
\| \| \| \| \| \| \| \| \| \| \| \| \|	The implementation of scanning /proc files to do automatic port forwarding is a bit awkwardly split between procfs_scan_listen() in util.c, get_bound_ports() and related functions in conf.c and the initial setup code in conf(). Consolidate all of this into port_fwd.h, which already has some related definitions, and a new port_fwd.c. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Cleaner initialisation of default forwarding modes	David Gibson	2023-11-07	1	-33/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Initialisation of the forwarding mode variables is complicated a bit by the fact that we have different defaults for passt and pasta modes. This leads to some debateably duplicated code between those two cases. More significantly, however, currently the final setting of the mode variable is interleaved with the code to actually start automatic scanning when that's selected. This essentially mingles "policy" code (setting the default mode), with implementation code (making that happen). That's a bit inflexible if we want to change policies in future. Disentangle these two pieces, and use a slightly different construction to make things briefer as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Drop user_namespace class rules for Fedora 37	Stefano Brivio	2023-11-07	2	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	With current selinux-policy-37.22-1.fc37.noarch, and presumably any future update for Fedora 37, the user_namespace class is not available, so statements using it prevent the policy from being loaded. If a class is not defined in the base policy, any related permission is assumed to be enabled, so we can safely drop those. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2237996 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	dhcp: put option 53 at the beginning2023_10_04.f851084	Stas Sergeev	2023-10-04	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	... unless it is listed in 55. Many clients expect option 53 at the beginning. mTCP has this code: if ( resp->options[0] != 53 ) { TRACE_WARN(( "Dhcp: first option was not a Dhcp msg type\n" )); return; } wattcp32 has this: static int DHCP_is_ack (void) { const BYTE opt = (const BYTE) &dhcp_in.dh_opt[4]; return (opt[0] == DHCP_OPT_MSG_TYPE && opt[1] == 1 && opt[2] == DHCP_ACK); } static int DHCP_is_nack (void) { const BYTE opt = (const BYTE) &dhcp_in.dh_opt[4]; return (opt[0] == DHCP_OPT_MSG_TYPE && opt[1] == 1 && opt[2] == DHCP_NAK); } Link: https://bugs.passt.top/show_bug.cgi?id=77 Signed-off-by: Stas Sergeev <stsp2@yandex.ru> [sbrivio: s/options 53/option 53/ and s/other/others/ in comment] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tap: Don't increase tap-side sequence counter for dropped frames	Stefano Brivio	2023-10-04	3	-10/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...so that we'll retry sending them, instead of more-or-less silently dropping them. This happens quite frequently if our sending buffer on the UNIX domain socket is heavily constrained (for instance, by the 208 KiB default memory limit). It might be argued that dropping frames is part of the expected TCP flow: we don't dequeue those from the socket anyway, so we'll eventually retransmit them. But we don't need the receiver to tell us (by the way of duplicate or missing ACKs) that we couldn't send them: we already know as sendmsg() reports that. This seems to considerably increase throughput stability and throughput itself for TCP connections with default wmem_max values. Unfortunately, the 16 bits left as padding in the frame descriptors we use internally aren't enough to uniquely identify for which connection we should update sequence numbers: create a parallel array of pointers to sequence numbers and L4 lengths, of TCP_FRAMES_MEM size, and go through it after calling sendmsg(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag	Stefano Brivio	2023-10-04	1	-5/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It looks like we need it as workaround for this situation, readily reproducible at least with a 6.5 Linux kernel, with default rmem_max and wmem_max values: - an iperf3 client on the host sends about 160 KiB, typically segmented into five frames by passt. We read this data using MSG_PEEK - the iperf3 server on the guest starts receiving - meanwhile, the host kernel advertised a zero-sized window to the sender, as expected - eventually, the guest acknowledges all the data sent so far, and we drop it from the buffer, courtesy of tcp_sock_consume(), using recv() with MSG_TRUNC - the client, however, doesn't get an updated window value, and even keepalive packets are answered with zero-window segments, until the connection is closed It looks like dropping data from a socket using MSG_TRUNC doesn't cause a recalculation of the window, which would be expected as a result of any receiving operation that invalidates data on a buffer (that is, not with MSG_PEEK). Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to the previous value we clamped to, forces a recalculation of the window which is advertised to the sender. I couldn't quite confirm this issue by following all the possible code paths in the kernel, yet. If confirmed, this should be fixed in the kernel, but meanwhile this workaround looks robust to me (and it will be needed for backward compatibility anyway). Reported-by: Matej Hrica <mhrica@redhat.com> Link: https://bugs.passt.top/show_bug.cgi?id=74 Analysed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Fix comment to tcp_sock_consume()	Stefano Brivio	2023-10-04	1	-1/+1
\| \| \| \| \| \| \| \| \|	Note that tcp_sock_consume() doesn't update ACK sequence counters anymore. Fixes: cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	cppcheck: Work around bug in cppcheck 2.12.0	David Gibson	2023-10-04	1	-0/+7
\| \| \| \| \| \| \| \| \|	cppcheck 2.12.0 (and maybe some other versions) things this if condition is always true, which is demonstrably not true. Work around the bug for now. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Use "exhaustive" level checking when available	David Gibson	2023-10-04	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recent enough cppcheck versions (at least as of cppcheck 2.12) give this error processing conf.c: conf.c:1179:1: information: ValueFlow analysis is limited in conf. Use --check-level=exhaustive if full analysis is wanted. [checkLevelNormal] Adding --check-level=exhaustive doesn't seem to significantly increase the cppcheck run time for us, so enable it when possible, suppressing that warning. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Remove overly cryptic selection of forward table	David Gibson	2023-10-04	1	-12/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a couple of places in conf(), we use a local 'fwd' variable to reference one of our forwarding tables. The value depends on which command line option we're currently looking at, and is initialized rather cryptically from an assignment side-effect within the if condition checking that option. Newer versions of cppcheck complain about this assignment being an always true condition, but in any case it's both clearer and slightly shorter to use separate if branches for the two cases and set the forwarding parameter more directly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Make many pointers const	David Gibson	2023-10-04	18	-46/+50
\| \| \| \| \| \| \| \| \|	Newer versions of cppcheck (as of 2.12.0, at least) added a warning for pointers which could be declared to point at const data, but aren't. Based on that, make many pointers throughout the codebase const. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash: Use incremental rather than all-at-once siphash functions	David Gibson	2023-09-30	5	-145/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have a bunch of variants of the siphash functions for different data sizes. The callers, in tcp.c, need to pack the various values they want to hash into a temporary structure, then call the appropriate version. We can avoid the copy into the temporary by directly using the incremental siphash functions. The length specific hash functions also have an undocumented constraint that the data pointer they take must, in fact, be aligned to avoid unaligned accesses, which may cause crashes on some architectures. So, prefer the incremental approach and remove the length-specific functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash, checksum: Move TBAA explanation to checksum.c	David Gibson	2023-09-30	2	-19/+19
\| \| \| \| \| \| \| \| \| \| \| \|	A number of checksum and hash functions require workarounds for the odd behaviour of Type-Baased Alias Analysis. We have a detailed comment about this on siphash_8b() and other functions reference that. Move the main comment to csume_16b() instead, because we're going to reorganise things in siphash.c. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash: Make internal helpers public	David Gibson	2023-09-30	2	-106/+111
\| \| \| \| \| \| \| \| \|	Move a bunch of code from siphash.c to siphash.h, making it available to other modules. This will allow places which need hashes of more complex objects to construct them incrementally. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash: Use specific structure for internal state	David Gibson	2023-09-30	1	-38/+42
\| \| \| \| \| \| \| \|	To improve type safety, encapsulate the internal state of the SipHash algorithm into a dedicated structure type. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash: Use more hygienic state initialiser	David Gibson	2023-09-30	1	-17/+12
\| \| \| \| \| \| \| \| \| \|	The PREAMBLE macro sets up the SipHash initial internal state. It also defines that state as a variable, which isn't macro hygeinic. With previous changes simplifying this premable, it's now possible to replace it with a macro which simply expands to a C initialisedrfor that state. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash: Fix bug in state initialisation	David Gibson	2023-09-30	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	The SipHash algorithm starts with initializing the 32 bytes of internal state with some magic numbers XORed with the hash key. However, our implementation has a bug - rather than XORing the hash key, it sets the initial state to copies of the key. I don't know if that affects any of the cryptographic properties of SipHash but it's not what we should be doing. Fix it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash: Clean up hash finalisation with posthash_final() function	David Gibson	2023-09-30	1	-30/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The POSTAMBLE macro implements the finalisation steps of SipHash. It relies on some variables in the environment, including returning the final hash value that way. That isn't great hygeine. In addition the PREAMBLE macro takes a length parameter which is used only to initialize the 'b' value that's not used until the finalisation and is also sometimes modified in a non-obvious way by the callers. The 'b' value is always composed from the total length of the hash input plus up to 7 bytes of "tail" data - that is the remainder of the hash input after a multiple of 8 bytes has been consumed. Simplify all this by replacing the POSTAMBLE macro with a siphash_final() function which takes the length and tail data as parameters and returns the final hash value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash: Add siphash_feed() helper	David Gibson	2023-09-30	1	-31/+21
\| \| \| \| \| \| \| \| \| \| \| \| \|	We have macros or inlines for a number of common operations in the siphash functions. However, in a number of places we still open code feeding another 64-bits of data into the hash function: an xor, followed by 2 rounds of shuffling, followed by another xor. Implement an inline function for this, which results in somewhat shortened code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash: Make sip round calculations an inline function rather than macro	David Gibson	2023-09-30	1	-22/+29
\| \| \| \| \| \| \| \| \|	The SIPROUND(n) macro implements n rounds of SipHash shuffling. It relies on 'v' and '__i' variables being available in the context it's used in which isn't great hygeine. Replace it with an inline function instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash: Make siphash functions consistently return 64-bit results	David Gibson	2023-09-30	3	-16/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some of the siphas_*b() functions return 64-bit results, others 32-bit results, with no obvious pattern. siphash_32b() also appears to do this incorrectly - taking the 64-bit hash value and simply returning it truncated, rather than folding the two halves together. Since SipHash proper is defined to give a 64-bit hash, make all of them return 64-bit results. In the one caller which needs a 32-bit value, tcp_seq_init() do the fold down to 32-bits ourselves. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Consolidate and improve workarounds for clang-tidy issue 58992	David Gibson	2023-09-27	4	-13/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several workarounds for a clang-tidy bug where the checker doesn't recognize that a number of system calls write to - and therefore initialise - a socket address. We can't neatly use a suppression, because the bogus warning shows up some time after the actual system call, when we access a field of the socket address which clang-tidy erroneously thinks is uninitialised. Consolidate these workarounds into one place by using macros to implement wrappers around affected system calls which add a memset() of the sockaddr to silence clang-tidy. This removes the need for the individual memset() workarounds at the callers - and the somewhat longwinded explanatory comments. We can then use a #define to not include the hack in "real" builds, but only consider it for clang-tidy. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Avoid shadowing index(3)	David Gibson	2023-09-27	6	-35/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A classic gotcha of the standard C library is that its unwise to call any variable 'index' because it will shadow the standard string library function index(3). This can cause warnings from cppcheck amongst others, and it also means that if the variable is removed you tend to get confusing type errors (or sometimes nothing at all) instead of a nice simple "name is not defined" error. Strictly speaking this only occurs if <string.h> is included, but that is so common that as a rule it's best to just avoid it always. We have a number of places which hit this trap, so rename variables and parameters to avoid it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Always send an ACK segment once the handshake is completed	Stefano Brivio	2023-09-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The reporter is running a SMTP server behind pasta, and the client waits for the server's banner before sending any data. In turn, the server waits for our ACK after sending SYN,ACK, which never comes. If we use the ACK_IF_NEEDED indication to tcp_send_flag(), given that there's no pending data, we delay sending the ACK segment at the end of the three-way handshake until we have some data to send to the server. This was actually intended, as I thought we would lower the latency for new connections, but we can't assume that the client will start sending data first (SMTP is the typical example where this doesn't happen). And, trying out this patch with SSH (where the client starts sending data first), the reporter actually noticed we have a lower latency by forcing an ACK right away. Comparing a capture before the patch: 13:07:14.007704 IP 10.1.2.1.42056 > 10.1.2.140.1234: Flags [S], seq 1797034836, win 65535, options [mss 4096,nop,wscale 7], length 0 13:07:14.007769 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [S.], seq 2297052481, ack 1797034837, win 65480, options [mss 65480,nop,wscale 7], length 0 13:07:14.008462 IP 10.1.2.1.42056 > 10.1.2.140.1234: Flags [.], seq 1:22, ack 1, win 65535, length 21 13:07:14.008496 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [.], ack 22, win 512, length 0 13:07:14.011799 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [P.], seq 1:515, ack 22, win 512, length 514 and after: 13:10:26.165364 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [S], seq 4165939595, win 65535, options [mss 4096,nop,wscale 7], length 0 13:10:26.165391 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [S.], seq 985607380, ack 4165939596, win 65480, options [mss 65480,nop,wscale 7], length 0 13:10:26.165418 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [.], ack 1, win 512, length 0 13:10:26.165683 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [.], seq 1:22, ack 1, win 512, length 21 13:10:26.165698 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [.], ack 22, win 512, length 0 13:10:26.167107 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [P.], seq 1:515, ack 22, win 512, length 514 the latency between the initial SYN segment and the first data transmission actually decreases from 792µs to 334µs. This is not statistically relevant as we have a single measurement, but it can't be that bad, either. Reported-by: cr3bs (from IRC) Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	dhcp: Actually note down the length of options received by the client	Stefano Brivio	2023-09-27	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It turns out we never used 'clen' until commit 1f24d3efb499 ("dhcp: support BOOTP clients"), and we always ignored option 55 (Parameter Request List), while, according to RFC 2132, we MUST try to insert the requested options in the order requested by the client. The commit mentioned above made this visible because now every client is reported as sending a DHCPREQUEST as an old BOOTP client, based on the lack of option 53 (that is, zero length). Fixes: b439984641ed ("merd: ARP and DHCP handlers, connection tracking fixes") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	dhcpv6: Properly separate domain names in search list	Stefano Brivio	2023-09-27	1	-7/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To prepare the DHCPv6 domain search list option, we go over the flattened list of domains, and replace both dots and zero bytes with a counter of bytes in the next label, implementing the encoding specified by section 3.1 of RFC 1035. If there are multiple domains in the list, however, zero bytes serve as markers for the end of a domain name, and we'll replace them with the length of the first label of the next domain, plus one. This is wrong. We should only convert the dots before the labels. To distinguish between label separators and domain names separators, for simplicity, introduce a dot before the first label of every domain we copy to form the list. All dots are then replaced by label lengths, and separators (zero bytes) remain as they are. As we do this, we need to make sure we don't replace the trailing dot, if present: that's already a separator. Skip copying it, and just add separators as needed. Now that we don't copy those, though, we might end up with zero-length domains: skip them, as they're meaningless anyway. And as we might skip domains, we can't use the index 'i' to check if we're at the beginning of the option -- use 'srch' instead. This is very similar to how we prepare the list for NDP option 31, except that we don't need padding (RFC 8106, 5.2) here, and we should refactor this into common functions, but it probably makes sense to rework the NDP responder (https://bugs.passt.top/show_bug.cgi?id=21) first. Reported-by: Sebastian Mitterle <smitterl@redhat.com> Link: https://bugs.passt.top/show_bug.cgi?id=75 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Fix licensing information display in --version2023_09_08.05627dc	Stefano Brivio	2023-09-08	1	-2/+2
\| \| \| \| \| \| \|	The regular expression I used when relicensing to GPLv2+ missed this. Fixes: ca2749e1bd52 ("passt: Relicense to GPL 2.0, or any later version") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Correct handling of FIN,ACK followed by SYN	David Gibson	2023-09-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the guest tries to establish a connection, it could give up on it by sending a FIN,ACK instead of a plain ACK to our SYN,ACK. It could then make a new attempt to establish a connection with the same addresses and ports with a new SYN. Although it's unlikely, it could send the 2nd SYN very shortly after the FIN,ACK resulting in both being received in the same batch of packets from the tap interface. Currently, we don't handle that correctly, when we receive a FIN,ACK on a not fully established connection we discard the remaining packets in the batch, and so will never process the 2nd SYN. Correct this by returning 1 from tcp_tap_handler() in this case, so we'll just consume the FIN,ACK and continue to process the rest of the batch. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Consolidate paths where we initiate reset on tap interface	David Gibson	2023-09-08	1	-22/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are a number of conditions where we will issue a TCP RST in response to something unexpected we received from the tap interface. These occur in both tcp_data_from_tap() and tcp_tap_handler(). In tcp_tap_handler() use a 'goto out of line' technique to consolidate all these paths into one place. For the tcp_data_from_tap() cases use a negative return code and direct that to the same path in tcp_tap_handler(), its caller. In this case we want to discard all remaining packets in the batch we have received: even if they're otherwise good, they'll be invalidated when the guest receives the RST we're sending. This is subtly different from the case where we receive an RST, where we could in theory get a new SYN immediately afterwards. Clarify that with a common on the now common reset path. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Correctly handle RST followed rapidly by SYN	David Gibson	2023-09-08	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Although it's unlikely in practice, the guest could theoretically reset one TCP connection then immediately start a new one with the same addressses and ports, such that we get an RST then a SYN in the same batch of received packets in tcp_tap_handler(). We don't correctly handle that unlikely case, because when we receive the RST, we discard any remaining packets in the batch so we'd never see the SYN. This could happen in either tcp_tap_handler() or tcp_data_from_tap(). Correct that by returning 1, so that the caller will continue calling tcp_tap_handler() on subsequent packets allowing us to process any subsequent SYN. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Return consumed packet count from tcp_data_from_tap()	David Gibson	2023-09-08	1	-10/+15
\| \| \| \| \| \| \| \| \| \|	Currently tcp_data_from_tap() is assumed to consume all packets remaining in the packet pool it is given. However there are some edge cases where that's not correct. In preparation for fixing those, change it to return a count of packets consumed and use that in its caller. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Never hash match closed connections	David Gibson	2023-09-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	>From a practical point of view, when a TCP connection ends, whether by FIN or by RST, we set the CLOSED event, then some time later we remove the connection from the hash table and clean it up. However, from a protocol point of view, once it's closed, it's gone, and any new packets with matching addresses and ports are either forming a new connection, or are invalid packets to discard. Enforce these semantics in the TCP hash logic by never hash matching closed connections. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove some redundant packet_get() operations	David Gibson	2023-09-08	1	-10/+4
\| \| \| \| \| \| \| \| \| \| \| \|	Both tcp_data_from_tap() and tcp_tap_handler() call packet_get() to get the entire L4 packet length, then immediately call it again to check that the packet is long enough to include a TCP header. The features of packet_get() let us easily combine these together, we just need to adjust the length slightly, because we want the value to include the TCP header length. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp, tap: Correctly advance through packets in udp_tap_handler()	David Gibson	2023-09-08	3	-20/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In both tap4_handler() and tap6_handler(), once we've sorted incoming l3 packets into "sequences", we then step through all the packets in each DUP sequence calling udp_tap_handler(). Or so it appears. In fact, udp_tap_handler() doesn't take an index and always starts with packet 0 of the sequence, even if called repeatedly. It appears to be written with the idea that the struct pool is a queue, from which it consumes packets as it processes them, but that's not how the pool data structure works. Correct this by adding an index parameter to udp_tap_handler() and altering the loops in tap.c to step through the pool properly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tap: Correctly advance through packets in tcp_tap_handler()	David Gibson	2023-09-08	3	-22/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In both tap4_handler() and tap6_handler(), once we've sorted incoming l3 packets into "sequences", we then step through all the packets in each TCP sequence calling tcp_tap_handler(). Or so it appears. In fact, tcp_tap_handler() doesn't take an index and always looks at packet 0 of the sequence, except when it calls tcp_data_from_tap() to process data packets. It appears to be written with the idea that the struct pool is a queue, from which it consumes packets as it processes them, but that's not how the pool data structure works - they are more like an array of packets. We only get away with this, because setup packets for TCP tend to come in separate batches (because we need to reply in between) and so we only get a bunch of packets for the same connection together when they're data packets (tcp_data_from_tap() has its own loop through packets). Correct this by adding an index parameter to tcp_tap_handler() and altering the loops in tap.c to step through the pool properly. Link: https://bugs.passt.top/show_bug.cgi?id=68 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Add Podman system test with bats for pasta2023_09_07.ee58f37	Stefano Brivio	2023-09-07	3	-2/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ugly as hell, but we keep breaking things otherwise, and I keep forgetting to run this manually (as long as it's based on my local Podman setup, that's the only alternative). We need to clone the Podman repository as distribution packages don't contain test scripts, typically. While at it, build the latest version which is what really matters. As we're planning anyway to revamp the test framework, I'd be inclined to just add this without too many thoughts, and have it as a nice-to-have requirement reminder for the new framework. Link: https://github.com/containers/podman/pull/19699 Suggested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	dhcp: support BOOTP clients	Stas Sergeev	2023-09-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	BOOTP clients do not use tagged messages for requests. As such, any message without the DHCP option 53, should be considered a BOOTP request. Link: https://bugs.passt.top/show_bug.cgi?id=72 Signed-off-by: Stas Sergeev <stsp2@yandex.ru> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: fix uses of l3_len in tap4_handler()	Stas Sergeev	2023-09-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	l3_len was calculated from the ethernet frame size, and it was assumed to be equal to the length stored in an IP packet. But if the ethernet frame is padded, then l3_len calculated that way can only be used as a bound check to validate the length stored in an IP header. It should not be used for calculating the l4_len. This patch makes sure the small padded ethernet frames are properly processed, by trusting the length stored in an IP header. Link: https://bugs.passt.top/show_bug.cgi?id=73 Signed-off-by: Stas Sergeev <stsp2@yandex.ru> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fedora: Replace pasta hard links by separate builds	Stefano Brivio	2023-09-07	1	-6/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The hard link trick didn't actually fix the issue with SELinux file contexts properly: as opposed to symbolic links, SELinux now correctly associates types to the labels that are set -- except that those labels are now shared, so we can end up (depending on how rpm(8) extracts the archives) with /usr/bin/passt having a pasta_exec_t context. This got rather confusing as running restorecon(8) seemed to fix up labels -- but that's simply toggling between passt_exec_t and pasta_exec_t for both links, because each invocation will just "fix" the file with the mismatching context. Replace the hard links with two separate builds of the binary, as suggested by David. The build is reproducible, so we pass "-pasta" in the VERSION for pasta's build. This is wasteful but better than the alternative. Just copying the binary over would otherwise cause issues with debuginfo packages due to duplicate Build-IDs -- and rpmbuild(8) also warns about them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: Add pasta's own profile	Stefano Brivio	2023-09-07	3	-10/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If pasta and pasta.avx2 are hard links to passt and passt.avx2, AppArmor will attach their own profiles on execution, and we can restrict passt's profile to what it actually needs. Note that pasta needs to access all the resources that passt needs, so the pasta abstraction still includes passt's one. I plan to push the adaptation required for the Debian package in commit 5bb812e79143 ("debian/rules: Override pasta symbolic links with hard links"), on Salsa. If other distributions need to support AppArmor profiles they can follow a similar approach. The profile itself will be installed, there, via dh_apparmor, in a separate commit, b52557fedcb1 ("debian/rules: Install new pasta profile using dh_apparmor"). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>