passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	tcp: unify l2 TCPv4 and TCPv6 queues and structuresHEAD master	Jon Maloy	16 hours	1	-5/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Following the preparations in the previous commit, we can now remove the payload and flag queues dedicated for TCPv6 and TCPv4 and move all traffic into common queues handling both protocol types. Apart from reducing code and memory footprint, this change reduces a potential risk for TCPv4 traffic starving out TCPv6 traffic. Since we always flush out the TCPv4 frame queue before the TCPv6 queue, the latter will never be handled if the former fails to send all its frames. Tests with iperf3 shows no measurable change in performance after this change. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Use runtime tests for TCP_INFO fields	David Gibson	5 days	1	-26/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In order to use particular fields from the TCP_INFO getsockopt() we need them to be in structure returned by the runtime kernel. We attempt to determine that with the HAS_BYTES_ACKED and HAS_MIN_RTT defines, probed in the Makefile. However, that's not correct, because the kernel headers we compile against may not be the same as the runtime kernel. We instead should check against the size of structure returned from the TCP_INFO getsockopt() as we already do for tcpi_snd_wnd. Switch from the compile time flags to a runtime test. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Generalise probing for tcpi_snd_wnd field	David Gibson	5 days	1	-15/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In order to use the tcpi_snd_wnd field from the TCP_INFO getsockopt() we need the field to be supported in the runtime kernel (snd_wnd_cap). In fact we should check that for for every tcp_info field we want to use, beyond the very old ones shared with BSD. Prepare to do that, by generalising the probing from setting a single bool to instead record the size of the returned TCP_INFO structure. We can then use that recorded value to check for the presence of any field we need. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove compile-time dependency on struct tcp_info version	David Gibson	5 days	1	-21/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the Makefile we probe to create several defines based on the presence of particular fields in struct tcp_info. These defines are used for two purposes, neither of which they accomplish well: 1) Determining if the tcp_info fields are available at runtime. For this purpose the defines are Just Plain Wrong, since the runtime kernel may not be the same as the compile time kernel. We corrected this for tcp_snd_wnd, but not for tcpi_bytes_acked or tcpi_min_rtt 2) Allowing the source to compile against older kernel headers which don't have the fields in question. This works in theory, but it does mean we won't be able to use the fields, even if later run against a newer kernel. Furthermore, it's quite fragile: without much more thorough tests of builds in different environments that we're currently set up for, it's very easy to miss cases where we're accessing a field without protection from an #ifdef. For example we currently access tcpi_snd_wnd without #ifdefs in tcp_update_seqack_wnd(). Improve this with a different approach, borrowed from qemu (which has many instances of similar problems). Don't compile against linux/tcp.h, using netinet/tcp.h instead. Then for when we need an extension field, define a struct tcp_info_linux, copied from the kernel, with all the fields we're interested in. That may need updating from future kernel versions, but only when we want to use a new extension, so it shouldn't be frequent. This allows us to remove the HAS_SND_WND define entirely. We keep HAS_BYTES_ACKED and HAS_MIN_RTT now, since they're used for purpose (1), we'll fix that in a later patch. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Trivial grammar fixes in comments] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Use structures to construct initial TCP options	David Gibson	8 days	1	-14/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	As a rule, we prefer constructing packets with matching C structures, rather than building them byte by byte. However, one case we still build byte by byte is the TCP options we include in SYN packets (in fact the only time we generate TCP options on the tap interface). Rework this to use a structure and initialisers which make it a bit clearer what's going on. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by; Stefano Brivio <sbrivio@redhat.com>
*	tcp: Send "empty" handshake ACK before first data segment	Stefano Brivio	2024-10-15	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Starting from commit 9178a9e3462d ("tcp: Always send an ACK segment once the handshake is completed"), we always send an ACK segment, without any payload, to complete the three-way handshake while establishing a connection started from a socket. We queue that segment after checking if we already have data to send to the tap, which means that its sequence number is higher than any segment with data we're sending in the same iteration, if any data is available on the socket. However, in tcp_defer_handler(), we first flush "flags" buffers, that is, we send out segments without any data first, and then segments with data, which means that our "empty" ACK is sent before the ACK segment with data (if any), which has a lower sequence number. This appears to be harmless as the guest or container will generally reorder segments, but it looks rather weird and we can't exclude it's actually causing problems. Queue the empty ACK first, so that it gets a lower sequence number, before checking for any data from the socket. Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Update TCP checksum using an iovec array	Laurent Vivier	2024-10-04	1	-18/+100
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP header and payload are supposed to be in the same buffer, and tcp_update_check_tcp4()/tcp_update_check_tcp6() compute the checksum from the base address of the header using the length of the IP payload. In the future (for vhost-user) we need to dispatch the TCP header and the TCP payload through several buffers. To be able to manage that, we provide an iovec array that points to the data of the TCP frame. We provide also an offset to be able to provide an array that contains the TCP frame embedded in an lower level frame, and this offset points to the TCP header inside the iovec array. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Use tcp_payload_t rather than tcphdr	Laurent Vivier	2024-10-04	1	-20/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As tcp_update_check_tcp4() and tcp_update_check_tcp6() compute the checksum using the TCP header and the TCP payload, it is clearer to use a pointer to tcp_payload_t that includes tcphdr and payload rather than a pointer to tcphdr (and guessing TCP header is followed by the payload). Move tcp_payload_t and tcp_flags_t to tcp_internal.h. (They will be used also by vhost-user). Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Make {tcp,udp}_sock_init() take an inany address	David Gibson	2024-09-25	1	-29/+18
\| \| \| \| \| \| \| \| \|	tcp_sock_init() and udp_sock_init() take an address to bind to as an address family and void * pair. Use an inany instead. Formerly AF_UNSPEC was used to indicate that we want to listen on both 0.0.0.0 and ::, now use a NULL inany to indicate that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	util, pif: Replace sock_l4() with pif_sock_l4()	David Gibson	2024-09-25	1	-5/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The sock_l4() function is very convenient for creating sockets bound to a given address, but its interface has some problems. Most importantly, the address and port alone aren't enough in some cases. For link-local addresses (at least) we also need the pif in order to properly construct a socket adddress. This case doesn't yet arise, but it might cause us trouble in future. Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it should attempt to create a "dual stack" socket which will respond to both IPv4 and IPv6 traffic. This only makes sense if there is no specific address given. We verify this at runtime, but it would be nicer if we could enforce it structurally. For sockets associated specifically with a single flow we already replaced sock_l4() with flowside_sock_l4() which avoids those problems. Now, replace all the remaining users with a new pif_sock_l4() which also takes an explicit pif. The new function takes the address as an inany *, with NULL indicating the dual stack case. This does add some complexity in some of the callers, however future planned cleanups should make this go away again. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Allow checksum to be disabled	Laurent Vivier	2024-09-18	1	-21/+31
\| \| \| \| \| \| \| \| \|	We can need not to set TCP checksum. Add a parameter to tcp_fill_headers4() and tcp_fill_headers6() to disable it. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean	David Gibson	2024-09-18	1	-3/+3
\| \| \| \| \| \| \| \|	This parameter is already treated as a boolean internally. Make it a 'bool' type for clarity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Simplify ifdef logic in tcp_update_seqack_wnd()	David Gibson	2024-09-18	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \|	This function has a block conditional on !snd_wnd_cap shortly before an snd_wnd_cap is statically false). Therefore, simplify this down to a single conditional with an else branch. While we're there, fix some improperly indented closing braces. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Clean up tcpi_snd_wnd probing	David Gibson	2024-09-18	1	-26/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When available, we want to retrieve our socket peer's advertised window and forward that to the guest. That information has been available from the kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035. Currently our probing for this is a bit odd. The HAS_SND_WND define determines if our headers include the tcp_snd_wnd field, but that doesn't necessarily mean the running kernel supports it. Currently we start by assuming it's _not_ available, but mark it as available if we ever see a non-zero value in the field. This is a bit hit and miss in two ways: * Zero is perfectly possible window the peer could report, so we can get false negatives * We're reading TCP_INFO into a local variable, which might not be zero initialised, so if the kernel _doesn't_ write it it could have non-zero garbage, giving us false positives. We can use a more direct way of probing for this: getsockopt() reports the length of the information retreived. So, check whether that's long enough to include the field. This lets us probe the availability of the field once and for all during initialisation. That in turn allows ctx to become a const pointer to tcp_prepare_flags() which cascades through many other functions. We also move the flag for the probe result from the ctx structure to a global, to match peek_offset_cap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Make some extra functions private	David Gibson	2024-09-18	1	-2/+2
\| \| \| \| \| \| \| \|	tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c, and have no prototype in a header. Make them static. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Use EPOLLET for any state of not established connections	Stefano Brivio	2024-09-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, for not established connections, we monitor sockets with edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state (outbound connection being established) but not in the TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK to the container/guest). While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted another possibility for a short EPOLLRDHUP storm (10 seconds), which doesn't seem to happen in actual use cases, but I could reproduce it: start a connection from a container, while dropping (using netfilter) ACK segments coming out of the container itself. On the server side, outside the container, accept the connection and shutdown the writing side of it immediately. At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't have any reasonable way to handle it other than waiting for the tap side to complete the three-way handshake. So we'll just keep getting this EPOLLRDHUP until the SYN_TIMEOUT kicks in. Always enable EPOLLET when EPOLLRDHUP is the only epoll event we subscribe to: in this case, getting multiple EPOLLRDHUP reports is totally useless. In the only remaining non-established state, SOCK_ACCEPTED, for inbound connections, we're anyway discarding EPOLLRDHUP events until we established the conection, because we don't know what to do with them until we get an answer from the tap side, so it's safe to enable EPOLLET also in that case. Link: https://bugs.passt.top/show_bug.cgi?id=94 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Allow timerfd_gettime64() and recvmmsg_time64() on arm (armhf)	Stefano Brivio	2024-08-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	These system calls are needed after the conversion of time_t to 64-bit types on 32-bit architectures. Tested by running some transfer tests with passt and pasta on Debian Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l. Suggested-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Allow additional system calls for i386/i686	Stefano Brivio	2024-08-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I haven't tested i386 for a long time (after playing with some openSUSE i586 image a couple of years ago). It turns out that a number of system calls we actually need were denied by the seccomp filter, and not even basic functionality works. Add some system calls that glibc started using with the 64-bit time ("t64") transition, see also: https://wiki.debian.org/ReleaseGoals/64bit-time that is: clock_gettime64, timerfd_gettime64, fcntl64, and recvmmsg_time64. Add further system calls that are needed regardless of time_t width, that is, mmap2 (valgrind profile only), _llseek and sigreturn (common outside x86_64), and socketcall (same as s390x). I validated this against an almost full run of the test suite, with just a few selected tests skipped. Fixes needed to run most tests on i386/i686, and other assorted fixes for tests, are included in upcoming patches. Reported-by: Uroš Knupleš <uros@knuples.net> Analysed-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Use "our address" instead of "forwarding address"	David Gibson	2024-08-21	1	-16/+17
\| \| \| \| \| \| \| \| \| \| \| \|	The term "forwarding address" to indicate the local-to-passt address was well-intentioned, but ends up being kinda confusing. As discussed on a recent call, let's try "our" instead. (While we're there correct an error in flow_initiate_af()s comments where we referred to parameters by the wrong name). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Change SO_PEEK_OFF support message to debug()	Stefano Brivio	2024-07-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This: $ ./pasta SO_PEEK_OFF not supported # is a bit annoying, and might trick users who face other issues into thinking that SO_PEEK_OFF not being supported on a given kernel is an actual issue. Even if SO_PEEK_OFF is supported by the kernel, that would be the only message displayed there, with default options, which looks a bit out of context. Switch that to debug(): now that Podman users can pass --debug too, we can find out quickly if it's supported or not, if SO_PEEK_OFF usage is suspected of causing any issue. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: probe for SO_PEEK_OFF both in tcpv4 and tcp6	Jon Maloy	2024-07-23	1	-12/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Based on an original patch by Jon Maloy: -- The recently added socket option SO_PEEK_OFF is not supported for TCP/IPv6 sockets. Until we get that support into the kernel we need to test for support in both protocols to set the global 'peek_offset_cap´ to true. -- Compared to the original patch: - only check for SO_PEEK_OFF support for enabled IP versions - use sa_family_t instead of int to pass the address family around Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	flow, tcp: Flow based NAT and port forwarding for TCP	David Gibson	2024-07-19	1	-75/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently the code to translate host side addresses and ports to guest side addresses and ports, and vice versa, is scattered across the TCP code. This includes both port redirection as controlled by the -t and -T options, and our special case NAT controlled by the --no-map-gw option. Gather this logic into fwd_nat_from_*() functions for each input interface in fwd.c which take protocol and address information for the initiating side and generates the pif and address information for the forwarded side. This performs any NAT or port forwarding needed. We create a flow_target() helper which applies those forwarding functions as needed to automatically move a flow from INI to TGT state. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Re-use flow hash for initial sequence number generation	David Gibson	2024-07-19	1	-22/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We generate TCP initial sequence numbers, when we need them, from a hash of the source and destination addresses and ports, plus a timestamp. Moments later, we generate another hash of the same information plus some more to insert the connection into the flow hash table. With some tweaks to the flow_hash_insert() interface and changing the order we can re-use that hash table hash for the initial sequence number, rather than calculating another one. It won't generate identical results, but that doesn't matter as long as the sequence numbers are well scattered. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Generalise TCP hash table to general flow hash table	David Gibson	2024-07-19	1	-133/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the data structures and helper functions for the TCP hash table to flow.c, making it a general hash table indexing sides of flows. This is largely code motion and straightforward renames. There are two semantic changes: * flow_lookup_af() now needs to verify that the entry has a matching protocol and interface as well as matching addresses and ports. * We double the size of the hash table, because it's now at least theoretically possible for both sides of each flow to be hashed. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, flow: Replace TCP specific hash function with general flow hash	David Gibson	2024-07-19	1	-50/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP. Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the pif and the L4 protocol number. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Simplify endpoint validation using flowside information	David Gibson	2024-07-19	1	-54/+18
\| \| \| \| \| \| \| \|	Now that we store all our endpoints in the flowside structure, use some inany helpers to make validation of those endpoints simpler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Manage outbound address via flow table	David Gibson	2024-07-19	1	-43/+50
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For now when we forward a connection to the host we leave the host side forwarding address and port blank since we don't necessarily know what source address and port will be used by the kernel. When the outbound address option is active, though, we do know the address at least, so we can record it in the flowside. Having done that, use it as the primary source of truth, binding the outgoing socket based on the information in there. This allows the possibility of more complex rules for what outbound address and/or port we use in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Obtain guest address from flowside	David Gibson	2024-07-19	1	-28/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we always deliver inbound TCP packets to the guest's most recent observed IP address. This has the odd side effect that if the guest changes its IP address with active TCP connections we might deliver packets from old connections to the new address. That won't work; it will probably result in an RST from the guest. Worse, if the guest added a new address but also retains the old one, then we could break those old connections by redirecting them to the new address. Now that we maintain flowside information, we have a record of the correct guest side address and can just use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, flow: Remove redundant information, repack connection structures	David Gibson	2024-07-19	1	-25/+27
\| \| \| \| \| \| \| \| \| \| \|	Some information we explicitly store in the TCP connection is now duplicated in the common flow structure. Access it from there instead, and remove it from the TCP specific structure. With that done we can reorder both the "tap" and "splice" TCP structures a bit to get better packing for the new combined flow table entries. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Common address information for target side	David Gibson	2024-07-19	1	-38/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Common address information for initiating side	David Gibson	2024-07-19	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides. For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, icmp, tcp: Clean up helpers for getting flow from index	David Gibson	2024-07-17	1	-6/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP (both regular and spliced) and ICMP both have macros to retrieve the relevant protcol specific flow structure from a flow index. In most cases what we actually want is to get the specific flow from a sidx. Replace those simple macros with a more precise inline, which also asserts that the flow is of the type we expect. While we're they're also add a pif_at_sidx() helper to get the interface of a specific flow & side, which is useful in some places. Finally, fix some minor style issues in the comments on some of the existing sidx related helpers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp, tcp: Tweak handling of no_udp and no_tcp flags	David Gibson	2024-07-17	1	-3/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We abort the UDP socket handler if the no_udp flag is set. But if UDP was disabled we should never have had a UDP socket to trigger the handler in the first place. If we somehow did, ignoring it here isn't really going to help because aborting without doing anything is likely to lead to an epoll loop. The same is the case for the TCP socket and timer handlers and the no_tcp flag. Change these checks on the flag to ASSERT()s. Similarly add ASSERT()s to several other entry points to the protocol specific code which should never be called if the protocol is disabled. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: handle shrunk window advertisements from guest	Jon Maloy	2024-07-15	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A bug in kernel TCP may lead to a deadlock where a zero window is sent from the guest peer, while it is unable to send out window updates even after socket reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisements from the peer can only be triggered by data packets arriving from this side. However, currently such packets are never sent, because the zero-window condition prevents this side from sending out any packets whatsoever to the peer. We notice that the above bug is triggered only after the peer has dropped one or more arriving packets because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that the implementation goes against the RFC-9293 recommendation that a previously advertised window never should shrink. RFC-9293 seems to permit that we can continue sending up to the right edge of the last advertised non-zero window in such situations, so that is what we do to resolve this situation. It turns out that this solution is extremely simple to implememt in the code: We just omit to save the advertised zero-window when we see that it has shrunk, i.e., if the acknowledged sequence number in the advertisement message is lower than that of the last data byte sent from our side. When that is the case, the following happens: - The 'retr' flag in tcp_data_from_tap() will be 'false', so no retransmission will occur at this occasion. - The data stream will soon reach the right edge of the previously advertised window. In fact, in all observed cases we have seen that it is already there when the zero-advertisement arrives. - At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set, unless they already have been, meaning that only the next timer expiration will open for data retransmission or transmission. - When that happens, the memory squeeze at the guest will normally have abated, and the data flow can resume. It should be noted that although this solves the problem we have at hand, it is a work-around, and not a genuine solution to the described kernel bug. Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor fix in commit title and commit reference in comment to workaround Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: leverage support of SO_PEEK_OFF socket option when available	Jon Maloy	2024-07-15	1	-3/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	>From linux-6.9.0 the kernel will contain commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option"). This new feature makes is possible to call recv_msg(MSG_PEEK) and make it start reading data from a given offset set by the SO_PEEK_OFF socket option. This way, we can avoid repeated reading of already read bytes of a received message, hence saving read cycles when forwarding TCP messages in the host->name space direction. In this commit, we add functionality to leverage this feature when available, while we fall back to the previous behavior when not. Measurements with iperf3 shows that throughput increases with 15-20 percent in the host->namespace direction when this feature is used. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Add flow_sidx_valid() helper	David Gibson	2024-07-05	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To implement the TCP hash table, we need an invalid (NULL-like) value for flow_sidx_t. We use FLOW_SIDX_NONE for that, but for defensiveness, we treat (usually) anything with an out of bounds flow index the same way. That's not always done consistently though. In flow_at_sidx() we open code a check on the flow index. In tcp_hash_probe() we instead compare against FLOW_SIDX_NONE, and in some other places we use the fact that flow_at_sidx() will return NULL in this case, even if we don't otherwise need the flow it returns. Clean this up a bit, by adding an explicit flow_sidx_valid() test function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: sock_l4() determine protocol from epoll type rather than the reverse	David Gibson	2024-07-05	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \|	sock_l4() creates a socket of the given IP protocol number, and adds it to the epoll state. Currently it determines the correct tag for the epoll data based on the protocol. However, we have some future cases where we might want different semantics, and therefore epoll types, for sockets of the same protocol. So, change sock_l4() to take the epoll type as an explicit parameter, and determine the protocol from that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Replace strerror() calls	Stefano Brivio	2024-06-21	1	-16/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we have logging functions embedding perror() functionality, we can make _some_ calls more terse by using them. In many places, the strerror() calls are still more convenient because, for example, they are used in flow debugging functions, or because the return code variable of interest is not 'errno'. While at it, convert a few error messages from a scant perror style to proper failure descriptions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Don't rely on bind() to fail to decide that connection target is valid	Stefano Brivio	2024-06-19	1	-17/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit e1a2e2780c91 ("tcp: Check if connection is local or low RTT was seen before using large MSS") added a call to bind() before we issue a connect() to the target for an outbound connection. If bind() fails, but neither with EADDRNOTAVAIL, nor with EACCESS, we can conclude that the target address is a local (host) address, and we can use an unlimited MSS. While at it, according to the reasoning of that commit, if bind() succeeds, we would know right away that nobody is listening at that (local) address and port, and we don't even need to call connect(): we can just fail early and reset the connection attempt. But if non-local binds are enabled via net.ipv4.ip_nonlocal_bind or net.ipv6.ip_nonlocal_bind sysctl, binding to a non-local address will actually succeed, so we can't rely on it to fail in general. The visible issue with the existing behaviour is that we would reset any outbound connection to non-local addresses, if non-local binds are enabled. Keep the significant optimisation for local addresses along with the bind() call, but if it succeeds, don't draw any conclusion: close the socket, grab another one, and proceed normally. This will incur a small latency penalty if non-local binds are enabled (we'll likely fetch an existing socket from the pool but additionally call close()), or if the target is local but not bound: we'll need to call connect() and get a failure before relaying that failure back. Link: https://github.com/containers/podman/issues/23003 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: move buffers management functions to their own file	Laurent Vivier	2024-06-13	1	-548/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move all the TCP parts using internal buffers to tcp_buf.c and keep generic TCP management functions in tcp.c. Add tcp_internal.h to export needed functions from tcp.c and tcp_buf.h from tcp_buf.c With this change we can use existing TCP functions with a different kind of memory storage as for instance the shared memory provided by the guest via vhost-user. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: extract buffer management from tcp_send_flag()	Laurent Vivier	2024-06-13	1	-24/+54
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit isolates the internal data structure management used for storing data (e.g., tcp4_l2_flags_iov[], tcp6_l2_flags_iov[], tcp4_flags_ip[], tcp4_flags[], ...) from the tcp_send_flag() function. The extracted functionality is relocated to a new function named tcp_fill_flag_header(). tcp_fill_flag_header() is now a generic function that accepts parameters such as struct tcphdr and a data pointer. tcp_send_flag() utilizes this parameter to pass memory pointers from tcp4_l2_flags_iov[] and tcp6_l2_flags_iov[]. This separation sets the stage for utilizing tcp_prepare_flags() to set the memory provided by the guest via vhost-user in future developments. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Suppress constParameterCallback errors	David Gibson	2024-06-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several functions which are used as callbacks for NS_CALL() which only read their void * parameter, they don't write it. The constParameterCallback warning in cppcheck 2.14.1 complains that this parameter could be const void , also pointing out that that would require casting the function pointer when used as a callback. Casting the function pointers seems substantially uglier than using a non-const void as the parameter, especially since in each case we cast the void * to a const pointer of specific type immediately. So, suppress these errors. I think it would make logical sense to suppress this globally, but that would cause unmatchedSuppression errors on earlier cppcheck versions. So, instead individually suppress it, along with unmatchedSuppression in the relevant places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, flow: Fix some error paths which didn't clean up flows properly	David Gibson	2024-06-07	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Flow table entries need to be fully initialised before returning to the main epoll loop. Commit 0060acd1 ("flow: Clarify and enforce flow state transitions") now enforces that: once a flow is allocated we must either cancel it, or activate it before returning to the main loop, or we will hit an ASSERT(). Some error paths in tcp_conn_from_tap() weren't correctly updated for this requirement - we can exit with a flow entry incompletely initialised. Correct that by cancelling the flows in those situations. I don't have enough information to be certain if this is the cause for podman bug 22925, but it plausibly could be. Fixes: 0060acd11b19 ("flow: Clarify and enforce flow state transitions") Link: https://github.com/containers/podman/issues/22925 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	clang-tidy: Enable the bugprone-macro-parentheses check	David Gibson	2024-06-07	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We globally disabled this, with a justification lumped together with several checks about braces. They don't really go together, the others are essentially a stylistic choice which doesn't match our style. Omitting brackets on macro parameters can lead to real and hard to track down bugs if an expression is ever passed to the macro instead of a plain identifier. We've only gotten away with the macros which trigger the warning, because of other conventions its been unlikely to invoke them with anything other than a simple identifier. Fix the macros, and enable the warning for the future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Make pointer const in tcp_revert_seq	David Gibson	2024-06-07	1	-1/+1
\| \| \| \| \| \| \| \| \|	The th pointer could be const, which causes a cppcheck warning on at least some cppcheck versions (e.g. Cppcheck 2.13.0 in Fedora 40). Fixes: e84a01e94c9f ("tcp: move seq_to_tap update to when frame is queued") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: move seq_to_tap update to when frame is queued	Jon Maloy	2024-06-05	1	-22/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	commit a469fc393fa1 ("tcp, tap: Don't increase tap-side sequence counter for dropped frames") delayed update of conn->seq_to_tap until the moment the corresponding frame has been successfully pushed out. This has the advantage that we immediately can make a new attempt to transmit a frame after a failed trasnmit, rather than waiting for the peer to later discover a gap and trigger the fast retransmit mechanism to solve the problem. This approach has turned out to cause a problem with spurious sequence number updates during peer-initiated retransmits, and we have realized it may not be the best way to solve the above issue. We now restore the previous method, by updating the said field at the moment a frame is added to the outqueue. To retain the advantage of having a quick re-attempt based on local failure detection, we now scan through the part of the outqueue that had do be dropped, and restore the sequence counter for each affected connection to the most appropriate value. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove interim 'tapside' field from connection	David Gibson	2024-05-22	1	-6/+6
\| \| \| \| \| \| \| \| \| \|	We recently introduced this field to keep track of which side of a TCP flow is the guest/tap facing one. Now that we generically record which pif each side of each flow is connected to, we can easily derive that, and no longer need to keep track of it explicitly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Record the pifs for each side of each flow	David Gibson	2024-05-22	1	-1/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we have no generic information flows apart from the type and state, everything else is specific to the flow type. Start introducing generic flow information by recording the pifs which the flow connects. To keep track of what information is valid, introduce new flow states: INI for when the initiating side information is complete, and TGT for when both sides information is complete, but we haven't chosen the flow type yet. For now, these states don't do an awful lot, but they'll become more important as we add more generic information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make side 0 always be the initiating side	David Gibson	2024-05-22	1	-11/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Each flow in the flow table has two sides, 0 and 1, representing the two interfaces between which passt/pasta will forward data for that flow. Which side is which is currently up to the protocol specific code: TCP uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except for spliced connections where it uses 0 for the initiating side and 1 for the target side. ICMP also uses 0 for the host/"sock" side and 1 for the guest/"tap" side, but in its case the latter is always also the initiating side. Make this generically consistent by always using side 0 for the initiating side and 1 for the target side. This doesn't simplify a lot for now, and arguably makes TCP slightly more complex, since we add an extra field to the connection structure to record which is the guest facing side. This is an interim change, which we'll be able to remove later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>q Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Clarify and enforce flow state transitions	David Gibson	2024-05-22	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Flows move over several different states in their lifetime. The rules for these are documented in comments, but they're pretty complex and a number of the transitions are implicit, which makes this pretty fragile and error prone. Change the code to explicitly track the states in a field. Make all transitions explicit and logged. To the extent that it's practical in C, enforce what can and can't be done in various states with ASSERT()s. While we're at it, tweak the docs to clarify the restrictions on each state a bit. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>