passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings	David Gibson	14 days	1	-6/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cppcheck-2.15.0 has apparently broadened when it throws a warning about redundant initialization to include some cases where we have an initializer for some fields, but then set other fields in the function body. This is arguably a false positive: although we are technically overwriting the zero-initialization the compiler supplies for fields not explicitly initialized, this sort of construct makes sense when there are some fields we know at the top of the function where the initializer is, but others that require more complex calculation. That said, in the two places this shows up, it's pretty easy to work around. The results are arguably slightly clearer than what we had, since they move the parts of the initialization closer together. So do that rather than having ugly suppressions or dealing with the tedious process of reporting a cppcheck false positive. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Handle more error conditions in udp_sock_errs()	David Gibson	14 days	1	-1/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_errs() reads out everything in the socket error queue. However we've seen some cases[0] where an EPOLLERR event is active, but there isn't anything in the queue. One possibility is that the error is reported instead by the SO_ERROR sockopt. Check for that case and report it as best we can. If we still get an EPOLLERR without visible error, we have no way to clear the error state, so treat it as an unrecoverable error. [0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010 Link: https://bugs.passt.top/show_bug.cgi?id=95 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Treat errors getting errors as unrecoverable	David Gibson	14 days	1	-10/+17
\| \| \| \| \| \| \| \| \| \|	We can get network errors, usually transient, reported via the socket error queue. However, at least theoretically, we could get errors trying to read the queue itself. Since we have no idea how to clear an error condition in that case, treat it as unrecoverable. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split socket error handling out from udp_sock_recv()	David Gibson	14 days	1	-6/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently udp_sock_recv() both attempts to clear socket errors and read a batch of datagrams for forwarding. That made sense initially, since both listening and reply sockets need to do this. However, we have certain error cases which will add additional complexity to the error processing. Furthermore, if we ever wanted to more thoroughly handle errors received here - e.g. by synthesising ICMP messages on the tap device - it will likely require different handling for the listening and reply socket cases. So, split handling of error events into its own udp_sock_errs() function. While we're there, allow it to report "unrecoverable errors". We don't have any of these so far, but some cases we're working on might require it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use dual stack sockets for port forwarding when possible	David Gibson	2024-08-27	1	-0/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as well as native IPv6 connections. By doing this we halve the number of listening sockets we need (assuming passt/pasta is listening on the same ports for IPv4 and IPv6). When forwarding many ports (e.g. -u all) this can significantly reduce the amount of kernel memory that passt consumes. We've used such dual stack sockets for TCP since 8e914238b "tcp: Use dual stack sockets for port forwarding when possible". Add similar support for UDP "listening" sockets. Since UDP sockets don't use as much kernel memory as TCP sockets this isn't as big a saving, but it's still significant. When forwarding all TCP and UDP ports for both IPv4 & IPv6 (-t all -u all), this reduces kernel memory usage from ~522 MiB to ~380MiB (kernel version 6.10.6 on Fedora 40, x86_64). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove unnnecessary local from udp_sock_init()	David Gibson	2024-08-27	1	-15/+15
\| \| \| \| \| \| \| \|	The 's' variable is always redundant with either 'r4' or 'r6', so remove it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Merge udp[46]_mh_recv arrays	David Gibson	2024-08-27	1	-37/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We've already gotten rid of most of the IPv4/IPv6 specific data structures in udp.c by merging them with each other. One significant one remains: udp[46]_mh_recv. This was a bit awkward to remove because of a subtle interaction. We initialise the msg_namelen fields to represent the total size we have for a socket address, but when we receive into the arrays those are modified to the actual length of the sockaddr we received. That meant that naively merging the arrays meant that if we received IPv4 datagrams, then IPv6 datagrams, the addresses for the latter would be truncated. In this patch address that by resetting the received msg_namelen as soon as we've found a flow for the datagram. Finding the flow is the only thing that might use the actual sockaddr length, although we in fact don't need it for the time being. This also removes the last use of the 'v6' field from udp_listen_epoll_ref, so remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Allow timerfd_gettime64() and recvmmsg_time64() on arm (armhf)	Stefano Brivio	2024-08-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	These system calls are needed after the conversion of time_t to 64-bit types on 32-bit architectures. Tested by running some transfer tests with passt and pasta on Debian Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l. Suggested-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Allow additional system calls for i386/i686	Stefano Brivio	2024-08-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I haven't tested i386 for a long time (after playing with some openSUSE i586 image a couple of years ago). It turns out that a number of system calls we actually need were denied by the seccomp filter, and not even basic functionality works. Add some system calls that glibc started using with the 64-bit time ("t64") transition, see also: https://wiki.debian.org/ReleaseGoals/64bit-time that is: clock_gettime64, timerfd_gettime64, fcntl64, and recvmmsg_time64. Add further system calls that are needed regardless of time_t width, that is, mmap2 (valgrind profile only), _llseek and sigreturn (common outside x86_64), and socketcall (same as s390x). I validated this against an almost full run of the test suite, with just a few selected tests skipped. Fixes needed to run most tests on i386/i686, and other assorted fixes for tests, are included in upcoming patches. Reported-by: Uroš Knupleš <uros@knuples.net> Analysed-by: Faidon Liambotis <paravoid@debian.org> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Use "our address" instead of "forwarding address"	David Gibson	2024-08-21	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \|	The term "forwarding address" to indicate the local-to-passt address was well-intentioned, but ends up being kinda confusing. As discussed on a recent call, let's try "our" instead. (While we're there correct an error in flow_initiate_af()s comments where we referred to parameters by the wrong name). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp_flow: move all udp_flow functions to udp_flow.c	Laurent Vivier	2024-08-05	1	-260/+0
\| \| \| \| \| \| \| \| \| \| \|	No code change. They need to be exported to be available by the vhost-user version of passt. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp_flow: Remove udp_meta_t from the parameters of udp_flow_from_sock()	Laurent Vivier	2024-08-05	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \|	To be used with the vhost-user version of udp.c, we need to export the udp_flow functions. To avoid to export udp_meta_t too that is specific to the socket version of udp.c, don't pass udp_meta_t to it, but the only needed field, s_in. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Rename UDP listening sockets	David Gibson	2024-07-19	1	-14/+11
\| \| \| \| \| \| \| \| \| \|	EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived sockets which can initiate new flows. Rename to EPOLL_TYPE_UDP_LISTEN and associated functions to match. Along with that, remove the .orig field from union udp_listen_epoll_ref, since it is now always true. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove rdelta port forwarding maps	David Gibson	2024-07-19	1	-36/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In addition to the struct fwd_ports used by both UDP and TCP to track port forwarding, UDP also included an 'rdelta' field, which contained the reverse mapping of the main port map. This was used so that we could properly direct reply packets to a forwarded packet where we change the destination port. This has now been taken over by the flow table: reply packets will match the flow of the originating packet, and that gives the correct ports on the originating side. So, eliminate the rdelta field, and with it struct udp_fwd_ports, which now has no additional information over struct fwd_ports. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove obsolete socket tracking	David Gibson	2024-07-19	1	-91/+1
\| \| \| \| \| \| \| \| \|	Now that UDP datagrams are all directed via the flow table, we no longer use the udp_tap_map[ or udp_act[] arrays. Remove them and connected code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Direct datagrams from host to guest via flow table	David Gibson	2024-07-19	1	-134/+51
\| \| \| \| \| \| \| \| \| \| \|	This replaces the last piece of existing UDP port tracking with the common flow table. Specifically use the flow table to direct datagrams from host sockets to the guest tap interface. Since this now requires a flow for every datagram, we add some logging if we encounter any datagrams for which we can't find or create a flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Find or create flows for datagrams from tap interface	David Gibson	2024-07-19	1	-114/+97
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we create flows for datagrams from socket interfaces, and use them to direct "spliced" (socket to socket) datagrams. We don't yet match datagrams from the tap interface to existing flows, nor create new flows for them. Add that functionality, matching datagrams from tap to existing flows when they exist, or creating new ones. As with spliced flows, when creating a new flow from tap to socket, we create a new connected socket to receive reply datagrams attached to that flow specifically. We extend udp_flow_sock_handler() to handle reply packets bound for tap rather than another socket. For non-obvious reasons (perhaps increased stack usage?), this caused a failure for me when running under valgrind, because valgrind invoked rt_sigreturn which is not in our seccomp filter. Since we already allow rt_sigaction and others in the valgrind target, it seems reasonable to add rt_sigreturn as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove obsolete splice tracking	David Gibson	2024-07-19	1	-47/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that spliced datagrams are managed via the flow table, remove UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With those removed, the 'ts' field in udp_splice_port is also no longer used. struct udp_splice_port now contains just a socket fd, so replace it with a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still used for tracking of automatic port forwarding. Finally, the 'splice' field of union udp_epoll_ref is no longer used so remove it as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Handle "spliced" datagrams with per-flow sockets	David Gibson	2024-07-19	1	-261/+175
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Create flows for datagrams from originating sockets	David Gibson	2024-07-19	1	-4/+165
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This implements the first steps of tracking UDP packets with the flow table rather than its own (buggy) set of port maps. Specifically we create flow table entries for datagrams received from a socket (PIF_HOST or PIF_SPLICE). When splitting datagrams from sockets into batches, we group by the flow as well as splicesrc. This may result in smaller batches, but makes things easier down the line. We can re-optimise this later if necessary. For now we don't do anything else with the flow, not even match reply packets to the same flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Handle errors on UDP sockets	David Gibson	2024-07-17	1	-0/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we ignore all events other than EPOLLIN on UDP sockets. This means that if we ever receive an EPOLLERR event, we'll enter an infinite loop on epoll, because we'll never do anything to clear the error. Luckily that doesn't seem to have happened in practice, but it's certainly fragile. Furthermore changes in how we handle UDP sockets with the flow table mean we will start receiving error events. Add handling of EPOLLERR events. For now we just read the error from the error queue (thereby clearing the error state) and print a debug message. We can add more substantial handling of specific events in future if we want to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp, tcp: Tweak handling of no_udp and no_tcp flags	David Gibson	2024-07-17	1	-2/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We abort the UDP socket handler if the no_udp flag is set. But if UDP was disabled we should never have had a UDP socket to trigger the handler in the first place. If we somehow did, ignoring it here isn't really going to help because aborting without doing anything is likely to lead to an epoll loop. The same is the case for the TCP socket and timer handlers and the no_tcp flag. Change these checks on the flag to ASSERT()s. Similarly add ASSERT()s to several other entry points to the protocol specific code which should never be called if the protocol is disabled. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Make udp_sock_recv static	David Gibson	2024-07-17	1	-2/+2
\| \| \| \| \| \| \| \|	Through an oversight this was previously declared as a public function although it's only used in udp.c and there is no prototype in any header. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Consolidate datagram batching	David Gibson	2024-07-05	1	-90/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we receive datagrams on a socket, we need to split them into batches depending on how they need to be forwarded (either via a specific splice socket, or via tap). The logic to do this, is somewhat awkwardly split between udp_buf_sock_handler() itself, udp_splice_send() and udp_tap_send(). Move all the batching logic into udp_buf_sock_handler(), leaving udp_splice_send() to just send the prepared batch. udp_tap_send() reduces to just a call to tap_send_frames() so open-code that call in udp_buf_sock_handler(). This will allow separating the batching logic from the rest of the datagram forwarding logic, which we'll need for upcoming flow table support. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Move some more of sock_handler tasks into sub-functions	David Gibson	2024-07-05	1	-46/+84
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_buf_sock_handler(), udp_splice_send() and udp_tap_send loosely, do four things between them: 1. Receive some datagrams from a socket 2. Split those datagrams into batches depending on how they need to be sent (via tap or via a specific splice socket) 3. Prepare buffers for each datagram to send it onwards 4. Actually send it onwards Split (1) and (3) into specific helper functions. This isn't immediately useful (udp_splice_prepare(), in particular, is trivial), but it will make further reworks clearer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't repeatedly initialise udp[46]_eth_hdr	David Gibson	2024-07-05	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \|	Since we split our packet frame buffers into different pieces, we have a single buffer per IP version for the ethernet header, rather than one per frame. This makes sense since our ethernet header is alwaus the same. However we initialise those buffers udp[46]_eth_hdr inside a per frame loop. Pull that outside the loop so we just initialise them once. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Unify udp[46]_l2_iov	David Gibson	2024-07-05	1	-23/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The only differences between these arrays are that udp4_l2_iov is pre-initialised to point to the IPv4 ethernet header, and IPv4 per-frame header and udp6_l2_iov points to the IPv6 versions. We already have to set up a bunch of headers per-frame, including updating udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len. It makes more sense to adjust the IOV entries to point at the correct headers for the frame than to have two complete sets of iovecs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Unify udp[46]_mh_splice	David Gibson	2024-07-05	1	-27/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where the only difference is that they point to different sockaddr buffers for the destination address. Unify these by having the common array point at a sockaddr_inany as the address. This does mean slightly more work when we're about to splice, because we need to write the whole socket address, rather than just the port. However it removes 32 mmsghdr structures and we're going to need more flexibility constructing that target address for the flow table. Because future changes might mean that the address isn't always loopback, change the name of the common address from *_localname to udp_splicename. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Rename IOV and mmsghdr arrays	David Gibson	2024-07-05	1	-34/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make the salient points about these various arrays clearer with renames: * udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do with L2. They are, however, specific to receiving not sending. Rename to udp_iov_recv and udp[46]_mh_recv. * udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa. Rename to udp[46]_l2_iov * udp[46]_localname are (for now) pre-populated with the local address but the more salient point is that these are the destination address for the splice arrays. Rename to udp[46]_splice_to Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Pass full epoll reference through more of sock handler path	David Gibson	2024-07-05	1	-30/+33
\| \| \| \| \| \| \| \| \| \| \| \|	udp_buf_sock_handler() takes the epoll reference from the receiving socket, and passes the UDP relevant part on to several other functions. Future changes are going to need several different epoll types for UDP, and to pass that information through to some of those functions. To avoid extra noise in the patches making the real changes, change those functions now to take the full epoll reference, rather than just the UDP part. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: sock_l4() determine protocol from epoll type rather than the reverse	David Gibson	2024-07-05	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \|	sock_l4() creates a socket of the given IP protocol number, and adds it to the epoll state. Currently it determines the correct tag for the epoll data based on the protocol. However, we have some future cases where we might want different semantics, and therefore epoll types, for sockets of the same protocol. So, change sock_l4() to take the epoll type as an explicit parameter, and determine the protocol from that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Reduce scope of rport in udp_invert_portmap()2024_06_24.1ee2eca	Stefano Brivio	2024-06-24	1	-2/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	cppcheck 2.14 warns that the scope of the rport variable could be reduced: do that, as reverted commit c80fa6a6bb44 ("udp: Make rport calculation more local") did, but keep the temporary variable of in_port_t type, otherwise the sum gets promoted to int. While at it, add a comment explaining why we calculate rport like this instead of directly using the sum as array index. Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Revert "udp: Make rport calculation more local"	Stefano Brivio	2024-06-24	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \|	This reverts commit c80fa6a6bb4415ad48f9e11424310875d0d99bc7, as it reintroduces the issue fixed by commit 1e6f92b995a9 ("udp: Fix 16-bit overflow in udp_invert_portmap()"). Reported-by: Laurent Jacquot <jk@lutty.net> Link: https://bugs.passt.top/show_bug.cgi?id=80 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Move management of udp[46]_localname into udp_splice_send()	David Gibson	2024-06-14	1	-5/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	Mostly, udp_sock_handler() is independent of how the datagrams it processes will be forwarded (tap or splice). However, it also updates the msg_name fields for spliced sends, which doesn't really make sense here. Move it into udp_splice_send() which is all about spliced sends. This does potentially mean we'll update the field to the same value several times, but we're going to need this in future anyway: with the extensions the flow table allows, it might not be the same value each time after all. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Rework how we divide queued datagrams between sending methods	David Gibson	2024-06-14	1	-61/+87
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_handler() takes a number of datagrams from sockets that depending on their addresses could be forwarded either to the L2 interface ("tap") or to another socket ("spliced"). In the latter case we can also only send packets together if they have the same source port, and therefore are sent via the same socket. To reduce the total number of system calls we gather contiguous batches of datagrams with the same destination interface and socket where applicable. The determination of what the target is is made by udp_mmh_splice_port(). It returns the source port for splice packets and -1 for "tap" packets. We find batches by looking ahead in our queue until we find a datagram whose "splicefrom" port doesn't match the first in our current batch. udp_mmh_splice_port() is moderately expensive, and unfortunately we can call it twice on the same datagram: once as the (last + 1) entry in one batch (to check it's not in that batch), then again as the first entry in the next batch. Avoid this by keeping track of the "splice port" in the metadata structure, and filling it in one entry ahead of the one we're currently considering. This is a bit subtle, but not that hard. It will also generalise better when we have more complex possibilities based on the flow table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fold checking of splice flag into udp_mmh_splice_port()	David Gibson	2024-06-14	1	-15/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_mmh_splice_port() is used to determine if a UDP datagram can be "spliced" (forwarded via a socket instead of tap). We only invoke it if the origin socket has the 'splice' flag set. Fold the checking of the flag into the helper itself, which makes the caller simpler. It does mean we have a loop looking for a batch of spliceable or non-spliceable packets even in the case where the flag is clear. This shouldn't be that expensive though, since each call to udp_mmh_splice_port() will return without accessing memory in that case. In any case we're going to need a similar loop in more cases with upcoming flow table work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	vhost-user: compare mode MODE_PASTA and not MODE_PASST	Laurent Vivier	2024-06-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	As we are going to introduce the MODE_VU that will act like the mode MODE_PASST, compare to MODE_PASTA rather than to add a comparison to MODE_VU when we check for MODE_PASST. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: rename udp_sock_handler() to udp_buf_sock_handler()	Laurent Vivier	2024-06-13	1	-3/+3
\| \| \| \| \| \| \| \| \|	We are going to introduce a variant of the function to use vhost-user buffers rather than passt internal buffers. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: refactor UDP header update functions	Laurent Vivier	2024-06-13	1	-26/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit refactors the udp_update_hdr4() and udp_update_hdr6() functions to improve code portability by replacing the udp_meta_t parameter with more specific parameters for the IPv4 and IPv6 headers (iphdr/ipv6hdr) and the source socket address (sockaddr_in/sockaddr_in6). It also moves the tap_hdr_update() function call inside the udp_tap_send() function not to have to pass the TAP header to udp_update_hdr4() and udp_update_hdr6() This refactor reduces complexity by making the functions more modular and ensuring that each function operates on more narrowly scoped data structures. This will facilitate future backend introduction like vhost-user. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Make rport calculation more local	David Gibson	2024-06-07	1	-2/+1
\| \| \| \| \| \| \| \| \|	cppcheck 2.14.1 complains about the rport variable not being in as small as scope as it could be. It's also only used once, so we might as well just open code the calculation for it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Single buffer for IPv4, IPv6 headers and metadata	David Gibson	2024-05-02	1	-77/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we have separate arrays for IPv4 and IPv6 which contain the headers for guest-bound packets, and also the originating socket address. We can combine these into a single array of "metadata" structures with space for both pre-cooked IPv4 and IPv6 headers, as well as shared space for the tap specific header and socket address (using sockaddr_inany). Because we're using IOVs to separately address the pieces of each frame, these structures don't need to be packed to keep the headers contiguous so we can more naturally arrange for the alignment we want. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use the same buffer for the L2 header for all frames	David Gibson	2024-05-02	1	-21/+15
\| \| \| \| \| \| \| \| \| \|	Currently each tap-bound frame buffer has room for its own ethernet header. However the ethernet header is always the same for such frames, so now that we're indirectly referencing the ethernet header via iov, we can use a single buffer for all of them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Share payload buffers between IPv4 and IPv6	David Gibson	2024-05-02	1	-59/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently the IPv4 and IPv6 paths unnecessarily use different buffers for the UDP payload. Now that we're handling the various pieces of the UDP packets with an iov, we can split the payload part of the buffers off into its own array shared between IPv4 and IPv6. As well as saving a little memory, this allows the payload buffers to be neatly page aligned. With the buffers merged, udp[46]_l2_iov_sock contain exactly the same thing as each other and can also be merged. Likewise udp[46]_iov_splice can be merged together. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Explicitly set checksum in guest-bound UDP headers	David Gibson	2024-05-02	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For IPv4, UDP checksums are optional and can just be set to 0. udp_update_hdr4() ignores the checksum field entirely. Since these are set to 0 during startup, this works as intended for now. However, we'd like to share payload and UDP header buffers betweem IPv4 and IPv6, which does calculate UDP checksums. Therefore, for robustness, we should explicitly set the checksum field to 0 for guest-bound UDP packets. In the tap_udp4_send() slow path, however, we do allow IPv4 UDP checksums to be calculated as a compile time option. For consistency, use the same thing in the udp_update_hdr4() path, which will typically initialize to 0, but calculate a real checksum if configured to do so. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Combine initialisation of IPv4 and IPv6 iovs	David Gibson	2024-05-02	1	-61/+53
\| \| \| \| \| \| \| \| \| \|	We're going to introduce more sharing between the IPv4 and IPv6 buffer structures. Prepare for this by combinng the initialisation functions. While we're at it remove the misleading "sock" from the name since these initialise both tap side and sock side structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split tap-bound UDP packets into multiple buffers using io vector	David Gibson	2024-05-02	1	-25/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When sending to the tap device, currently we assemble the headers and payload into a single contiguous buffer. Those are described by a single struct iovec, then a batch of frames is sent to the device with tap_send_frames(). In order to better integrate the IPv4 and IPv6 paths, we want the IP header in a different buffer that might not be contiguous with the payload. To prepare for that, split the UDP packet into an iovec of buffers. We use the same split that Laurent recently introduced for TCP for convenience. This removes the last use of tap_hdr_len_(), tap_frame_base() and tap_frame_len(), so remove those too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	iov: Helper macro to construct iovs covering existing variables or fields	David Gibson	2024-05-02	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \|	Laurent's recent changes mean we use IO vectors much more heavily in the TCP code. In many of those cases, and few others around the code base, individual iovs of these vectors are constructed to exactly cover existing variables or fields. We can make initializing such iovs shorter and clearer with a macro for the purpose. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise variable names for various packet lengths	David Gibson	2024-05-02	1	-13/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At various points we need to track the lengths of a packet including or excluding various different sets of headers. We don't always use the same variable names for doing so. Worse in some places we use the same name for different things: e.g. tcp_fill_headers[46]() use ip_len for the length including the IP headers, but then tcp_send_flag() which calls it uses it to mean the IP payload length only. To improve clarity, standardise on these names: dlen: L4 protocol payload length ("data length") l4len: plen + length of L4 protocol header l3len: l4len + length of IPv4/IPv6 header l2len: l3len + length of L2 (ethernet) header Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Make csum_ip4_header() take a host endian length	David Gibson	2024-05-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	csum_ip4_header() takes the packet length as a network endian value. In general it's very error-prone to pass non-native-endian values as a raw integer. It's particularly bad here because this differs from other checksum functions (e.g. proto_ipv4_header_psum()) which take host native lengths. It turns out all the callers have easy access to the native endian value, so switch it to use host order like everything else. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Split tap specific and L2 (ethernet) headers	David Gibson	2024-05-02	1	-9/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In some places (well, actually only UDP now) we use struct tap_hdr to represent both tap backend specific and L2 ethernet headers. Handling these together seemed like a good idea at the time, but Laurent's changes in the TCP code working towards vhost-user support suggest that treating them separately is more useful, more often. Alter struct tap_hdr to represent only the TAP backend specific headers. Updated related helpers and the UDP code to match. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>