passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	tcp: Don't compute total bytes in a message until we need it	David Gibson	2023-01-23	1	-35/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp[46]_l2_buf_bytes keep track of the total number of bytes we have queued to send to the tap interface. tcp_l2_buf_flush_passt() uses this to determine if sendmsg() has sent all the data we requested, or whether we need to resend a trailing portion. However, the logic for finding where we're up to in the case of a short sendmsg() can equally well tell whether we've had one at all, without knowing the total number in advance. This does require an extra loop after each sendmsg(), but it's doing simple arithmetic on values we've already been accessing, and it leads to overall simpler code. tcp[46]_l2_flags_buf_bytes were being calculated, but never used for anything, so simply remove them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
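The short-send detection described above can be sketched as a walk over the iovec array. This is a minimal illustration, not the code from tcp_l2_buf_flush_passt(): the helper name `first_unsent()` and its signature are assumptions made for the example.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper, not taken from passt: given the iovec array handed
 * to sendmsg() and the byte count it returned, find the first buffer that
 * was not sent in full. Returns n if every buffer was sent completely.
 * On a short send, *off is the number of bytes of that buffer already
 * sent, so the caller can resend from iov_base + *off.
 */
static size_t first_unsent(const struct iovec *iov, size_t n,
			   size_t sent, size_t *off)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (sent < iov[i].iov_len)
			break;			/* this buffer is incomplete */
		sent -= iov[i].iov_len;		/* fully sent, move on */
	}

	*off = (i < n) ? sent : 0;
	return i;
}
```

A return value equal to n means there was no short send at all, which is why the total queued byte count does not need to be known in advance.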
*	tcp: Combine two parts of passt tap send path together	David Gibson	2023-01-23	1	-8/+12
\| \| \| \| \| \| \| \| \| \| \|	tcp_l2_buf_flush() open codes the "primary" send of a message to the passt tap interface, but calls tcp_l2_buf_flush_part() to handle the case of a short send. Combine these two passt-specific operations into tcp_l2_buf_flush_passt(), which is a little cleaner and will enable further cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pcap: Replace pcapm() with pcap_multiple()	David Gibson	2023-01-23	3	-11/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pcapm() captures multiple frames from a msghdr, however the only thing it cares about in the msghdr is the list of buffers, where it assumes there is one frame to capture per buffer. That's what we want for its single caller but it's not the only obvious choice here (one frame per msghdr would arguably make more sense in isolation). In addition pcapm() has logic that only makes sense in the context of the passt-specific path it's called from: it skips the first 4 bytes of each buffer, because those have the qemu vnet_len rather than the frame proper. Make this clearer by replacing pcapm() with pcap_multiple() which more explicitly takes one struct iovec per frame, and parameterizes how much of each buffer to skip (i.e. the offset of the frame within the buffer). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pcap: Introduce pcap_frame() helper	David Gibson	2023-01-23	1	-38/+38
\| \| \| \| \| \| \| \| \| \| \| \|	pcap(), pcapm() and pcapmm() duplicate some code, for the actual writing to the capture file. The purpose of pcapm() and pcapmm() not calling pcap() seems to be to avoid repeatedly calling gettimeofday() and to avoid printing errors for every packet in a batch if there's a problem. We can accomplish that while still sharing code by adding a new helper which takes the packet timestamp as a parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
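A rough sketch of the idea of a shared helper that takes the timestamp as a parameter, so a batch needs only one gettimeofday() call. The record layout follows the classic pcap per-record header (four 32-bit fields). The names `struct pcap_rec_hdr` and `pcap_record()` are assumptions for this example and are not passt's pcap_frame(); the sketch writes into a memory buffer rather than a capture file.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/time.h>

/* Per-record header of the classic pcap file format */
struct pcap_rec_hdr {
	uint32_t ts_sec;	/* timestamp, seconds */
	uint32_t ts_usec;	/* timestamp, microseconds */
	uint32_t caplen;	/* bytes of the frame stored in the file */
	uint32_t len;		/* original length of the frame */
};

/* Hypothetical helper: serialise one record (header plus frame) into out.
 * The caller supplies the timestamp, so it can be taken once per batch.
 * Returns the number of bytes written, or 0 if out is too small.
 */
static size_t pcap_record(uint8_t *out, size_t out_len,
			  const void *frame, uint32_t frame_len,
			  const struct timeval *tv)
{
	struct pcap_rec_hdr h = {
		.ts_sec  = (uint32_t)tv->tv_sec,
		.ts_usec = (uint32_t)tv->tv_usec,
		.caplen  = frame_len,
		.len     = frame_len,
	};

	if (out_len < sizeof(h) + frame_len)
		return 0;

	memcpy(out, &h, sizeof(h));
	memcpy(out + sizeof(h), frame, frame_len);
	return sizeof(h) + frame_len;
}
```

A single-frame caller would take the time itself and pass it in, while a batch caller would reuse one timestamp for every frame and report an error once for the whole batch.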
*	udp: Don't use separate sockets to listen for spliced packets	David Gibson	2023-01-13	1	-40/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, when ports are forwarded inbound in pasta mode, we open two sockets for incoming traffic: one listens on the public IP address and will forward packets to the tuntap interface. The other listens on localhost and forwards via "splicing" (resending directly via sockets in the ns). Now that we've improved the logic about whether we "splice" any individual packet, we don't need this. Instead we can have a single socket bound to 0.0.0.0 or ::, marked as able to splice and udp_sock_handler() will deal with each packet as appropriate. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Decide whether to "splice" per datagram rather than per socket	David Gibson	2023-01-13	2	-20/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we have special sockets for receiving datagrams from localhost which can use the optimized "splice" path rather than going across the tap interface. We want to loosen this so that sockets can receive packets that will be forwarded by both the spliced and non-spliced paths. To do this, we alter the meaning of the @splice bit in the reference to mean that packets received on this socket can be spliced, not that they will be spliced. They'll only actually be spliced if they come from 127.0.0.1 or ::1. We can't (for now) remove the splice bit entirely, unlike with TCP. Our gateway mapping means that if the ns initiates communication to the gw address, we'll translate that to target 127.0.0.1 on the host side. Reply packets will therefore have source address 127.0.0.1 when received on the host, but these need to go via the tap path where that will be translated back to the gateway address. We need the @splice bit to distinguish that case from packets going from localhost to a port mapped explicitly with -u which should be spliced. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
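The per-datagram decision described above might look roughly like the following. This is only a sketch: `src_is_loopback()` and `should_splice()` are hypothetical names, and the `can_splice` flag stands in for the @splice bit of the epoll reference.

```c
#include <stdbool.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: true if the source address is 127.0.0.1 or ::1 */
static bool src_is_loopback(const struct sockaddr *sa)
{
	if (sa->sa_family == AF_INET) {
		const struct sockaddr_in *s4 = (const struct sockaddr_in *)sa;

		return s4->sin_addr.s_addr == htonl(INADDR_LOOPBACK);
	}

	if (sa->sa_family == AF_INET6) {
		const struct sockaddr_in6 *s6 = (const struct sockaddr_in6 *)sa;

		return IN6_IS_ADDR_LOOPBACK(&s6->sin6_addr);
	}

	return false;
}

/* can_splice means "datagrams on this socket may be spliced"; whether a
 * given datagram is spliced also depends on where it came from.
 */
static bool should_splice(bool can_splice, const struct sockaddr *src)
{
	return can_splice && src_is_loopback(src);
}
```

With the flag clear, a reply from 127.0.0.1 to a gateway-mapped flow still takes the tap path, which is the case the commit message says the bit is needed for.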
*	udp: Unify udp_sock_handler_splice() with udp_sock_handler()	David Gibson	2023-01-13	1	-60/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	These two functions now have a very similar structure, and their first part (calling recvmmsg()) is functionally identical. So, merge the two functions into one. This does have the side effect of meaning we no longer receive multiple packets at once for splice (we already didn't for tap). This does hurt throughput for small spliced packets, but improves it for large spliced packets and tap packets. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Pre-populate msg_names with local address	David Gibson	2023-01-13	2	-22/+25
\| \| \| \| \| \| \| \| \| \| \|	udp_splice_namebuf is now used only for spliced sending, and so it is only ever populated with the localhost address, either IPv4 or IPv6. So, replace the awkward initialization in udp_sock_handler_splice() with statically initialized versions for IPv4 and IPv6. We then just need to update the port number in udp_sock_handler_splice(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't handle tap receive batch size calculation within a #define	David Gibson	2023-01-13	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	UDP_MAX_FRAMES gives the maximum number of datagrams we'll ever handle as a batch for sizing our buffers and control structures. The subtly different UDP_TAP_FRAMES gives the maximum number of datagrams we'll actually try to receive at once for tap packets in the current configuration. This depends on the mode, meaning that the macro has a non-obvious dependency on the usual 'c' context variable being available. We only use it in one place, so it makes more sense to open code this. Add an explanatory comment while we're there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split receive from preparation and send in udp_sock_handler()	David Gibson	2023-01-13	1	-27/+52
\| \| \| \| \| \| \| \| \| \| \|	The receive part of udp_sock_handler() and udp_sock_handler_splice() is now almost identical. In preparation for merging that, split the receive part of udp_sock_handler() from the part preparing and sending the frames for sending on the tap interface. The latter goes into a new udp_tap_send() function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split sending to passt tap interface into separate function	David Gibson	2023-01-13	1	-58/+72
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The last part of udp_sock_handler() does the actual sending of frames to the tap interface. For pasta that's just a call to udp_tap_send_pasta() but for passt, it's moderately complex and open coded. For symmetry, move the passt send path into its own function, udp_tap_send_passt(). This will make it easier to abstract the tap interface in future (e.g. when we want to add vhost-user). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Move sending pasta tap frames to the end of udp_sock_handler()	David Gibson	2023-01-13	1	-19/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_handler() has a surprising difference in flow between pasta and passt mode: For pasta we send each frame to the tap interface as we prepare it. For passt, though, we prepare all the frames, then send them with a single sendmmsg(). Alter the pasta path to also prepare all the frames, then send them at the end. We already have a suitable data structure for the passt case. This will make it easier to abstract out the tap backend difference in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf/pasta_tcp: Add host to namespace cases for traffic via tap	Stefano Brivio	2023-01-05	1	-0/+57
\| \| \| \| \| \| \| \| \| \|	Similarly to UDP cases, these were missing as it wasn't clear, when the other tests were introduced, if using the global address of a namespace, from the host, should have resulted in connections being routed via the tap interface. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Explicitly check option length field values in tcp_opt_get()	Stefano Brivio	2023-01-05	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	Reported by Coverity (CWE-606, Untrusted loop bound), and actually harmless because we'll exit the option-scanning loop if the remaining length is not enough for a new option, instead of reading past the header. In any case, it looks like a good idea to explicitly check for reasonable values of option lengths. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
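For illustration, an option-scanning loop with explicit checks on the length field could be written as below. This is a generic sketch and not passt's tcp_opt_get(): the function name `tcp_opt_find()` and the macro names are assumptions. It relies only on the standard TCP option encoding, where End-of-List (kind 0) and No-Operation (kind 1) are single bytes and every other option carries a length byte that counts the kind and length bytes themselves.

```c
#include <stddef.h>
#include <stdint.h>

#define OPT_EOL	0	/* End of option list, single byte */
#define OPT_NOP	1	/* No-operation padding, single byte */

/* Hypothetical scanner: return a pointer to the first option of the given
 * kind (which must have a length byte, i.e. not EOL or NOP), or NULL if it
 * is absent or the option list is malformed.
 */
static const uint8_t *tcp_opt_find(const uint8_t *opts, size_t len,
				   uint8_t kind)
{
	while (len > 0) {
		uint8_t type = opts[0];
		uint8_t optlen;

		if (type == OPT_EOL)
			break;

		if (type == OPT_NOP) {
			opts++;
			len--;
			continue;
		}

		if (len < 2)		/* no room for a length byte */
			break;

		optlen = opts[1];
		if (optlen < 2 || optlen > len)	/* explicit sanity check */
			break;

		if (type == kind)
			return opts;

		opts += optlen;
		len -= optlen;
	}

	return NULL;
}
```

Rejecting lengths below 2 also rules out a zero-length option, which would otherwise keep the loop from advancing.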
*	test/perf/pasta_udp: Add host to namespace cases for traffic via tap	Stefano Brivio	2023-01-05	1	-0/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	These were missing as it wasn't clear, when the other tests were introduced, if using the global address of a namespace, from the host, should have resulted in traffic being routed via the tap interface (as opposed to the loopback interface). We now clarified that's actually the case. Use same values and thresholds as the tests for loopback traffic, as throughput figures currently indicate there isn't much difference. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Factor out control structure management from udp_sock_fill_data_v[46]	David Gibson	2022-12-06	1	-68/+50
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The main purpose of udp_sock_fill_data_v[46]() is to construct the IP, UDP and other headers we'll need to forward data onto the tap interface. In addition they update the control structures (iovec and mmsghdr) we'll need to send the messages, and in the case of pasta actually send them. This leaves the control structure management and the send itself awkwardly split between udp_sock_fill_data_v[46]() and their caller udp_sock_handler(). In addition, this tail part of udp_sock_fill_data_v[46]() is essentially common between the IPv4 and IPv6 versions, apart from which control array we're working on. Clean this up by reducing these functions to just construct the headers and renaming them to udp_update_hdr[46]() accordingly. The control structure updates are now all in the caller, and common for IPv4 and IPv6. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Preadjust udp[46]_l2_iov_tap[].iov_base for pasta mode	David Gibson	2022-12-06	1	-18/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, we always populate udp[46]_l2_iov_tap[].iov_base with the very start of the header buffers, including space for the qemu vnet_len tag suitable for passt mode. That's ok because we don't actually use these iovecs for pasta mode. However, we do know the mode in udp_sock[46]_iov_init() so adjust these to the beginning of the headers we'll actually need for the mode: including the vnet_len tag for passt, but excluding it for pasta. This allows a slightly nicer way to locate the right buffer to send in the pasta case, and will allow some additional cleanups later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Better factor IPv4 and IPv6 paths in udp_sock_handler()	David Gibson	2022-12-06	1	-22/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Apart from which mh array they're operating on the recvmmsg() calls in udp_sock_handler() are identical between the IPv4 and IPv6 paths, as are some of the control structure updates. By using some local variables to refer to the IP version specific control arrays, make some more logic common between the IPv4 and IPv6 paths. As well as slightly reducing the code size, this makes it less likely that we'll accidentally use the IPv4 arrays in the IPv6 path or vice versa as we did in a recently fixed bug. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix incorrect use of IPv6 mh buffers in IPv4 path	David Gibson	2022-12-06	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_handler() incorrectly uses udp6_l2_mh_tap[] on the IPv4 path. In fact this is harmless because this assignment is redundant (the 0th entry msg_hdr will always point to the 0th iov entry for both IPv4 and IPv6 and won't change). There is also an incorrect usage of udp6_l2_mh_tap[] in udp_sock_fill_data_v4. This one can cause real problems, because we'll use stale iov_len values if we send multiple messages to the qemu socket. Most of the time that will be relatively harmless - we're likely to either drop UDP packets, or send duplicates. However, if the stale iov_len we use ends up referencing an uninitialized buffer we could desynchronize the qemu stream socket. Correct both these bugs. The UDP6 path appears to be correct, but it does have some comments that incorrectly reference the IPv4 versions, so fix those as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Correct splice forwarding when receiving from multiple sources	David Gibson	2022-12-06	1	-5/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_handler_splice() reads a whole batch of datagrams at once with recvmmsg(). It then forwards them all via a single socket on the other side, based on the source port. However, it's entirely possible that the datagrams in the set have different source ports, and thus ought to be forwarded via different sockets on the destination side. In fact this situation arises with the iperf -P4 throughput tests in our own test suite. AFAICT we only get away with this because iperf3 is strictly one way and doesn't send reply packets which would be misdirected because of the incorrect source ports. Alter udp_sock_handler_splice() to split the packets it receives into batches with the same source address and send each batch with a separate sendmmsg(). For now we only look for already contiguous batches, which means that if there are multiple active flows interleaved this is likely to degenerate to batches of size 1. For now this is the simplest way to correct the behaviour and we can try to optimize later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
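The "already contiguous batches" approach can be sketched as follows. The helpers `same_src_run()` and `count_batches()` are hypothetical and operate on a plain array of source ports instead of the real mmsghdr arrays; in real code each run would be sent with its own sendmmsg() call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length of the run of consecutive datagrams, starting
 * at index start (which must be < n), that share the source port of
 * src[start].
 */
static size_t same_src_run(const uint16_t *src, size_t n, size_t start)
{
	size_t i = start + 1;

	while (i < n && src[i] == src[start])
		i++;

	return i - start;
}

/* Walk a received batch run by run and return how many separate sends
 * would be needed.
 */
static size_t count_batches(const uint16_t *src, size_t n)
{
	size_t i = 0, batches = 0;

	while (i < n) {
		i += same_src_run(src, n, i);	/* one send per run */
		batches++;
	}

	return batches;
}
```

For strictly interleaved flows, such as source ports 1, 2, 1, 2, every run has length 1, which is the degenerate case the commit message mentions.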
*	udp: Split send half of udp_sock_handler_splice() from the receive half	David Gibson	2022-12-06	1	-23/+53
\| \| \| \| \| \| \| \| \|	Move the part of udp_sock_handler_splice() concerned with sending out the datagrams into a new udp_splice_sendfrom() helper. This will make later cleanups easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Unify buffers for tap and splice paths	David Gibson	2022-12-06	1	-40/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We maintain a set of buffers for UDP packets to be forwarded via the tap interface in udp[46]_l2_buf. We then have a separate set of buffers for packets to be "spliced" in udp_splice_buf[]. However, we only use one of these at a time, so we can share the buffer space. For the receiving splice packets we can not only re-use the data buffers but also the udp[46]_l2_iov_sock and udp[46]_l2_mh_sock control structures. For sending the splice packets we keep the same data buffers, but we need specific control structures. We create udp[46]_iov_splice - we can't reuse udp_l2_iov_sock[] because we need to write iov_len as we're writing spliced packets, but the tap path expects iov_len to remain the same (it only uses it for receive). Likewise we create udp[46]_mh_splice with the mmsghdr structures for sending spliced packets. As well as needing to reference different iovs, these need to all reference udp_splice_namebuf instead of individual msg_name fields for each slot. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Add helper to extract port from a sockaddr_in or sockaddr_in6	David Gibson	2022-12-06	1	-12/+14
\| \| \| \| \| \| \| \| \| \|	udp_sock_handler_splice() has a somewhat clunky if to extract the port from a socket address which could be either IPv4 or IPv6. Future changes are going to make this even more clunky, so introduce a helper function to do this extraction. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
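Such a helper might look like the sketch below. The name `sa_port()` and the choice to return 0 for an unknown address family are assumptions for this example, not necessarily what the commit implements.

```c
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: port of an IPv4 or IPv6 socket address, in host
 * byte order. Returns 0 for any other address family (an assumption made
 * for this sketch).
 */
static uint16_t sa_port(const struct sockaddr *sa)
{
	switch (sa->sa_family) {
	case AF_INET:
		return ntohs(((const struct sockaddr_in *)sa)->sin_port);
	case AF_INET6:
		return ntohs(((const struct sockaddr_in6 *)sa)->sin6_port);
	default:
		return 0;
	}
}
```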
*	udp: Make UDP_SPLICE_FRAMES and UDP_TAP_FRAMES_MEM the same thing	David Gibson	2022-12-06	1	-28/+27
\| \| \| \| \| \| \| \| \| \|	These two constants have the same value, and there's not a lot of reason they'd ever need to be different. Future changes will further integrate the spliced and "tap" paths so that these need to be the same. So, merge them into UDP_MAX_FRAMES. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Simplify udp_sock_handler_splice	David Gibson	2022-12-06	1	-32/+15
\| \| \| \| \| \| \| \|	Previous cleanups mean that we can now rework some complex ifs in udp_sock_handler_splice() into a simpler set. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Update UDP "connection" timestamps in both directions	David Gibson	2022-12-06	1	-2/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A UDP pseudo-connection between port A in the init namespace and port B in the pasta guest namespace involves two sockets: udp_splice_init[v6][B] and udp_splice_ns[v6][A]. The socket which originated this "connection" will be permanent but the other one will be closed on a timeout. When we get a packet from the originating socket, we update the timeout on the other socket, but we don't do the same when we get a reply packet from the other socket. However any activity on the "connection" probably indicates that it's still in use. Without this we could incorrectly time out a "connection" if it's using a protocol which involves a single initiating packet, but which then gets continuing replies from the target. Correct this by updating the timeout on both sockets for a packet in either direction. This also updates the timestamps for the permanent originating sockets which is unnecessary, but harmless. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't explicitly track originating socket for spliced "connections"	David Gibson	2022-12-06	1	-61/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we look up udp_splice_to_ns[][].orig_sock in udp_sock_handler_splice() we're finding the socket on which the originating packet for the "connection" was received. However, we don't specifically need this socket to be the originating one - we just need one that's bound to the source port of this reply packet in the init namespace. We can look this up in udp_splice_to_init[v6][src].target_sock, whose defining characteristic is exactly that. The same applies with init and ns swapped. In practice, of course, the port we locate this way will always be the originating port, since we couldn't have started this "connection" if it wasn't. Change this, and we no longer need the @orig_sock field at all. That leaves just @target_sock which we rename to simply @sock. The whole udp_splice_flow structure now more represents a single bound port than a "flow" per se, so rename and recomment it accordingly. Likewise the udp_splice_to_{ns,init} names are now misleading, since the ports in those maps are used in both directions. Rename them to udp_splice_{ns,init} indicating the location where the described socket is bound. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Re-use fixed bound sockets for packet forwarding when possible	David Gibson	2022-12-06	1	-9/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we look up udp_splice_to_ns[v6][src].target_sock in udp_sock_handler_splice(), all we really require of the socket is that it be bound to port src in the pasta guest namespace. Similarly for udp_splice_to_init but bound in the init namespace. Usually these sockets are created temporarily by udp_splice_connect() and cleaned up by udp_timer(). However, depending on the -u and -U options it's possible we have a permanent socket bound to the relevant port created by udp_sock_init(). If such a socket exists, we could use it instead of creating a temporary one. In fact we must use it, because we'll fail trying to bind() a temporary one to the same port. So allow this, store permanently bound sockets into udp_splice_to_{ns,init} in udp_sock_init(). These won't get incorrectly removed by the timer because we don't put a corresponding entry in the udp_act[] structure which directs the timer what to clean up. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't create double sockets for -U port	David Gibson	2022-12-06	1	-18/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For each IP version udp_socket() has 3 possible calls to sock_l4(). One is for the "non-spliced" bound socket in the init namespace, one for the "spliced" bound socket in the init namespace and one for the "spliced" bound socket in the pasta namespace. However when this is called to create a socket in the pasta namespace there is a logic error which causes it to take the path for the init side spliced socket as well as the ns socket. This essentially tries to create two identical sockets on the ns side. Unsurprisingly the second bind() call fails according to strace. Correct this to only attempt to open one socket within the ns. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split splice field in udp_epoll_ref into (mostly) independent bits	David Gibson	2022-12-06	3	-35/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The @splice field in union udp_epoll_ref can have a number of values for different types of "spliced" packet flows. Split it into several single bit fields with more or less independent meanings. The new @splice field is just a boolean indicating whether the socket is associated with a spliced flow, making it identical to the @splice field in tcp_epoll_ref. The new bit @orig indicates whether this is a socket which can originate new udp packet flows (created with -u or -U) or a socket created on the fly to handle reply packets. @ns indicates whether the socket lives in the init namespace or the pasta namespace. Making these bits more orthogonal to each other will simplify some future cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove the @bound field from union udp_epoll_ref	David Gibson	2022-12-06	2	-7/+4
\| \| \| \| \| \| \|	We set this field, but nothing ever checked it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't connect "forward" sockets for spliced flows	David Gibson	2022-12-06	1	-50/+35
\| \| \| \| \| \| \| \| \| \| \| \|	Currently we connect() the socket we use to forward spliced UDP flows. However, we now only ever use sendto() rather than send() on this socket so there's not actually any need to connect it. Don't do so. Rename a number of things that referred to "connect" or "conn" since that would now be misleading. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Always use sendto() rather than send() for forwarding spliced packets	David Gibson	2022-12-06	1	-33/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_handler_splice() has two different ways of sending out packets once it has determined the correct destination socket. For the originating sockets (which are not connected) it uses sendto() to specify a specific address. For the forward socket (which is connected) we use send(). However, even for the forward socket we do also know the correct destination address. We can use this to use sendto() instead of send(), removing the need for two different paths and some staging data structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Separate tracking of inbound and outbound packet flows	David Gibson	2022-12-06	1	-57/+57
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Each entry udp_splice_map[v6][N] keeps information about two essentially unrelated packet flows. @ns_conn_sock, @ns_conn_ts and @init_bound_sock track a packet flow from port N in the host init namespace to some other port in the pasta namespace (the one @ns_conn_sock is connected to). @init_conn_sock, @init_conn_ts and @ns_bound_sock track packet flow from port N in the pasta namespace to some other port in the host init namespace (the one @init_conn_sock is connected to). Split udp_splice_map[][] into two separate tables for the two directions. Each entry in each table is a 'struct udp_splice_flow' with @orig_sock (previously the bound socket), @target_sock (previously the connected socket) and @ts (the timeout for the target socket). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Also bind() connected ports for "splice" forwarding	David Gibson	2022-12-06	1	-52/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pasta handles "spliced" port forwarding by resending datagrams received on a bound socket in the init namespace to a connected socket in the guest namespace. This means there are actually three ports associated with each "connection". First there's the source and destination ports of the originating datagram. That's also the destination port of the forwarded datagram, but the source port of the forwarded datagram is the kernel allocated bound address of the connected socket. However, by bind()ing as well as connect()ing the forwarding socket we can choose the source port of the forwarded datagrams. By choosing it to match the original source port we remove that surprising third port number and no longer need to store port numbers in struct udp_splice_port. As a bonus this means that the recipient of the packets will see the original source port if they call getpeername(). This rarely matters, but it can't hurt. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt, tap: Process data on the socket before HUP/ERR events	Richard W.M. Jones	2022-11-25	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the case where the client writes a packet and then closes the socket, because we receive EPOLLIN\|EPOLLRDHUP together we have a choice of whether to close the socket immediately, or read the packet and then close the socket. Choose the latter. This should improve fuzzing coverage and arguably is a better choice even for regular use since dropping packets on close is bad. See-also: https://archives.passt.top/passt-dev/20221117171805.3746f53a@elisabeth/ Signed-off-by: Richard W.M. Jones <rjones@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
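The ordering choice can be illustrated with a small Linux-only sketch that maps an epoll event mask to an ordered list of steps. The names `enum tap_step` and `tap_event_order()` are hypothetical and do not come from passt's tap handler.

```c
#include <stdint.h>
#include <sys/epoll.h>

enum tap_step { TAP_READ, TAP_CLOSE };

/* Hypothetical helper: fill steps[] with the actions to take for an epoll
 * event mask, in order, and return how many there are (0 to 2). Pending
 * data is always read before the socket is closed, so a packet written
 * just before the peer closed is not dropped.
 */
static int tap_event_order(uint32_t events, enum tap_step steps[2])
{
	int n = 0;

	if (events & EPOLLIN)
		steps[n++] = TAP_READ;

	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
		steps[n++] = TAP_CLOSE;

	return n;
}
```

For EPOLLIN\|EPOLLRDHUP arriving together, this yields a read followed by a close, which is the behaviour the commit chooses.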
*	passt, tap: Add --fd option	Richard W.M. Jones	2022-11-25	5	-4/+46
\| \| \| \| \| \| \| \| \|	This passes a fully connected stream socket to passt. Signed-off-by: Richard W.M. Jones <rjones@redhat.com> [sbrivio: reuse fd_tap instead of adding a new descriptor, imply --one-off on --fd, add to optstring and usage()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	build: Remove *~ files with make clean	Richard W.M. Jones	2022-11-25	1	-1/+1
\| \| \| \| \| \| \| \|	These files are left around by emacs amongst other editors. Signed-off-by: Richard W.M. Jones <rjones@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	build: Force-create pasta symlink	Richard W.M. Jones	2022-11-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	If you run the build several times it will fail unnecessarily with: ln -s passt pasta ln: failed to create symbolic link 'pasta': File exists make: *** [Makefile:134: pasta] Error 1 Signed-off-by: Richard W.M. Jones <rjones@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Pass union tcp_conn pointer to destroy and splice timer functions	Stefano Brivio	2022-11-25	3	-16/+21
\| \| \| \| \| \| \| \| \| \| \| \|	The pointers are actually the same, but we later pass the container union to tcp_table_compact(), which might zero the size of the whole union, and this confuses Coverity Scan. Given that we have pointers to the container union to start with, just pass those instead, all the way down to tcp_table_compact(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Use dual stack sockets for port forwarding when possible	David Gibson	2022-11-25	1	-2/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as well as native IPv6 connections. By doing this we halve the number of listening sockets we need for TCP (assuming passt/pasta is listening on the same ports for IPv4 and IPv6). When forwarding many ports (e.g. -t all) this can significantly reduce the amount of kernel memory that passt consumes. When forwarding all TCP and UDP ports for both IPv4 and IPv6 (-t all -u all), this reduces kernel memory usage from ~677MiB to ~487MiB (kernel version 6.0.8 on Fedora 37, x86_64). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Always return -1 on error in sock_l4()	David Gibson	2022-11-25	1	-1/+1
\| \| \| \| \| \| \| \| \|	According to its doc comments, sock_l4() returns -1 on error. It does, except in one case where it returns -EIO. Fix this inconsistency to match the docs and always return -1. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Allow sock_l4() to open dual stack sockets	David Gibson	2022-11-25	2	-2/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, when instructed to open an IPv6 socket, sock_l4() explicitly sets the IPV6_V6ONLY socket option so that the socket will only respond to IPv6 connections. Linux (and probably other platforms) allow "dual stack" sockets: IPv6 sockets which can also accept IPv4 connections. Extend sock_l4() to be able to make such sockets, by passing AF_UNSPEC as the address family and no bind address (binding to a specific address would defeat the purpose). We add a Makefile define 'DUAL_STACK_SOCKETS' to indicate availability of this feature on the target platform. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
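As a rough illustration of the dual stack behaviour described above, the sketch below opens a listening TCP socket on the IPv6 wildcard address with IPV6_V6ONLY cleared. The helper name dual_stack_listen() and the backlog value are assumptions made for this example; this is not the sock_l4() implementation and it ignores the AF_UNSPEC convention and the DUAL_STACK_SOCKETS define mentioned in the message.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not passt's sock_l4(): open a TCP listening socket on
 * the IPv6 wildcard address that can also accept IPv4 connections.
 * Returns the socket descriptor, or -1 on error.
 */
int dual_stack_listen(uint16_t port)
{
	struct sockaddr_in6 addr;
	int off = 0, fd;

	fd = socket(AF_INET6, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	/* Clearing IPV6_V6ONLY is what makes the socket dual stack */
	if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof(off)))
		goto fail;

	memset(&addr, 0, sizeof(addr));
	addr.sin6_family = AF_INET6;
	addr.sin6_port = htons(port);
	addr.sin6_addr = in6addr_any;	/* wildcard: no specific bind address */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) ||
	    listen(fd, 128))
		goto fail;

	return fd;

fail:
	close(fd);
	return -1;
}
```

On such a socket, IPv4 peers show up in accept() with IPv4-mapped IPv6 addresses (::ffff:a.b.c.d), so callers need to handle that form.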
*	tcp: Consolidate tcp_sock_init[46]	David Gibson	2022-11-25	1	-35/+15
\| \| \| \| \| \| \| \| \|	Previous cleanups mean that tcp_sock_init4() and tcp_sock_init6() are almost identical, and the remaining differences can be easily parameterized. Combine both into a single tcp_sock_init_af() function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Allow splicing of connections from IPv4-mapped loopback	David Gibson	2022-11-25	1	-8/+11
\| \| \| \| \| \| \| \| \| \| \|	For non-spliced connections we now treat IPv4-mapped IPv6 addresses the same as the corresponding IPv4 addresses. However currently we won't splice a connection from ::ffff:127.0.0.1 the way we would one from 127.0.0.1. Correct this so that we can splice connections from IPv4 localhost that have been received on an IPv6 dual stack socket. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses	David Gibson	2022-11-25	2	-38/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	passt usually doesn't NAT, but it does do so for the remapping of the gateway address to refer to the host. Currently we perform this NAT with slightly different rules on both IPv4 addresses and IPv6 addresses, but not on IPv4-mapped IPv6 addresses. This means we won't correctly handle the case of an IPv4 connection over an IPv6 socket, which is possible on Linux (and probably other platforms). Refactor tcp_conn_from_sock() to perform the NAT after converting either address family into an inany_addr, so IPv4 and IPv4-mapped addresses have the same representation. With two new helpers this lets us remove the IPv4 and IPv6 specific paths from tcp_conn_from_sock(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
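The idea of giving IPv4 and IPv4-mapped addresses one representation can be sketched as follows. This is only an illustration using a plain struct in6_addr; the function names are invented for the example and do not reflect the actual inany_addr type or the helpers the commit adds.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Illustrative only: store any IPv4 or IPv6 address as a struct in6_addr,
 * using the IPv4-mapped form (::ffff:a.b.c.d) for IPv4 addresses.
 */
void addr_from_sockaddr(struct in6_addr *dst, const struct sockaddr *sa)
{
	if (sa->sa_family == AF_INET) {
		const struct sockaddr_in *sa4 = (const struct sockaddr_in *)sa;

		memset(dst, 0, sizeof(*dst));
		dst->s6_addr[10] = dst->s6_addr[11] = 0xff;
		memcpy(&dst->s6_addr[12], &sa4->sin_addr, 4);
	} else {
		const struct sockaddr_in6 *sa6 = (const struct sockaddr_in6 *)sa;

		*dst = sa6->sin6_addr;
	}
}

/* If @a is IPv4-mapped, copy the embedded IPv4 address to @out and return 1,
 * otherwise return 0.
 */
int addr_v4(const struct in6_addr *a, struct in_addr *out)
{
	if (!IN6_IS_ADDR_V4MAPPED(a))
		return 0;

	memcpy(out, &a->s6_addr[12], sizeof(*out));
	return 1;
}

/* Loopback check treating 127.0.0.0/8 and its IPv4-mapped form alike */
int addr_is_loopback(const struct in6_addr *a)
{
	struct in_addr a4;

	if (addr_v4(a, &a4))
		return (ntohl(a4.s_addr) >> 24) == 127;

	return IN6_IS_ADDR_LOOPBACK(a);
}
```

With a single representation, rules such as NAT or loopback checks only need to be written once, after the conversion, rather than once per address family.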
*	tcp: Remove v6 flag from tcp_epoll_ref	David Gibson	2022-11-25	3	-13/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This bit in the TCP specific epoll reference indicates whether the connection is IPv6 or IPv4. However the sites which refer to it are already calling accept() which (optionally) returns an address for the remote end of the connection. We can use the sa_family field in that address to determine the connection type independent of the epoll reference. This does have a cost: for the spliced case, it means we now need to get that address from accept() which introduces an extra copy_to_user(). However, in future we want to allow handling IPv4 connections through IPv6 sockets, which means we won't be able to determine the IP version at the time we create the listening socket and epoll reference. So, at some point we'll have to pay this cost anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Fix small errors in tcp_seq_init() time handling	David Gibson	2022-11-25	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It looks like tcp_seq_init() is supposed to advance the sequence number by one every 32ns. However we only right shift the ns part of the timespec not the seconds part, meaning that we'll advance by an extra 32 steps on each second. I don't know if that's exploitable in any way, but it doesn't appear to be the intent, nor what RFC 6528 suggests. In addition, we convert from seconds to nanoseconds with a multiplication by '1E9'. In C '1E9' is a floating point constant, forcing a conversion to floating point and back for what should be an integer calculation (confirmed with objdump and Makefile default compiler flags). Spell out 1000000000 in full to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
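The arithmetic described above can be sketched like this: combine seconds and nanoseconds into one integer nanosecond count first, then shift the whole value so that the counter advances once every 32ns. The helper name ticks_32ns() is an assumption for this example and is not the body of tcp_seq_init().

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: number of 32ns ticks represented by @ts, truncated to
 * 32 bits. The constant is written as an integer literal: 1E9 would be a
 * double in C and pull the calculation into floating point.
 */
uint32_t ticks_32ns(const struct timespec *ts)
{
	uint64_t ns = (uint64_t)ts->tv_sec * 1000000000 + (uint64_t)ts->tv_nsec;

	/* Shift the combined value, not just the nanoseconds part */
	return (uint32_t)(ns >> 5);
}
```

For example, a timespec of one second and zero nanoseconds gives 1000000000 / 32 = 31250000 ticks.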
*	tcp: Have tcp_seq_init() take its parameters from struct tcp_conn	David Gibson	2022-11-25	1	-26/+10
\| \| \| \| \| \| \| \| \| \| \|	tcp_seq_init() takes a number of parameters for the connection, but at every call site, these are already populated in the tcp_conn structure. Likewise we always store the result into the @seq_to_tap field. Use this to simplify tcp_seq_init(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify initial sequence number calculation for IPv4 and IPv6	David Gibson	2022-11-25	2	-28/+19
\| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_seq_init() has separate paths for IPv4 and IPv6 addresses, which means we will calculate different sequence numbers for IPv4 and equivalent IPv4-mapped IPv6 addresses. Change it to treat these the same by always converting the input address into an inany_addr representation and use that to calculate the sequence number. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>