passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	udp: Update UDP checksum using an iovec array	Laurent Vivier	2024-10-04	1	-3/+11
\| \| \| \| \| \| \| \|	As for tcp_update_check_tcp4()/tcp_update_check_tcp6(), change csum_udp4() and csum_udp6() to use an iovec array. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Don't risk truncating frames on full buffer in tap_pasta_input()	David Gibson	2024-09-06	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder. Adjust the code to make sure we always have room for a maximum size frame. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Restructure in tap_pasta_input()	David Gibson	2024-09-06	1	-26/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue. The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop. Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Improve handling of EINTR in tap_passt_input()	David Gibson	2024-09-06	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewed EPOLLIN event immediately in that case, since we're using level triggered signalling. So, handle EINTR separately by immediately retrying until we succeed or get a different type of error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
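The retry behaviour described in this message can be sketched as below. This is a minimal illustration, not the code from tap.c: the helper name `recv_retry_eintr()` is hypothetical, and the caller is assumed to treat EAGAIN/EWOULDBLOCK as "wait for the next EPOLLIN".

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper, not the project's code: retry recv() right away on
 * EINTR. Every other outcome (data, EOF, EAGAIN/EWOULDBLOCK, hard errors)
 * is handed back to the caller unchanged.
 */
static ssize_t recv_retry_eintr(int fd, void *buf, size_t len, int flags)
{
	ssize_t n;

	do
		n = recv(fd, buf, len, flags);
	while (n < 0 && errno == EINTR);

	return n;
}
```

With level-triggered epoll, returning on EINTR would only cause another EPOLLIN wakeup for the same data, so looping here saves a trip through the event loop.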
*	tap: Split out handling of EPOLLIN events	David Gibson	2024-09-06	1	-14/+36
\| \| \| \| \| \| \| \| \| \|	Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Rename MAC address fields for clarity	David Gibson	2024-08-21	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \|	c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ndp.c: Turn NDP responder into more declarative implementation	AbdAlRahman Gad	2024-08-13	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Add structs for NA, RA, NS, MTU, prefix info, option header, link-layer address, RDNSS, DNSSL and link-layer for RA message. - Turn NA message from purely imperative, going byte by byte, to declarative by filling its struct. - Turn part of RA message into declarative. - Move packet_add() to be before the call of ndp() in tap6_handler() if the protocol of the packet is ICMPv6. - Add a pool of packets as an additional parameter to ndp(). - Check the size of NS packet with packet_get() before sending an NA packet. - Add documentation for the structs. - Add an enum for NDP option types. Link: https://bugs.passt.top/show_bug.cgi?id=21 Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com> [sbrivio: Minor coding style fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Improve handling of partially received frames on qemu socket (tag: 2024_07_26.57a21d2)	David Gibson	2024-07-26	1	-13/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because the Unix socket to qemu is a stream socket, we have no guarantee of where the boundaries between recv() calls will lie. Typically they will lie on frame boundaries, because that's how qemu will send them, but we can't rely on it. Currently we handle this case by detecting when we have received a partial frame and performing a blocking recv() to get the remainder, and only then processing the frames. Change it so instead we save the partial frame persistently and include it as the first thing processed next time we receive data from the socket. This handles a number of (unlikely) cases which previously would not be dealt with correctly: * If qemu sent a partial frame then waited some time before sending the remainder, previously we could block here for an unacceptably long time * If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without doing the partial frame handling, which would put us out of sync with the stream from qemu * If the blocking recv() only received some of the remainder of the frame, not all of it, we'd return leaving us out of sync with the stream again Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This is probably acceptable because it's an unlikely case in practice. If necessary we could mitigate this by using a true ring buffer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
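A rough sketch of the "keep the partial frame for next time" approach follows. It assumes the framing described elsewhere in this log (a 32-bit big-endian length followed by the frame); `consume_frames()` is a hypothetical name, and the sketch skips the bounds checks on the frame length that real code would need.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: walk buf[0..*len), counting complete length-prefixed
 * frames, then move any trailing partial frame to the start of the buffer
 * so that the next recv() can append to it. Updates *len to the number of
 * leftover bytes and returns the number of complete frames found.
 */
static size_t consume_frames(uint8_t *buf, size_t *len)
{
	size_t off = 0, n = 0;

	while (*len - off >= sizeof(uint32_t)) {
		uint32_t l2len;

		memcpy(&l2len, buf + off, sizeof(l2len)); /* may be unaligned */
		l2len = ntohl(l2len);

		if (l2len > *len - off - sizeof(uint32_t))
			break;	/* frame not fully received yet */

		/* A real handler would process the frame at this point */
		off += sizeof(uint32_t) + l2len;
		n++;
	}

	memmove(buf, buf + off, *len - off);	/* keep the partial tail */
	*len -= off;

	return n;
}
```

The memmove() is the cost the caveat in the message refers to: it only copies the leftover bytes, which in the common case is nothing at all.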
*	tap: Correctly handle frames of odd length	David Gibson	2024-07-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The Qemu socket protocol consists of a 32-bit frame length in network (BE) order, followed by the Ethernet frame itself. As far as I can tell, frames can be any length, with no particular alignment requirement. This means that although pkt_buf itself is aligned, if we have a frame of odd length, frames after it will have their frame length at an unaligned address. Currently we load the frame length by just casting a char pointer to (uint32_t *) and loading. Some platforms will generate a fatal trap on such an unaligned load. Even if they don't, casting an incorrectly aligned pointer to (uint32_t *) is undefined behaviour, strictly speaking. Introduce a new helper to safely load a possibly unaligned value here. We assume that the compiler is smart enough to optimize this into nothing on platforms that provide performant unaligned loads. If that turns out not to be the case, we can look at improvements then. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
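One common way to write such a helper is to copy through memcpy(), which has no alignment requirement. The sketch below illustrates that technique only: the name `load_be32_unaligned()` is hypothetical and the helper actually added by this commit may look different.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit big-endian value from an address with
 * no alignment guarantee. memcpy() into a local avoids the undefined
 * behaviour of dereferencing a misaligned uint32_t pointer.
 */
static uint32_t load_be32_unaligned(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));

	return ntohl(v);
}
```

Compilers typically reduce the fixed-size memcpy() to a plain load on targets where unaligned access is cheap.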
*	tap: Don't use EPOLLET on Qemu sockets	David Gibson	2024-07-26	1	-10/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however. Furthermore, using EPOLLET causes additional complications: 1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however we do set it when using pasta mode with --fd. This inconsistency doesn't seem to have broken anything, but it's odd. 2) EPOLLET requires that tap_handler_passt() loop until all data available is read (otherwise we may have data in the buffer but never get an event causing us to read it). We do that with a rather ugly goto. Worse, our condition for that goto appears to be incorrect. We'll only loop if rem is non-zero, which will only happen if we perform a blocking recv() for a partially received frame. We'll only perform that second recv() if the original recv() resulted in a partially read frame. As far as I can tell the original recv() could end on a frame boundary (never triggering the second recv()) even if there is additional data in the socket buffer. In that circumstance we wouldn't goto redo and could leave unprocessed frames in the qemu socket buffer indefinitely. This doesn't seem to have caused any problems in practice, but since there's no obvious reason to use EPOLLET here anyway, we might as well get rid of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Don't attempt to carry on if we get a bad frame length from qemu	David Gibson	2024-07-26	1	-9/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we receive a too-short or too-long frame from the QEMU socket, currently we try to skip it and carry on. That sounds sensible on first blush, but probably isn't wise in practice. If this happens, either (a) qemu has done something seriously unexpected, or (b) we've received corrupt data over a Unix socket. Or more likely (c), we have a bug elsewhere which has put us out of sync with the stream, so we're trying to read something that's not a frame length as a frame length. Neither (b) nor (c) is really salvageable with the same stream. Case (a) might be ok, but we can no longer be confident qemu won't do something else we can't cope with. So, instead of just skipping the frame and trying to carry on, log an error and close the socket. As a bonus, establishing firm bounds on l2len early will allow simplifications to how we deal with the case where a partial frame is recv()ed. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Better report errors receiving from QEMU socket	David Gibson	2024-07-26	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we get an error on recv() from the QEMU socket, we currently don't print any kind of error. Although this can happen in a non-fatal situation such as a guest restarting, it's unusual enough that we really should report something for debuggability. Add an error message in this case. Also always report when the qemu connection closes for any reason, not just when it will cause us to exit (--one-off). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Change error message: it's not necessarily QEMU, and mention that we are resetting the connection] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Exit if we fail to bind a UNIX domain socket with explicit path	Stefano Brivio	2024-07-25	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tap_sock_unix_open(), if we have a given path for the socket from configuration, we don't need to loop over possible paths, so we exit the loop on the first iteration, unconditionally. But if we failed to bind() the socket to that explicit path, we should exit, instead of continuing. Otherwise we'll pretend we're up and running, but nobody can contact us, and this might be mildly confusing for users. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down	Stefano Brivio	2024-07-25	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we start pasta with some ports forwarded, but no --config-net, say: $ ./pasta -u 10001 and then use a local, non-loopback address to send traffic to that port, say: $ socat -u FILE:test UDP4:192.0.2.1:10001 pasta writes to the tap file descriptor, but if the interface is down, we get EIO and terminate. By itself, what I'm doing in this case is not very useful (I simply forgot to pass --config-net), but if we happen to have a DHCP client in the network namespace, the interface might still be down while somebody tries to send traffic to it, and exiting in that case is not really helpful. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	icmp: Obtain destination addresses from the flowsides	David Gibson	2024-07-19	1	-11/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	icmp_sock_handler() obtains the guest address from its most recently observed IP. However, this can now be obtained from the common flowside information. icmp_tap_handler() builds its socket address for sendto() directly from the destination address supplied by the incoming tap packet. This can instead be generated from the flow. Using the flowsides as the common source of truth here prepares us for allowing more flexible NAT and forwarding by properly initialising that flowside information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Replace strerror() calls	Stefano Brivio	2024-06-21	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we have logging functions embedding perror() functionality, we can make _some_ calls more terse by using them. In many places, the strerror() calls are still more convenient because, for example, they are used in flow debugging functions, or because the return code variable of interest is not 'errno'. While at it, convert a few error messages from a scant perror style to proper failure descriptions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: use in->buf_size rather than sizeof(pkt_buf)	Laurent Vivier	2024-06-13	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \|	buf_size is set to sizeof(pkt_buf) by default. And it seems more correct to provide the actual size of the buffer. Later a buf_size of 0 will allow vhost-user mode to detect guest memory buffers. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	vhost-user: compare mode MODE_PASTA and not MODE_PASST	Laurent Vivier	2024-06-13	1	-6/+6
\| \| \| \| \| \| \| \| \| \|	As we are going to introduce the MODE_VU that will act like the mode MODE_PASST, compare to MODE_PASTA rather than to add a comparison to MODE_VU when we check for MODE_PASST. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: refactor packets handling functions	Laurent Vivier	2024-06-13	1	-49/+61
\| \| \| \| \| \| \| \| \| \| \| \| \|	Consolidate pool_tap4() and pool_tap6() into tap_flush_pools(), and tap4_handler() and tap6_handler() into tap_handler(). Create a generic tap_add_packet() to consolidate packet addition logic and reduce code duplication. The purpose is to ease the export of these functions to use them with the vhost-user backend. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	clang-tidy: Enable the bugprone-macro-parentheses check	David Gibson	2024-06-07	1	-18/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We globally disabled this, with a justification lumped together with several checks about braces. They don't really go together: the others are essentially a stylistic choice which doesn't match our style. Omitting brackets on macro parameters can lead to real and hard to track down bugs if an expression is ever passed to the macro instead of a plain identifier. We've only gotten away with the macros which trigger the warning because, owing to other conventions, it's been unlikely to invoke them with anything other than a simple identifier. Fix the macros, and enable the warning for the future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
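The class of bug this check guards against is easy to show with two made-up macros (both names are hypothetical and appear nowhere in the tree):

```c
/* Hypothetical macros, for illustration only */
#define WORDS_TO_BYTES_BAD(n)	(n * 4)		/* parameter not parenthesised */
#define WORDS_TO_BYTES_OK(n)	((n) * 4)	/* what the check asks for */

/* WORDS_TO_BYTES_BAD(1 + 2) expands to (1 + 2 * 4), that is 9, whereas
 * WORDS_TO_BYTES_OK(1 + 2) expands to ((1 + 2) * 4), that is 12.
 */
```

Both macros give the same result for a plain identifier, which is why such a problem can stay hidden for a long time.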
*	conf, passt, tap: Open socket and PID files before switching UID/GID	Stefano Brivio	2024-05-23	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \|	Otherwise, if the user runs us as root, and gives us paths that are only accessible by root, we'll fail to open them, which might in turn encourage users to change permissions or ownerships: definitely a bad idea in terms of security. Reported-by: Minxi Hou <mhou@redhat.com> Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com>
*	tap: Split tap_sock_unix_init() into opening and listening parts	Stefano Brivio	2024-05-23	1	-12/+27
\| \| \| \| \| \| \| \| \|	We'll need to open and bind the socket a while before listening to it, so split that into two different functions. No functional changes intended. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	passt, tap: Don't use -1 as uninitialised value for fd_tap_listen	Stefano Brivio	2024-05-23	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a remnant from the time we kept access to the original filesystem and we could reinitialise the listening AF_UNIX socket. Since commit 0515adceaa8f ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application"), however, we can't re-bind the listening socket once we're up and running. Drop the -1 initialisation and the corresponding check. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Move all-ones initialisation of mac_guest to tap_sock_init()	Stefano Brivio	2024-05-23	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It has nothing to do with tap_sock_unix_init(). It used to be there as that function could be called multiple times per passt instance, but it's not the case anymore. This also takes care of the fact that, with --fd, we wouldn't set the initial MAC address, so we would need to wait for the guest to send us an ARP packet before we could exchange data. Fixes: 6b4e68383c66 ("passt, tap: Add --fd option") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Richard W.M. Jones <rjones@redhat.com>
*	iov: Helper macro to construct iovs covering existing variables or fields	David Gibson	2024-05-02	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \|	Laurent's recent changes mean we use IO vectors much more heavily in the TCP code. In many of those cases, and few others around the code base, individual iovs of these vectors are constructed to exactly cover existing variables or fields. We can make initializing such iovs shorter and clearer with a macro for the purpose. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
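A macro of this kind could look roughly like the following. The name `IOV_COVERING()` is hypothetical and chosen for this sketch; the macro introduced by the commit may be named and written differently.

```c
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical macro: build a struct iovec covering exactly one existing
 * variable or field, taking both address and size from the lvalue itself.
 */
#define IOV_COVERING(lval) \
	((struct iovec){ .iov_base = &(lval), .iov_len = sizeof(lval) })
```

A caller would write something like `struct iovec iov = IOV_COVERING(hdr);` instead of spelling out the base pointer and length separately, which removes the chance of the two getting out of step.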
*	treewide: Standardise variable names for various packet lengths	David Gibson	2024-05-02	1	-68/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At various points we need to track the lengths of a packet including or excluding various different sets of headers. We don't always use the same variable names for doing so. Worse, in some places we use the same name for different things: e.g. tcp_fill_headers[46]() use ip_len for the length including the IP headers, but then tcp_send_flag() which calls it uses it to mean the IP payload length only. To improve clarity, standardise on these names: dlen: L4 protocol payload length ("data length") l4len: dlen + length of L4 protocol header l3len: l4len + length of IPv4/IPv6 header l2len: l3len + length of L2 (ethernet) header Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
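As a worked example of the naming scheme, the sketch below derives each length for a UDP datagram over IPv4, with l4len built on top of the L4 payload length. The struct, the function and the header-size constants are assumptions made for this illustration (no IP options, no VLAN tag) and are not taken from the tree.

```c
#include <stddef.h>

#define ETH_HDR_LEN	14	/* Ethernet header, no VLAN tag */
#define IP4_HDR_LEN	20	/* IPv4 header, no options */
#define UDP_HDR_LEN	8	/* UDP header */

/* Hypothetical container for the four standardised lengths */
struct pkt_lens {
	size_t dlen;	/* L4 payload ("data") length */
	size_t l4len;	/* dlen + L4 header */
	size_t l3len;	/* l4len + IP header */
	size_t l2len;	/* l3len + Ethernet header */
};

static struct pkt_lens udp4_lens(size_t dlen)
{
	struct pkt_lens l;

	l.dlen = dlen;
	l.l4len = l.dlen + UDP_HDR_LEN;
	l.l3len = l.l4len + IP4_HDR_LEN;
	l.l2len = l.l3len + ETH_HDR_LEN;

	return l;
}
```

For a 100-byte payload this gives l4len = 108, l3len = 128 and l2len = 142.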
*	checksum: Make csum_ip4_header() take a host endian length	David Gibson	2024-05-02	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	csum_ip4_header() takes the packet length as a network endian value. In general it's very error-prone to pass non-native-endian values as a raw integer. It's particularly bad here because this differs from other checksum functions (e.g. proto_ipv4_header_psum()) which take host native lengths. It turns out all the callers have easy access to the native endian value, so switch it to use host order like everything else. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Remove misleading and redundant endianness notes	David Gibson	2024-05-02	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In general, it's much less error-prone to have the endianness of values implied by the type, rather than just noting it in comments. We can't always easily avoid it, because C, but we can do so when possible. struct in_addr and in6_addr are always encoded network endian, so noting it explicitly isn't useful. Remove them. In some cases we also have endianness notes on uint8_t parameters, which doesn't make sense: for a single byte endianness is irrelevant. Remove those too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Implement tap_send() "slow path" in terms of fast path	David Gibson	2024-03-14	1	-21/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Most times we send frames to the guest it goes via tap_send_frames(). However "slow path" protocols - ARP, ICMP, ICMPv6, DHCP and DHCPv6 - go via tap_send(). As well as being a semantic duplication, tap_send() contains at least one serious problem: it doesn't properly handle short sends, which can be fatal on the qemu socket connection, since frame boundaries will get out of sync. Rewrite tap_send() to call tap_send_frames(). While we're there, rename it tap_send_single() for clarity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Simplify some casts in the tap "slow path" functions	David Gibson	2024-03-14	1	-23/+18
\| \| \| \| \| \| \| \|	We can both remove some variables which differ from others only in type, and slightly improve type safety. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Extend tap_send_frames() to allow multi-buffer frames	David Gibson	2024-03-14	1	-31/+52
\| \| \| \| \| \| \| \| \| \| \|	tap_send_frames() takes a vector of buffers and requires exactly one frame per buffer. We have future plans where we want to have multiple buffers per frame in some circumstances, so extend tap_send_frames() to take the number of buffers per frame as a parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Improve comment to rembufs calculation] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Capture only packets that are actually sent	Laurent Vivier	2024-03-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	In tap_send_frames(), if we failed to send all the frames, we must only log the frames that have been sent, not all the frames we wanted to send. Fixes: dda7945ca9c9 ("pcap: Handle short writes in pcap_frame()") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: make tap_update_mac() generic	Laurent Vivier	2024-03-06	1	-5/+5
\| \| \| \| \| \| \| \| \| \|	Use ethhdr rather than tap_hdr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-9-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: use csum_ip4_header() in udp.c and tcp.c	Laurent Vivier	2024-03-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We can find the same function to compute the IPv4 header checksum in tcp.c, udp.c and tap.c. Use the function defined for tap.c, csum_ip4_header(), but with the code used in tcp.c and udp.c as it doesn't need a fully initialised IPv4 header, only protocol, tot_len, saddr and daddr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-7-lvivier@redhat.com> [dwg: Fix weird cppcheck regression; it appears to be a problem in pre-existing code, but somehow this patch is exposing it] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: move IP stuff from util.[ch] to ip.[ch]	Laurent Vivier	2024-03-06	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Introduce ip.[ch] file to encapsulate IP protocol handling functions and structures. Modify various files to include the new header ip.h when it's needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-5-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Disallow loopback addresses on tap interface	David Gibson	2024-02-29	1	-0/+19
\| \| \| \| \| \| \| \| \| \| \| \| \|	The "tap" interface, whether it's actually a tuntap device or a qemu socket, presents a virtual external link between different network hosts. Hence, loopback addresses make no sense there. However, nothing prevents the guest from putting bogus packets with loopback addresses onto the interface and it's not entirely clear what effect that will have on passt. Explicitly test for such packets and drop them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
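A test for loopback addresses of the kind described here might be written as follows. The function names are hypothetical and this is not the check added to tap.c, only a sketch using standard socket API definitions.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/* Hypothetical helper: true for any address in 127.0.0.0/8 */
static bool ip4_is_loopback(const struct in_addr *a)
{
	return (ntohl(a->s_addr) >> 24) == 127;
}

/* Hypothetical helper: true for the IPv6 loopback address, ::1 */
static bool ip6_is_loopback(const struct in6_addr *a)
{
	return IN6_IS_ADDR_LOOPBACK(a);
}
```

A frame handler would apply such a test to both the source and the destination address of each packet and drop the packet if either one matches.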
*	tap: Use write_remainder() in tap_send_frames_passt()	David Gibson	2024-02-29	1	-25/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	When we determine we have sent a partial frame in tap_send_frames_passt(), we call tap_send_remainder() to send the remainder of it. The logic in that function is very similar to that in the more general write_remainder() except that it uses send() instead of write()/writev(). But we are dealing specifically with the qemu socket here, which is a connected stream socket. In that case write()s do the same thing as send() with the options we were using, so we can just reuse write_remainder(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pcap: Handle short writes in pcap_frame()	David Gibson	2024-02-29	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently pcap_frame() assumes that if write() doesn't return an error, it has written everything we want. That's not necessarily true, because it could return a short write. That's not likely to happen on a regular file, but there's not a lot of reason not to be robust here; it's conceivable we might want to direct the pcap fd at a named pipe or similar. So, make pcap_frame() handle short frames by using the write_remainder() helper. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Formatting fix, and avoid gcc warning in pcap_frame()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
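The general pattern for coping with short writes is a loop that keeps writing from where the previous call stopped. The sketch below shows that pattern for a single flat buffer; `write_all()` is a hypothetical name and is not the project's write_remainder() helper, whose interface is not shown in this log.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf to fd, continuing after short
 * writes and retrying on EINTR. Returns 0 on success, -1 on error with
 * errno set by write().
 */
static int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len) {
		ssize_t rc = write(fd, p, len);

		if (rc < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		p += rc;		/* skip what was written */
		len -= (size_t)rc;
	}

	return 0;
}
```

On a regular file the loop body usually runs once; on a pipe or socket it may take several iterations.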
*	iov: Add helper to find skip over first n bytes of an io vector	David Gibson	2024-02-29	1	-7/+5
\| \| \| \| \| \| \| \| \|	Several of the IOV functions in iov.c, and also tap_send_frames_passt(), need to determine which buffer element a byte offset into an IO vector lies in. Split this out into a helper function iov_skip_bytes(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
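The lookup described here, from a byte offset to the buffer element containing it, can be sketched like this. The function below is an illustration with a hypothetical name and signature; it is not claimed to match iov_skip_bytes() as implemented in iov.c.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical sketch: find which element of iov[0..n) holds the byte at
 * offset skip, treating the buffers as one contiguous range. Returns the
 * element index (n if skip is at or past the end) and, if offset is
 * non-NULL, stores the remaining offset within that element.
 */
static size_t iov_skip_sketch(const struct iovec *iov, size_t n,
			      size_t skip, size_t *offset)
{
	size_t i;

	for (i = 0; i < n && skip >= iov[i].iov_len; i++)
		skip -= iov[i].iov_len;

	if (offset)
		*offset = skip;

	return i;
}
```

With buffers of 3, 4 and 5 bytes, skipping 5 bytes lands at index 1 with an offset of 2.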
*	treewide: Make a bunch of pointer variables pointers to const	David Gibson	2024-01-16	1	-6/+6
\| \| \| \| \| \| \| \| \| \|	Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another warning for pointer variables which could be pointer to const but aren't. Use this to make a bunch of variables const pointers where they previously weren't for no particular reason. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use 'z' length modifier for size_t/ssize_t conversions	Stefano Brivio	2023-12-02	1	-6/+6
\| \| \| \| \| \| \| \|	Types size_t and ssize_t are not necessarily long, it depends on the architecture. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
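For reference, the conversions in question are `%zu` for size_t and `%zd` for ssize_t. A small sketch, with a hypothetical function name and message text:

```c
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical example: 'z' makes the conversion match size_t/ssize_t on
 * every architecture, unlike %lu/%ld, which assume they are long.
 */
static int fmt_io_result(char *out, size_t outlen, size_t want, ssize_t got)
{
	return snprintf(out, outlen, "wanted %zu, got %zd", want, got);
}
```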
*	tap, pasta: Handle short writes to /dev/tap (tag: 2023_11_10.5ec3634)	David Gibson	2023-11-10	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tap_send_frames_pasta() sends frames to the namespace by sending them to our /dev/tap device. If that write() returns an error, we already handle it. However we don't handle the case where the write() returns short, meaning we haven't successfully transmitted the whole frame. I don't know if this can ever happen with the kernel tap device, but we should at least report the case so we don't get a cryptic failure. For the purposes of the return value for tap_send_frames_pasta() we treat this case as though it was an error (on the grounds that a partial frame is no use to the namespace). Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	tap, pasta: Handle incomplete tap sends for pasta too	David Gibson	2023-11-10	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since a469fc39 ("tcp, tap: Don't increase tap-side sequence counter for dropped frames") we've handled more gracefully the case where we get data from the socket side, but are temporarily unable to send it all to the tap side (e.g. due to full buffers). That code relies on tap_send_frames() returning the number of frames it successfully sent, which in turn gets it from tap_send_frames_passt() or tap_send_frames_pasta(). While tap_send_frames_passt() has returned that information since b62ed9ca ("tap: Don't pcap frames that didn't get sent"), tap_send_frames_pasta() always returns as though it successfully sent every frame. However there certainly are cases where it will return early without sending all frames. Update it to report that properly, so that the calling functions can handle it properly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	pif: Pass originating pif to tap handler functions	David Gibson	2023-11-07	1	-10/+16
\| \| \| \| \| \| \| \| \| \| \|	For now, packets passed to the various *_tap_handler() functions always come from the single "tap" interface. We want to allow the possibility to broaden that in future. As preparation for that, have the code in tap.c pass the pif id of the originating interface to each of those handler functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tap: Don't increase tap-side sequence counter for dropped frames	Stefano Brivio	2023-10-04	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...so that we'll retry sending them, instead of more-or-less silently dropping them. This happens quite frequently if our sending buffer on the UNIX domain socket is heavily constrained (for instance, by the 208 KiB default memory limit). It might be argued that dropping frames is part of the expected TCP flow: we don't dequeue those from the socket anyway, so we'll eventually retransmit them. But we don't need the receiver to tell us (by way of duplicate or missing ACKs) that we couldn't send them: we already know as sendmsg() reports that. This seems to considerably increase throughput stability and throughput itself for TCP connections with default wmem_max values. Unfortunately, the 16 bits left as padding in the frame descriptors we use internally aren't enough to uniquely identify for which connection we should update sequence numbers: create a parallel array of pointers to sequence numbers and L4 lengths, of TCP_FRAMES_MEM size, and go through it after calling sendmsg(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	cppcheck: Make many pointers const	David Gibson	2023-10-04	1	-7/+7
\| \| \| \| \| \| \| \| \|	Newer versions of cppcheck (as of 2.12.0, at least) added a warning for pointers which could be declared to point at const data, but aren't. Based on that, make many pointers throughout the codebase const. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp, tap: Correctly advance through packets in udp_tap_handler()	David Gibson	2023-09-08	1	-12/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In both tap4_handler() and tap6_handler(), once we've sorted incoming l3 packets into "sequences", we then step through all the packets in each UDP sequence calling udp_tap_handler(). Or so it appears. In fact, udp_tap_handler() doesn't take an index and always starts with packet 0 of the sequence, even if called repeatedly. It appears to be written with the idea that the struct pool is a queue, from which it consumes packets as it processes them, but that's not how the pool data structure works. Correct this by adding an index parameter to udp_tap_handler() and altering the loops in tap.c to step through the pool properly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tap: Correctly advance through packets in tcp_tap_handler()	David Gibson	2023-09-08	1	-8/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In both tap4_handler() and tap6_handler(), once we've sorted incoming l3 packets into "sequences", we then step through all the packets in each TCP sequence calling tcp_tap_handler(). Or so it appears. In fact, tcp_tap_handler() doesn't take an index and always looks at packet 0 of the sequence, except when it calls tcp_data_from_tap() to process data packets. It appears to be written with the idea that the struct pool is a queue, from which it consumes packets as it processes them, but that's not how the pool data structure works - they are more like an array of packets. We only get away with this, because setup packets for TCP tend to come in separate batches (because we need to reply in between) and so we only get a bunch of packets for the same connection together when they're data packets (tcp_data_from_tap() has its own loop through packets). Correct this by adding an index parameter to tcp_tap_handler() and altering the loops in tap.c to step through the pool properly. Link: https://bugs.passt.top/show_bug.cgi?id=68 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
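The difference between treating the pool as a queue and as an array can be sketched like this. The types and functions are hypothetical (struct demo_pool, demo_handler(), demo_process_all()), and the convention that the handler returns how many packets it consumed is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a packet pool: an array plus a count. Reading
 * an entry does not remove it, unlike a queue. */
struct demo_pool {
	size_t count;
	int pkt[8];
};

/* Handle the packet at index idx and return how many packets were consumed
 * (always one here; a data handler might consume several at once). */
size_t demo_handler(const struct demo_pool *p, size_t idx, int *sum)
{
	assert(idx < p->count);

	*sum += p->pkt[idx];
	return 1;
}

/* Step through the pool by index, advancing by what the handler consumed */
int demo_process_all(const struct demo_pool *p)
{
	int sum = 0;
	size_t i;

	for (i = 0; i < p->count; )
		i += demo_handler(p, i, &sum);

	return sum;
}
```

Without the index parameter, every call would look at entry 0 again, which is the behaviour the commit message describes.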
*	tap: fix uses of l3_len in tap4_handler()	Stas Sergeev	2023-09-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	l3_len was calculated from the ethernet frame size, and it was assumed to be equal to the length stored in an IP packet. But if the ethernet frame is padded, then l3_len calculated that way can only be used as a bound check to validate the length stored in an IP header. It should not be used for calculating the l4_len. This patch makes sure the small padded ethernet frames are properly processed, by trusting the length stored in an IP header. Link: https://bugs.passt.top/show_bug.cgi?id=73 Signed-off-by: Stas Sergeev <stsp2@yandex.ru> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
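To illustrate the distinction, here is a small sketch that uses the frame-derived length only as an upper bound and takes the L4 length from the IPv4 header. It is not the code from tap4_handler(): l4_len_from_ip4() is a hypothetical helper, and it reads the header fields as raw bytes to stay self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the L4 length of an IPv4 packet, trusting the header's total
 * length field and using the frame-derived length only as a bound.
 * Returns -1 if the header is inconsistent with the frame. */
long l4_len_from_ip4(const uint8_t *l3, size_t frame_l3_len)
{
	size_t hlen, tot_len;

	assert(l3);

	if (frame_l3_len < 20)			/* minimum IPv4 header */
		return -1;

	hlen = (size_t)(l3[0] & 0x0f) * 4;	/* IHL, in 32-bit words */
	tot_len = ((size_t)l3[2] << 8) | l3[3];	/* total length, big-endian */

	if (hlen < 20 || tot_len < hlen || tot_len > frame_l3_len)
		return -1;

	return (long)(tot_len - hlen);
}
```

For a padded frame carrying 46 bytes after the Ethernet header and an IPv4 total length of 28, the result is 8 rather than the 26 that the frame size alone would suggest.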
*	tcp, udp: Don't pre-fill IPv4 destination address in headers	David Gibson	2023-08-22	1	-5/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because packets sent on the tap interface will always be going to the guest/namespace, we more-or-less know what address they'll be going to. So we pre-fill this destination address in our header buffers for IPv4. We can't do the same for IPv6 because we could need either the global or link-local address for the guest. In future we're going to want more flexibility for the destination address, so this pre-filling will get in the way. Change the flow so we always fill in the IPv4 destination address for each packet, rather than prefilling it from proto_update_l2_buf(). In fact for TCP we already redundantly filled the destination for each packet anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>