passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	tcp: Update tap specific header too in tcp_fill_headers[46]()	David Gibson	4 days	1	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_fill_headers[46]() fill most of the headers, but the tap specific header (the frame length for qemu sockets) is filled in afterwards. Filling this as well: * Removes a little redundancy between the tcp_send_flag() and tcp_data_to_tap() path * Makes calculation of the correct length a little easier * Removes the now misleadingly named 'vnet_len' variable in tcp_send_flag() Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	iov: Helper macro to construct iovs covering existing variables or fields	David Gibson	4 days	1	-15/+9
\| \| \| \| \| \| \| \| \| \| \|	Laurent's recent changes mean we use IO vectors much more heavily in the TCP code. In many of those cases, and few others around the code base, individual iovs of these vectors are constructed to exactly cover existing variables or fields. We can make initializing such iovs shorter and clearer with a macro for the purpose. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap, tcp: (Re-)abstract TAP specific header handling	David Gibson	4 days	1	-25/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recent changes to the TCP code (reworking of the buffer handling) have meant that it now (again) deals explicitly with the MODE_PASST specific vnet_len field, instead of using the (partial) abstractions provided by the tap layer. The abstractions we had don't work for the new TCP structure, so make some new ones that do: tap_hdr_iov() which constructs an iovec suitable for containing (just) the TAP specific header and tap_hdr_update() which updates it as necessary per-packet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Simplify packet length calculation when preparing headers	David Gibson	4 days	1	-16/+10
\| \| \| \| \| \| \| \| \|	tcp_fill_headers[46]() compute the L3 packet length from the L4 packet length, then their caller tcp_l2_buf_fill_headers() converts it back to the L4 packet length. We can just use the L4 length throughout. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>eewwee Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise variable names for various packet lengths	David Gibson	4 days	1	-51/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At various points we need to track the lengths of a packet including or excluding various different sets of headers. We don't always use the same variable names for doing so. Worse in some places we use the same name for different things: e.g. tcp_fill_headers[46]() use ip_len for the length including the IP headers, but then tcp_send_flag() which calls it uses it to mean the IP payload length only. To improve clarity, standardise on these names: dlen: L4 protocol payload length ("data length") l4len: plen + length of L4 protocol header l3len: l4len + length of IPv4/IPv6 header l2len: l3len + length of L2 (ethernet) header Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Make csum_ip4_header() take a host endian length	David Gibson	4 days	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	csum_ip4_header() takes the packet length as a network endian value. In general it's very error-prone to pass non-native-endian values as a raw integer. It's particularly bad here because this differs from other checksum functions (e.g. proto_ipv4_header_psum()) which take host native lengths. It turns out all the callers have easy access to the native endian value, so switch it to use host order like everything else. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Replace TCP buffer structure by an iovec array	Laurent Vivier	2024-04-19	1	-268/+261
\| \| \| \| \| \| \| \| \| \| \| \| \|	To be able to provide pointers to TCP headers and IP headers without worrying about alignment in the structure, split the structure into several arrays and point to each part of the frame using an iovec array. Using iovec also allows us to simply ignore the first entry when the vnet length header is not needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unconditionally force ACK for all !SYN, !RST packets2024_03_26.4988e2b	David Gibson	2024-03-26	1	-5/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we set ACK on flags packets only when the acknowledged byte pointer has advanced, or we hadn't previously set a window. This means in particular that we can send a window update with no ACK flag, which doesn't appear to be correct. RFC 9293 requires a receiver to ignore such a packet [0], and indeed it appears that every non-SYN, non-RST packet should have the ACK flag. The reason for the existing logic, rather than always forcing an ACK seems to be to avoid having the packet mistaken as a duplicate ACK which might trigger a fast retransmit. However, earlier tests in the function mean we won't reach here if we don't have either an advance in the ack pointer - which will already set the ACK flag, or a window update - which shouldn't trigger a fast retransmit. [0] https://www.ietf.org/rfc/rfc9293.html#section-3.10.7.4-2.5.2.1 Link: https://github.com/containers/podman/issues/22146 Link: https://bugs.passt.top/show_bug.cgi?id=84 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Never automatically add the ACK flag to RST packets	David Gibson	2024-03-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_send_flag() will sometimes force on the ACK flag for all !SYN packets. This doesn't make sense for RST packets, where plain RST and RST+ACK have somewhat different meanings. AIUI, RST+ACK indicates an abrupt end to a connection, but acknowledges data already sent. Plain RST indicates an abort, when one end receives a packet that doesn't seem to make sense in the context of what it knows about the connection. All of the cases where we send RSTs are the second, so we don't want an ACK flag, but we currently could add one anyway. Change that, so we won't add an ACK to an RST unless the caller explicitly requests it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Rearrange logic for setting ACK flag in tcp_send_flag()	David Gibson	2024-03-26	1	-5/+4
\| \| \| \| \| \| \| \| \| \|	We have different paths for controlling the ACK flag for the SYN and !SYN paths. This amounts to sometimes forcing on the ACK flag in the !SYN path regardless of options. We can rearrange things to explicitly be that which will make things neater for some future changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Split handling of DUP_ACK from ACK	David Gibson	2024-03-26	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The DUP_ACK flag to tcp_send_flag() has two effects: first it forces the setting of the ACK flag in the packet, even if we otherwise wouldn't. Secondly, it causes a duplicate of the flags packet to be sent immediately after the first. Setting the ACK flag to tcp_send_flag() also has the first effect, so instead of having DUP_ACK also do that, pass both flags when we need both operations. This slightly simplifies the logic of tcp_send_flag() in a way that makes some future changes easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Rename tap_iov_{base,len}	David Gibson	2024-03-14	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \|	These two functions are typically used to calculate values to go into the iov_base and iov_len fields of a struct iovec. They don't have to be used for that, though. Rename them in terms of what they actually do: calculate the base address and total length of the complete frame, including both L2 and tap specific headers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Extend tap_send_frames() to allow multi-buffer frames	David Gibson	2024-03-14	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	tap_send_frames() takes a vector of buffers and requires exactly one frame per buffer. We have future plans where we want to have multiple buffers per frame in some circumstances, so extend tap_send_frames() to take the number of buffers per frame as a parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Improve comment to rembufs calculation] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Introduce tcp_fill_headers4()/tcp_fill_headers6()	Laurent Vivier	2024-03-06	1	-50/+106
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Replace the macro SET_TCP_HEADER_COMMON_V4_V6() by a new function tcp_fill_header(). Move IPv4 and IPv6 code from tcp_l2_buf_fill_headers() to tcp_fill_headers4() and tcp_fill_headers6() Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-ID: <20240303135114.1023026-10-lvivier@redhat.com> [dwg: Correct commit message with new function names] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: make tap_update_mac() generic	Laurent Vivier	2024-03-06	1	-4/+4
\| \| \| \| \| \| \| \| \| \|	Use ethhdr rather than tap_hdr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-9-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: introduce functions to compute the header part checksum for TCP/UDP	Laurent Vivier	2024-03-06	1	-27/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The TCP and UDP checksums are computed using the data in the TCP/UDP payload but also some informations in the IP header (protocol, length, source and destination addresses). We add two functions, proto_ipv4_header_psum() and proto_ipv6_header_psum(), to compute the checksum of the IP header part. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-ID: <20240303135114.1023026-8-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: use csum_ip4_header() in udp.c and tcp.c	Laurent Vivier	2024-03-06	1	-21/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We can find the same function to compute the IPv4 header checksum in tcp.c, udp.c and tap.c Use the function defined for tap.c, csum_ip4_header(), but with the code used in tcp.c and udp.c as it doesn't need a fully initialiazed IPv4 header, only protocol, tot_len, saddr and daddr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-7-lvivier@redhat.com> [dwg: Fix weird cppcheck regression; it appears to be a problem in pre-existing code, but somehow this patch is exposing it] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: move IP stuff from util.[ch] to ip.[ch]	Laurent Vivier	2024-03-06	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Introduce ip.[ch] file to encapsulate IP protocol handling functions and structures. Modify various files to include the new header ip.h when it's needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-5-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Rename port_fwd.[ch] and their contents	David Gibson	2024-02-29	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	Currently port_fwd.[ch] contains helpers related to port forwarding, particular automatic port forwarding. We're planning to allow much more flexible sorts of forwarding, including both port translation and NAT based on the flow table. This will subsume the existing port forwarding logic, so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names within. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Validate TCP endpoint addresses	David Gibson	2024-02-29	1	-7/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP connections should typically not have wildcard addresses (0.0.0.0 or ::) nor a zero port number for either endpoint. It's not entirely clear (at least to me) if it's strictly against the RFCs to do so, but at any rate the socket interfaces often treat those values specially[1], so it's not really possible to manipulate such connections. Likewise they should not have broadcast or multicast addresses for either endpoint. However, nothing prevents a guest from creating a SYN packet with such values, and it's not entirely clear what the effect on passt would be. To ensure sane behaviour, explicitly check for this case and drop such packets, logging a debug warning (we don't want a higher level, because that would allow a guest to spam the logs). We never expect such an address on an accept()ed socket either, but just in case, check for it as well. [1] Depending on context as "unknown", "match any" or "kernel, pick something for me" Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler()	David Gibson	2024-02-29	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_listen_handler() uses the epoll reference for the listening socket it handles, and also passes on one variant of it to tcp_tap_conn_from_sock() and tcp_splice_conn_from_sock(). The latter two functions only need a couple of specific fields from the reference. Pass those specific values instead of the whole reference, which localises the handling of the listening (as opposed to accepted) socket and its reference entirely within tcp_listen_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Improve logic deciding when to splice	David Gibson	2024-02-29	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This makes several tweaks to improve the logic which decides whether we're able to use the splice method for a new connection. * Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we check for pasta mode within it, better localising the checks. * Previously if we got a connection from a non-loopback address we'd always fall back to the "tap" path, even if the connection was on a socket in the namespace. If we did get a non-loopback address on a namespace socket, something has gone wrong and the "tap" path certainly won't be able to handle it. Report the error and close, rather than passing it along to tap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Clarify flow entry life cycle, introduce uniform logging	David Gibson	2024-02-29	1	-8/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Our allocation scheme for flow entries means there are some non-obvious constraints on when what things can be done with an entry. Add a big doc comment explaining the life cycle. In addition, make a FLOW_START() macro to mark one of the important transitions. This encourages correct usage, by making it natural to only access the flow type specific structure after calling it. It also logs that a new flow has been created, which is useful for debugging. We also add logging when a flow's lifecycle ends. This doesn't need a new helper, because it can only happen either from flow_alloc_cancel() or from the flow deferred handler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Don't precompute port remappings in epoll references	David Gibson	2024-02-29	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The epoll references for both TCP listening sockets and UDP sockets includes a port number. This gives the destination port that traffic to that socket will be sent to on the other side. That will usually be the same as the socket's bound port, but might not if the -t, -u, -T or -U options are given with different original and forwarded port numbers. As we move towards a more flexible forwarding model for passt, it's going to become possible for that destination port to vary depending on more things (for example the source or destination address). So, it will no longer make sense to have a fixed value for a listening socket. Change to simpler semantics where this field in the reference gives the bound port of the socket. We apply the translations to the correct destination port later on, when we're actually forwarding. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Introduce union sockaddr_inany	David Gibson	2024-02-29	1	-6/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are a number of places where we want to handle either a sockaddr_in or a sockaddr_in6. In some of those we use a void *, which works ok and matches some standard library interfaces, but doesn't give a signature level hint that we're dealing with only sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of socket address. Other places we use a sockaddr_storage, which also works, but has the same problem in addition to allocating more on the stack than we need to. Introduce union sockaddr_inany to explictly handle this case: it has variants for sockaddr_in and sockaddr_in6. Use it in a number of places where it's easy to do so. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Provide more conveniently typed constants for special addresses	David Gibson	2024-02-29	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Our inany_addr type is used in some places to represent either IPv4 or IPv6 addresses, and we plan to use it more widely. We don't yet provide constants of this type for special addresses (loopback and "any"). Add some of these, both the IPv4 and IPv6 variants of those addresses, but typed as union inany_addr. To avoid actually adding more things to .data we can use some macros and casting to overlay the IPv6 versions of these with the standard library's in6addr_loopback and in6addr_any. For the IPv4 versions we need to create new constant globals. For complicated historical reasons, the standard library doesn't provide constants for IPv4 loopback and any addresses as struct in_addr. It just has macros of type in_addr_t == uint32_t, which has some gotchas w.r.t. endianness. We can use some more macros to address this lack, using macros to effectively create these IPv4 constants as pieces of the inany constants above. We use this last to avoid some awkward temporary variables just used to get an address of an IPv4 loopback address. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't store errnos in socket pool	David Gibson	2024-02-27	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If tcp_sock_refill_pool() gets an error opening new sockets, it stores the negative errno of that error in the socket pool. This isn't especially useful: * It's inconsistent with the initial state of the pool (all -1) * It's inconsistent with the state of an entry that was valid and was then consumed (also -1) * By the time we did anything with this error code, it's now far removed from the situation in which the error occurred, making it difficult to report usefully We now have error reporting closer to when failures happen on the refill paths, so just leave a pool slot we can't fill as -1. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Helpers for getting sockets from the pools	David Gibson	2024-02-27	1	-5/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We maintain pools of ready-to-connect sockets in both the original and (for pasta) guest namespace to reduce latency when starting new TCP connections. If we exhaust those pools we have to take a higher latency path to get a new socket. Currently we open-code that fallback in the places we need it. To improve clarity encapsulate that into helper functions. While we're at it, give those helpers clearer error reporting. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Issue warnings if unable to refill socket pool	David Gibson	2024-02-27	1	-6/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if tcp_sock_refill_pool() is unable to fill all the slots in the pool, it will silently exit. This might lead to a later attempt to get fds from the pool to fail at which point it will be harder to tell what originally went wrong. Instead add warnings if we're unable to refill any of the socket pools when requested. We have tcp_sock_refill_pool() return an error and report it in the callers, because those callers have more context allowing for a more useful message. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Stop on first error when refilling socket pools	David Gibson	2024-02-27	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if we get an error opening a new socket while refilling a socket pool, we carry on to the next slot and try again. This isn't very useful, since by far the most likely cause of an error is some sort of resource exhaustion. Trying again will probably just hit the same error, and maybe even make things worse. So, instead stop on the first error while refilling the pool, making do with however many sockets we managed to open before the error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't stop refilling socket pool if we find a filled entry	David Gibson	2024-02-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp_sock_refill_pool() stops as soon as it finds an entry in the pool with a valid fd. This appears to makes sense: we always use fds from the front of the pool, so if we find a filled one, the rest of the pool should be filled as well. However, that's not quite correct. If a previous refill hit errors trying to open new sockets, it could leave gaps between blocks of valid fds. We're going to add some changes that could make that more likely. So, for robustness, instead skip over the filled entry but still try to refill the rest of the array. We expect simply iterating over the pool to be of small cost compared to even a single system call, so this shouldn't have much impact. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use sa_family_t for address family variables	David Gibson	2024-02-27	1	-6/+6
\| \| \| \| \| \| \| \| \| \|	Sometimes we use sa_family_t for variables and parameters containing a socket address family, other times we use a plain int. Since sa_family_t is what's actually used in struct sockaddr and friends, standardise on that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Fix subtle bug in fast re-transmit path	David Gibson	2024-02-11	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a duplicate ack from the tap side triggers a fast re-transmit, we set both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of the duplicate ack. Setting seq_to_tap is correct: this is what triggers the retransmit from this point onwards. Setting seq_ack_from_tap is not correct, though. In most cases setting seq_ack_from_tap will be redundant but harmless: it will have already been updated to the same value by tcp_update_seqack_from_tap() a few lines above. However that call can be skipped if tcp_sock_consume() fails, which is rare but possible. In that case this update will cause problems. We use seq_ack_from_tap to track two logically distinct things: how much of the stream has been acked by the guest, and how much of the stream from the socket has been read and discarded (as opposed to MSG_PEEKed). We attempt to keep those values the same, because we discard data exactly when it is acked by the guest. However tcp_sock_consume() failing means we weren't able to disard the acked data. To handle that case, we skip the usual update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get one which supersedes it soon enough. Setting seq_ack_from_tap in the fast retransmit path, however, means we now really will have the read/discard point in the stream out of sync with seq_ack_from_tap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Avoid moving flow entries to compact table	David Gibson	2024-01-22	1	-23/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we always keep the flow table maximally compact: that is all the active entries are contiguous at the start of the table. Doing this sometimes requires moving an entry when one is freed. That's kind of fiddly, and potentially expensive: it requires updating the hash table for the new location, and depending on flow type, it may require EPOLL_CTL_MOD, system calls to update epoll tags with the new location too. Implement a new way of managing the flow table that doesn't ever move entries. It attempts to maintain some compactness by always using the first free slot for a new connection, and mitigates the effect of non compactness by cheaply skipping over contiguous blocks of free entries. See the "theory of operation" comment in flow.c for details. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b [sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid Coverity Scan false positive] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Enforce that freeing of closed flows must happen in deferred handlers	David Gibson	2024-01-22	1	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, flows are only evern finally freed (and the table compacted) from the deferred handlers. Some future ways we want to optimise managing the flow table will rely on this, so enforce it: rather than having the TCP code directly call flow_table_compact(), add a boolean return value to the per-flow deferred handlers. If true, this indicates that the flow code itself should free the flow. This forces all freeing of flows to occur during the flow code's scan of the table in flow_defer_handler() which opens possibilities for future optimisations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Abstract allocation of new flows with helper function	David Gibson	2024-01-22	1	-11/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp.c open codes the process of allocating a new flow from the flow table: twice, in fact, once for guest to host and once for host to guest connections. This duplication isn't ideal and will get worse as we add more protocols to the flow table. It also makes it harder to experiment with different ways of handling flow table allocation. Instead, introduce a function to allocate a new flow: flow_alloc(). In some cases we currently check if we're able to allocate, but delay the actual allocation. We now handle that slightly differently with a flow_alloc_cancel() function to back out a recent allocation. We have that separate from a flow_free() function, because future changes we have in mind will need to handle this case a little differently. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Move flow_count from context structure to a global	David Gibson	2024-01-22	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global. Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets	David Gibson	2024-01-22	1	-28/+8
\| \| \| \| \| \| \| \| \| \| \|	Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add handling for per-flow timers	David Gibson	2024-01-22	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the flow table so that it can run tcp_splice_timer() on each spliced connection. More generally, other flow types might want to run similar timers in future. We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc. However, this would need to scan the flow table, which we would have just done in flow_defer_handler(). We'd prefer to just scan the flow table once, dispatching both per-flow deferred events and per-flow timed events if necessary. So, extend flow_defer_handler() to do this. For now we use the same timer interval for all flow types (1s). We can make that more flexible in future if we need to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add flow-centric dispatch for deferred flow handling	David Gibson	2024-01-22	1	-17/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_defer_handler(), amongst other things, scans the flow table and does some processing for each TCP connection. When we add other protocols to the flow table, they're likely to want some similar scanning. It makes more sense for cache friendliness to perform a single scan of the flow table and dispatch to the protocol specific handlers, rather than having each protocol separately scan the table. To that end, add a new flow_defer_handler() handling all flow-linked deferred operations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Move per-type cleanup logic into per-type helpers	David Gibson	2024-01-22	1	-6/+7
\| \| \| \| \| \| \| \| \| \|	tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally on the connection being closed or closing. Move that logic into the "destroy" functions themselves, renaming them tcp_flow_defer() and tcp_splice_flow_defer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Remove redundant handling from tcp_timer()	David Gibson	2024-01-22	1	-13/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the connection table, expiring "tap" connections and calling tcp_splice_timer() for "splice" connections. tcp_splice_timer() expires spliced connections and then does some other processing. However, tcp_timer() is always called shortly after tcp_defer_handler() (from post_handler()), which also scans the flow table expiring both tap and spliced connections. So remove the redundant handling, and only do the extra tcp_splice_timer() work from tcp_timer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise on 'now' for current timestamp variables	David Gibson	2024-01-22	1	-3/+3
\| \| \| \| \| \| \| \| \|	In a number of places we pass around a struct timespec representing the (more or less) current time. Sometimes we call it 'now', and sometimes we call it 'ts'. Standardise on the more informative 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make flow_table.h #include the protocol specific headers it needs	David Gibson	2024-01-22	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	flow_table.h, the lower level flow header relies on having the struct definitions for every protocol specific flow type - so far that means tcp_conn.h. It doesn't include it itself, so tcp_conn.h must be included before flow_table.h. That's ok for now, but as we use the flow table for more things, flow_table.h will need the structs for all of them, which means the protocol specific .c files would need to include tcp_conn.h _and_ the equivalents for every other flow type before flow_table.h every time, which is weird. So, although we mostly lean towards the include style where .c files need to handle the include dependencies, in this case it makes more sense to have flow_table.h include all the protocol specific headers it needs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Make a bunch of pointer variables pointers to const	David Gibson	2024-01-16	1	-4/+4
\| \| \| \| \| \| \| \| \| \|	Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another warning for pointer variables which could be pointer to const but aren't. Use this to make a bunch of variables const pointers where they previously weren't for no particular reason. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: make tcp_sock_set_bufsize() static (again)	Laurent Vivier	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation") has exported tcp_sock_set_bufsize() to be able to use it in tcp_splice.c, but 6ccab72d9b40 has removed its use in tcp_splice.c, so we can set it static again. Fixes: 6ccab72d9b40 ("tcp: Improve handling of fallback if socket pool is empty on new splice") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use IN4ADDR_LOOPBACK_INIT more widely	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address without delving into its internals. However there are some places we don't use it, and explicitly look at the internal structure of struct in_addr, which we generally want to avoid. Use the define more widely to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Fix address type for tcp_sock_init_af()	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This takes a struct in_addr * (i.e. an IPv4 address), although it's explicitly supposed to handle IPv6 as well. Both its caller and sock_l4() which it calls use a void * for the address, which can be either an in_addr or an in6_addr. We get away with this, because we don't do anything with the pointer other than transfer it from the caller to sock_l4(), but it's misleading. And quite possibly technically UB, because C is like that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't account for hash table size in tcp_hash()	David Gibson	2023-12-27	1	-13/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp_hash() returns the hash bucket for a value, that is the hash modulo the size of the hash table. Usually it's a bit more flexible to have hash functions return a "raw" hash value and perform the modulus in the callers. That allows the same hash function to be used for multiple tables of different sizes, or to re-use the hash for other purposes. We don't do anything like that with tcp_hash() at present, but we have some plans to do so. Prepare for that by making tcp_hash() and tcp_conn_hash() return raw hash values. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Implement hash table with indices rather than pointers	David Gibson	2023-12-27	1	-11/+22
\| \| \| \| \| \| \| \| \| \| \| \| \|	We implement our hash table with pointers to the entry for each bucket (or NULL). However, the entries are always allocated within the flow table, meaning that a flow index will suffice, halving the size of the hash table. For TCP, just a flow index would be enough, but future uses will want to expand the hash table to cover indexing either side of a flow, so use a flow_sidx_t as the type for each hash bucket. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>