passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	flow: Common address information for target side	David Gibson	2024-07-19	1	-38/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Common address information for initiating side	David Gibson	2024-07-19	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides. For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, icmp, tcp: Clean up helpers for getting flow from index	David Gibson	2024-07-17	1	-6/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP (both regular and spliced) and ICMP both have macros to retrieve the relevant protcol specific flow structure from a flow index. In most cases what we actually want is to get the specific flow from a sidx. Replace those simple macros with a more precise inline, which also asserts that the flow is of the type we expect. While we're they're also add a pif_at_sidx() helper to get the interface of a specific flow & side, which is useful in some places. Finally, fix some minor style issues in the comments on some of the existing sidx related helpers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp, tcp: Tweak handling of no_udp and no_tcp flags	David Gibson	2024-07-17	1	-3/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We abort the UDP socket handler if the no_udp flag is set. But if UDP was disabled we should never have had a UDP socket to trigger the handler in the first place. If we somehow did, ignoring it here isn't really going to help because aborting without doing anything is likely to lead to an epoll loop. The same is the case for the TCP socket and timer handlers and the no_tcp flag. Change these checks on the flag to ASSERT()s. Similarly add ASSERT()s to several other entry points to the protocol specific code which should never be called if the protocol is disabled. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: handle shrunk window advertisements from guest	Jon Maloy	2024-07-15	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A bug in kernel TCP may lead to a deadlock where a zero window is sent from the guest peer, while it is unable to send out window updates even after socket reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisements from the peer can only be triggered by data packets arriving from this side. However, currently such packets are never sent, because the zero-window condition prevents this side from sending out any packets whatsoever to the peer. We notice that the above bug is triggered only after the peer has dropped one or more arriving packets because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that the implementation goes against the RFC-9293 recommendation that a previously advertised window never should shrink. RFC-9293 seems to permit that we can continue sending up to the right edge of the last advertised non-zero window in such situations, so that is what we do to resolve this situation. It turns out that this solution is extremely simple to implememt in the code: We just omit to save the advertised zero-window when we see that it has shrunk, i.e., if the acknowledged sequence number in the advertisement message is lower than that of the last data byte sent from our side. When that is the case, the following happens: - The 'retr' flag in tcp_data_from_tap() will be 'false', so no retransmission will occur at this occasion. - The data stream will soon reach the right edge of the previously advertised window. In fact, in all observed cases we have seen that it is already there when the zero-advertisement arrives. - At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set, unless they already have been, meaning that only the next timer expiration will open for data retransmission or transmission. - When that happens, the memory squeeze at the guest will normally have abated, and the data flow can resume. It should be noted that although this solves the problem we have at hand, it is a work-around, and not a genuine solution to the described kernel bug. Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor fix in commit title and commit reference in comment to workaround Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: leverage support of SO_PEEK_OFF socket option when available	Jon Maloy	2024-07-15	1	-3/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	>From linux-6.9.0 the kernel will contain commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option"). This new feature makes is possible to call recv_msg(MSG_PEEK) and make it start reading data from a given offset set by the SO_PEEK_OFF socket option. This way, we can avoid repeated reading of already read bytes of a received message, hence saving read cycles when forwarding TCP messages in the host->name space direction. In this commit, we add functionality to leverage this feature when available, while we fall back to the previous behavior when not. Measurements with iperf3 shows that throughput increases with 15-20 percent in the host->namespace direction when this feature is used. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Add flow_sidx_valid() helper	David Gibson	2024-07-05	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To implement the TCP hash table, we need an invalid (NULL-like) value for flow_sidx_t. We use FLOW_SIDX_NONE for that, but for defensiveness, we treat (usually) anything with an out of bounds flow index the same way. That's not always done consistently though. In flow_at_sidx() we open code a check on the flow index. In tcp_hash_probe() we instead compare against FLOW_SIDX_NONE, and in some other places we use the fact that flow_at_sidx() will return NULL in this case, even if we don't otherwise need the flow it returns. Clean this up a bit, by adding an explicit flow_sidx_valid() test function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: sock_l4() determine protocol from epoll type rather than the reverse	David Gibson	2024-07-05	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \|	sock_l4() creates a socket of the given IP protocol number, and adds it to the epoll state. Currently it determines the correct tag for the epoll data based on the protocol. However, we have some future cases where we might want different semantics, and therefore epoll types, for sockets of the same protocol. So, change sock_l4() to take the epoll type as an explicit parameter, and determine the protocol from that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Replace strerror() calls	Stefano Brivio	2024-06-21	1	-16/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we have logging functions embedding perror() functionality, we can make _some_ calls more terse by using them. In many places, the strerror() calls are still more convenient because, for example, they are used in flow debugging functions, or because the return code variable of interest is not 'errno'. While at it, convert a few error messages from a scant perror style to proper failure descriptions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Don't rely on bind() to fail to decide that connection target is valid	Stefano Brivio	2024-06-19	1	-17/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit e1a2e2780c91 ("tcp: Check if connection is local or low RTT was seen before using large MSS") added a call to bind() before we issue a connect() to the target for an outbound connection. If bind() fails, but neither with EADDRNOTAVAIL, nor with EACCESS, we can conclude that the target address is a local (host) address, and we can use an unlimited MSS. While at it, according to the reasoning of that commit, if bind() succeeds, we would know right away that nobody is listening at that (local) address and port, and we don't even need to call connect(): we can just fail early and reset the connection attempt. But if non-local binds are enabled via net.ipv4.ip_nonlocal_bind or net.ipv6.ip_nonlocal_bind sysctl, binding to a non-local address will actually succeed, so we can't rely on it to fail in general. The visible issue with the existing behaviour is that we would reset any outbound connection to non-local addresses, if non-local binds are enabled. Keep the significant optimisation for local addresses along with the bind() call, but if it succeeds, don't draw any conclusion: close the socket, grab another one, and proceed normally. This will incur a small latency penalty if non-local binds are enabled (we'll likely fetch an existing socket from the pool but additionally call close()), or if the target is local but not bound: we'll need to call connect() and get a failure before relaying that failure back. Link: https://github.com/containers/podman/issues/23003 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: move buffers management functions to their own file	Laurent Vivier	2024-06-13	1	-548/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move all the TCP parts using internal buffers to tcp_buf.c and keep generic TCP management functions in tcp.c. Add tcp_internal.h to export needed functions from tcp.c and tcp_buf.h from tcp_buf.c With this change we can use existing TCP functions with a different kind of memory storage as for instance the shared memory provided by the guest via vhost-user. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: extract buffer management from tcp_send_flag()	Laurent Vivier	2024-06-13	1	-24/+54
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit isolates the internal data structure management used for storing data (e.g., tcp4_l2_flags_iov[], tcp6_l2_flags_iov[], tcp4_flags_ip[], tcp4_flags[], ...) from the tcp_send_flag() function. The extracted functionality is relocated to a new function named tcp_fill_flag_header(). tcp_fill_flag_header() is now a generic function that accepts parameters such as struct tcphdr and a data pointer. tcp_send_flag() utilizes this parameter to pass memory pointers from tcp4_l2_flags_iov[] and tcp6_l2_flags_iov[]. This separation sets the stage for utilizing tcp_prepare_flags() to set the memory provided by the guest via vhost-user in future developments. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Suppress constParameterCallback errors	David Gibson	2024-06-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several functions which are used as callbacks for NS_CALL() which only read their void * parameter, they don't write it. The constParameterCallback warning in cppcheck 2.14.1 complains that this parameter could be const void , also pointing out that that would require casting the function pointer when used as a callback. Casting the function pointers seems substantially uglier than using a non-const void as the parameter, especially since in each case we cast the void * to a const pointer of specific type immediately. So, suppress these errors. I think it would make logical sense to suppress this globally, but that would cause unmatchedSuppression errors on earlier cppcheck versions. So, instead individually suppress it, along with unmatchedSuppression in the relevant places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, flow: Fix some error paths which didn't clean up flows properly	David Gibson	2024-06-07	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Flow table entries need to be fully initialised before returning to the main epoll loop. Commit 0060acd1 ("flow: Clarify and enforce flow state transitions") now enforces that: once a flow is allocated we must either cancel it, or activate it before returning to the main loop, or we will hit an ASSERT(). Some error paths in tcp_conn_from_tap() weren't correctly updated for this requirement - we can exit with a flow entry incompletely initialised. Correct that by cancelling the flows in those situations. I don't have enough information to be certain if this is the cause for podman bug 22925, but it plausibly could be. Fixes: 0060acd11b19 ("flow: Clarify and enforce flow state transitions") Link: https://github.com/containers/podman/issues/22925 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	clang-tidy: Enable the bugprone-macro-parentheses check	David Gibson	2024-06-07	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We globally disabled this, with a justification lumped together with several checks about braces. They don't really go together, the others are essentially a stylistic choice which doesn't match our style. Omitting brackets on macro parameters can lead to real and hard to track down bugs if an expression is ever passed to the macro instead of a plain identifier. We've only gotten away with the macros which trigger the warning, because of other conventions its been unlikely to invoke them with anything other than a simple identifier. Fix the macros, and enable the warning for the future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Make pointer const in tcp_revert_seq	David Gibson	2024-06-07	1	-1/+1
\| \| \| \| \| \| \| \| \|	The th pointer could be const, which causes a cppcheck warning on at least some cppcheck versions (e.g. Cppcheck 2.13.0 in Fedora 40). Fixes: e84a01e94c9f ("tcp: move seq_to_tap update to when frame is queued") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: move seq_to_tap update to when frame is queued	Jon Maloy	2024-06-05	1	-22/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	commit a469fc393fa1 ("tcp, tap: Don't increase tap-side sequence counter for dropped frames") delayed update of conn->seq_to_tap until the moment the corresponding frame has been successfully pushed out. This has the advantage that we immediately can make a new attempt to transmit a frame after a failed trasnmit, rather than waiting for the peer to later discover a gap and trigger the fast retransmit mechanism to solve the problem. This approach has turned out to cause a problem with spurious sequence number updates during peer-initiated retransmits, and we have realized it may not be the best way to solve the above issue. We now restore the previous method, by updating the said field at the moment a frame is added to the outqueue. To retain the advantage of having a quick re-attempt based on local failure detection, we now scan through the part of the outqueue that had do be dropped, and restore the sequence counter for each affected connection to the most appropriate value. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove interim 'tapside' field from connection	David Gibson	2024-05-22	1	-6/+6
\| \| \| \| \| \| \| \| \| \|	We recently introduced this field to keep track of which side of a TCP flow is the guest/tap facing one. Now that we generically record which pif each side of each flow is connected to, we can easily derive that, and no longer need to keep track of it explicitly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Record the pifs for each side of each flow	David Gibson	2024-05-22	1	-1/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we have no generic information flows apart from the type and state, everything else is specific to the flow type. Start introducing generic flow information by recording the pifs which the flow connects. To keep track of what information is valid, introduce new flow states: INI for when the initiating side information is complete, and TGT for when both sides information is complete, but we haven't chosen the flow type yet. For now, these states don't do an awful lot, but they'll become more important as we add more generic information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make side 0 always be the initiating side	David Gibson	2024-05-22	1	-11/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Each flow in the flow table has two sides, 0 and 1, representing the two interfaces between which passt/pasta will forward data for that flow. Which side is which is currently up to the protocol specific code: TCP uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except for spliced connections where it uses 0 for the initiating side and 1 for the target side. ICMP also uses 0 for the host/"sock" side and 1 for the guest/"tap" side, but in its case the latter is always also the initiating side. Make this generically consistent by always using side 0 for the initiating side and 1 for the target side. This doesn't simplify a lot for now, and arguably makes TCP slightly more complex, since we add an extra field to the connection structure to record which is the guest facing side. This is an interim change, which we'll be able to remove later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>q Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Clarify and enforce flow state transitions	David Gibson	2024-05-22	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Flows move over several different states in their lifetime. The rules for these are documented in comments, but they're pretty complex and a number of the transitions are implicit, which makes this pretty fragile and error prone. Change the code to explicitly track the states in a field. Make all transitions explicit and logged. To the extent that it's practical in C, enforce what can and can't be done in various states with ASSERT()s. While we're at it, tweak the docs to clarify the restrictions on each state a bit. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Better helpers for using inany and specific family addrs together	David Gibson	2024-05-22	1	-18/+11
\| \| \| \| \| \| \| \| \|	This adds some extra inany helpers for comparing an inany address to addresses of a specific family (including special addresses), and building an inany from an IPv4 address (either statically or at runtime). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Properly type callbacks to protocol specific handlers	David Gibson	2024-05-22	1	-6/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The flow dispatches deferred and timer handling for flows centrally, but needs to call into protocol specific code for the handling of individual flows. Currently this passes a general union flow *. It makes more sense to pass the specific relevant flow type structure. That brings the check on the flow type adjacent to casting to the union variant which it tags. Arguably, this is a slight abstraction violation since it involves the generic flow code using protocol specific types. It's already calling into protocol specific functions, so I don't think this really makes any difference. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util, tcp: Add helper to display socket addresses	David Gibson	2024-05-22	1	-14/+9
\| \| \| \| \| \| \| \| \| \| \|	When reporting errors, we sometimes want to show a relevant socket address. Doing so by extracting the various relevant fields can be pretty awkward, so introduce a sockaddr_ntop() helper to make it simpler. For now we just have one user in tcp.c, but I have further upcoming patches which can make use of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Update tap specific header too in tcp_fill_headers[46]()	David Gibson	2024-05-02	1	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_fill_headers[46]() fill most of the headers, but the tap specific header (the frame length for qemu sockets) is filled in afterwards. Filling this as well: * Removes a little redundancy between the tcp_send_flag() and tcp_data_to_tap() path * Makes calculation of the correct length a little easier * Removes the now misleadingly named 'vnet_len' variable in tcp_send_flag() Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	iov: Helper macro to construct iovs covering existing variables or fields	David Gibson	2024-05-02	1	-15/+9
\| \| \| \| \| \| \| \| \| \| \|	Laurent's recent changes mean we use IO vectors much more heavily in the TCP code. In many of those cases, and few others around the code base, individual iovs of these vectors are constructed to exactly cover existing variables or fields. We can make initializing such iovs shorter and clearer with a macro for the purpose. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap, tcp: (Re-)abstract TAP specific header handling	David Gibson	2024-05-02	1	-25/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recent changes to the TCP code (reworking of the buffer handling) have meant that it now (again) deals explicitly with the MODE_PASST specific vnet_len field, instead of using the (partial) abstractions provided by the tap layer. The abstractions we had don't work for the new TCP structure, so make some new ones that do: tap_hdr_iov() which constructs an iovec suitable for containing (just) the TAP specific header and tap_hdr_update() which updates it as necessary per-packet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Simplify packet length calculation when preparing headers	David Gibson	2024-05-02	1	-16/+10
\| \| \| \| \| \| \| \| \|	tcp_fill_headers[46]() compute the L3 packet length from the L4 packet length, then their caller tcp_l2_buf_fill_headers() converts it back to the L4 packet length. We can just use the L4 length throughout. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>eewwee Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise variable names for various packet lengths	David Gibson	2024-05-02	1	-51/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At various points we need to track the lengths of a packet including or excluding various different sets of headers. We don't always use the same variable names for doing so. Worse in some places we use the same name for different things: e.g. tcp_fill_headers[46]() use ip_len for the length including the IP headers, but then tcp_send_flag() which calls it uses it to mean the IP payload length only. To improve clarity, standardise on these names: dlen: L4 protocol payload length ("data length") l4len: plen + length of L4 protocol header l3len: l4len + length of IPv4/IPv6 header l2len: l3len + length of L2 (ethernet) header Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Make csum_ip4_header() take a host endian length	David Gibson	2024-05-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	csum_ip4_header() takes the packet length as a network endian value. In general it's very error-prone to pass non-native-endian values as a raw integer. It's particularly bad here because this differs from other checksum functions (e.g. proto_ipv4_header_psum()) which take host native lengths. It turns out all the callers have easy access to the native endian value, so switch it to use host order like everything else. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Replace TCP buffer structure by an iovec array	Laurent Vivier	2024-04-19	1	-268/+261
\| \| \| \| \| \| \| \| \| \| \| \| \|	To be able to provide pointers to TCP headers and IP headers without worrying about alignment in the structure, split the structure into several arrays and point to each part of the frame using an iovec array. Using iovec also allows us to simply ignore the first entry when the vnet length header is not needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unconditionally force ACK for all !SYN, !RST packets2024_03_26.4988e2b	David Gibson	2024-03-26	1	-5/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we set ACK on flags packets only when the acknowledged byte pointer has advanced, or we hadn't previously set a window. This means in particular that we can send a window update with no ACK flag, which doesn't appear to be correct. RFC 9293 requires a receiver to ignore such a packet [0], and indeed it appears that every non-SYN, non-RST packet should have the ACK flag. The reason for the existing logic, rather than always forcing an ACK seems to be to avoid having the packet mistaken as a duplicate ACK which might trigger a fast retransmit. However, earlier tests in the function mean we won't reach here if we don't have either an advance in the ack pointer - which will already set the ACK flag, or a window update - which shouldn't trigger a fast retransmit. [0] https://www.ietf.org/rfc/rfc9293.html#section-3.10.7.4-2.5.2.1 Link: https://github.com/containers/podman/issues/22146 Link: https://bugs.passt.top/show_bug.cgi?id=84 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Never automatically add the ACK flag to RST packets	David Gibson	2024-03-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_send_flag() will sometimes force on the ACK flag for all !SYN packets. This doesn't make sense for RST packets, where plain RST and RST+ACK have somewhat different meanings. AIUI, RST+ACK indicates an abrupt end to a connection, but acknowledges data already sent. Plain RST indicates an abort, when one end receives a packet that doesn't seem to make sense in the context of what it knows about the connection. All of the cases where we send RSTs are the second, so we don't want an ACK flag, but we currently could add one anyway. Change that, so we won't add an ACK to an RST unless the caller explicitly requests it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Rearrange logic for setting ACK flag in tcp_send_flag()	David Gibson	2024-03-26	1	-5/+4
\| \| \| \| \| \| \| \| \| \|	We have different paths for controlling the ACK flag for the SYN and !SYN paths. This amounts to sometimes forcing on the ACK flag in the !SYN path regardless of options. We can rearrange things to explicitly be that which will make things neater for some future changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Split handling of DUP_ACK from ACK	David Gibson	2024-03-26	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The DUP_ACK flag to tcp_send_flag() has two effects: first it forces the setting of the ACK flag in the packet, even if we otherwise wouldn't. Secondly, it causes a duplicate of the flags packet to be sent immediately after the first. Setting the ACK flag to tcp_send_flag() also has the first effect, so instead of having DUP_ACK also do that, pass both flags when we need both operations. This slightly simplifies the logic of tcp_send_flag() in a way that makes some future changes easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Rename tap_iov_{base,len}	David Gibson	2024-03-14	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \|	These two functions are typically used to calculate values to go into the iov_base and iov_len fields of a struct iovec. They don't have to be used for that, though. Rename them in terms of what they actually do: calculate the base address and total length of the complete frame, including both L2 and tap specific headers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Extend tap_send_frames() to allow multi-buffer frames	David Gibson	2024-03-14	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	tap_send_frames() takes a vector of buffers and requires exactly one frame per buffer. We have future plans where we want to have multiple buffers per frame in some circumstances, so extend tap_send_frames() to take the number of buffers per frame as a parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Improve comment to rembufs calculation] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Introduce tcp_fill_headers4()/tcp_fill_headers6()	Laurent Vivier	2024-03-06	1	-50/+106
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Replace the macro SET_TCP_HEADER_COMMON_V4_V6() by a new function tcp_fill_header(). Move IPv4 and IPv6 code from tcp_l2_buf_fill_headers() to tcp_fill_headers4() and tcp_fill_headers6() Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-ID: <20240303135114.1023026-10-lvivier@redhat.com> [dwg: Correct commit message with new function names] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: make tap_update_mac() generic	Laurent Vivier	2024-03-06	1	-4/+4
\| \| \| \| \| \| \| \| \| \|	Use ethhdr rather than tap_hdr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-9-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: introduce functions to compute the header part checksum for TCP/UDP	Laurent Vivier	2024-03-06	1	-27/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The TCP and UDP checksums are computed using the data in the TCP/UDP payload but also some informations in the IP header (protocol, length, source and destination addresses). We add two functions, proto_ipv4_header_psum() and proto_ipv6_header_psum(), to compute the checksum of the IP header part. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-ID: <20240303135114.1023026-8-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: use csum_ip4_header() in udp.c and tcp.c	Laurent Vivier	2024-03-06	1	-21/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We can find the same function to compute the IPv4 header checksum in tcp.c, udp.c and tap.c Use the function defined for tap.c, csum_ip4_header(), but with the code used in tcp.c and udp.c as it doesn't need a fully initialiazed IPv4 header, only protocol, tot_len, saddr and daddr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-7-lvivier@redhat.com> [dwg: Fix weird cppcheck regression; it appears to be a problem in pre-existing code, but somehow this patch is exposing it] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: move IP stuff from util.[ch] to ip.[ch]	Laurent Vivier	2024-03-06	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Introduce ip.[ch] file to encapsulate IP protocol handling functions and structures. Modify various files to include the new header ip.h when it's needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-5-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Rename port_fwd.[ch] and their contents	David Gibson	2024-02-29	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	Currently port_fwd.[ch] contains helpers related to port forwarding, particular automatic port forwarding. We're planning to allow much more flexible sorts of forwarding, including both port translation and NAT based on the flow table. This will subsume the existing port forwarding logic, so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names within. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Validate TCP endpoint addresses	David Gibson	2024-02-29	1	-7/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP connections should typically not have wildcard addresses (0.0.0.0 or ::) nor a zero port number for either endpoint. It's not entirely clear (at least to me) if it's strictly against the RFCs to do so, but at any rate the socket interfaces often treat those values specially[1], so it's not really possible to manipulate such connections. Likewise they should not have broadcast or multicast addresses for either endpoint. However, nothing prevents a guest from creating a SYN packet with such values, and it's not entirely clear what the effect on passt would be. To ensure sane behaviour, explicitly check for this case and drop such packets, logging a debug warning (we don't want a higher level, because that would allow a guest to spam the logs). We never expect such an address on an accept()ed socket either, but just in case, check for it as well. [1] Depending on context as "unknown", "match any" or "kernel, pick something for me" Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler()	David Gibson	2024-02-29	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_listen_handler() uses the epoll reference for the listening socket it handles, and also passes on one variant of it to tcp_tap_conn_from_sock() and tcp_splice_conn_from_sock(). The latter two functions only need a couple of specific fields from the reference. Pass those specific values instead of the whole reference, which localises the handling of the listening (as opposed to accepted) socket and its reference entirely within tcp_listen_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Improve logic deciding when to splice	David Gibson	2024-02-29	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This makes several tweaks to improve the logic which decides whether we're able to use the splice method for a new connection. * Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we check for pasta mode within it, better localising the checks. * Previously if we got a connection from a non-loopback address we'd always fall back to the "tap" path, even if the connection was on a socket in the namespace. If we did get a non-loopback address on a namespace socket, something has gone wrong and the "tap" path certainly won't be able to handle it. Report the error and close, rather than passing it along to tap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Clarify flow entry life cycle, introduce uniform logging	David Gibson	2024-02-29	1	-8/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Our allocation scheme for flow entries means there are some non-obvious constraints on when what things can be done with an entry. Add a big doc comment explaining the life cycle. In addition, make a FLOW_START() macro to mark one of the important transitions. This encourages correct usage, by making it natural to only access the flow type specific structure after calling it. It also logs that a new flow has been created, which is useful for debugging. We also add logging when a flow's lifecycle ends. This doesn't need a new helper, because it can only happen either from flow_alloc_cancel() or from the flow deferred handler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Don't precompute port remappings in epoll references	David Gibson	2024-02-29	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The epoll references for both TCP listening sockets and UDP sockets includes a port number. This gives the destination port that traffic to that socket will be sent to on the other side. That will usually be the same as the socket's bound port, but might not if the -t, -u, -T or -U options are given with different original and forwarded port numbers. As we move towards a more flexible forwarding model for passt, it's going to become possible for that destination port to vary depending on more things (for example the source or destination address). So, it will no longer make sense to have a fixed value for a listening socket. Change to simpler semantics where this field in the reference gives the bound port of the socket. We apply the translations to the correct destination port later on, when we're actually forwarding. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Introduce union sockaddr_inany	David Gibson	2024-02-29	1	-6/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are a number of places where we want to handle either a sockaddr_in or a sockaddr_in6. In some of those we use a void *, which works ok and matches some standard library interfaces, but doesn't give a signature level hint that we're dealing with only sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of socket address. Other places we use a sockaddr_storage, which also works, but has the same problem in addition to allocating more on the stack than we need to. Introduce union sockaddr_inany to explictly handle this case: it has variants for sockaddr_in and sockaddr_in6. Use it in a number of places where it's easy to do so. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Provide more conveniently typed constants for special addresses	David Gibson	2024-02-29	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Our inany_addr type is used in some places to represent either IPv4 or IPv6 addresses, and we plan to use it more widely. We don't yet provide constants of this type for special addresses (loopback and "any"). Add some of these, both the IPv4 and IPv6 variants of those addresses, but typed as union inany_addr. To avoid actually adding more things to .data we can use some macros and casting to overlay the IPv6 versions of these with the standard library's in6addr_loopback and in6addr_any. For the IPv4 versions we need to create new constant globals. For complicated historical reasons, the standard library doesn't provide constants for IPv4 loopback and any addresses as struct in_addr. It just has macros of type in_addr_t == uint32_t, which has some gotchas w.r.t. endianness. We can use some more macros to address this lack, using macros to effectively create these IPv4 constants as pieces of the inany constants above. We use this last to avoid some awkward temporary variables just used to get an address of an IPv4 loopback address. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>