passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	tcp: Obtain guest address from flowside	David Gibson	2024-07-19	3	-33/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we always deliver inbound TCP packets to the guest's most recent observed IP address. This has the odd side effect that if the guest changes its IP address with active TCP connections we might deliver packets from old connections to the new address. That won't work; it will probably result in an RST from the guest. Worse, if the guest added a new address but also retains the old one, then we could break those old connections by redirecting them to the new address. Now that we maintain flowside information, we have a record of the correct guest side address and can just use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, flow: Remove redundant information, repack connection structures	David Gibson	2024-07-19	3	-51/+47
\| \| \| \| \| \| \| \| \| \| \|	Some information we explicitly store in the TCP connection is now duplicated in the common flow structure. Access it from there instead, and remove it from the TCP specific structure. With that done we can reorder both the "tap" and "splice" TCP structures a bit to get better packing for the new combined flow table entries. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Common address information for target side	David Gibson	2024-07-19	8	-78/+153
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Common address information for initiating side	David Gibson	2024-07-19	6	-11/+127
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides. For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	doc: Extend zero-recv test with methods using msghdr	David Gibson	2024-07-17	1	-8/+52
\| \| \| \| \| \| \| \| \| \| \|	This test program verifies that we can receive and discard datagrams by using recv() with a NULL buffer and zero-length. Extend it to verify it also works using recvmsg() and either an iov with a zero-length NULL buffer or an iov that itself is NULL and zero-length. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Fixed printf() message in main of recv-zero.c] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	doc: Test behaviour of closing duplicate UDP sockets	David Gibson	2024-07-17	3	-2/+108
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	To simplify lifetime management of "listening" UDP sockets, UDP flow support needs to duplicate existing bound sockets. Those duplicates will be close()d when their corresponding flow expires, but we expect the original to still receive datagrams as always. That is, we expect the close() on the duplicate to remove the duplicated fd, but not to close the underlying UDP socket. Add a test program to doc/platform-requirements to verify this requirement. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Use parameterised macros for per-side event/flag bits	David Gibson	2024-07-17	2	-41/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Both the events and flags fields in tcp_splice_conn have several bits which are per-side, e.g. OUT_WAIT_0 for side 0 and OUT_WAIT_1 for side 1. This necessitates some rather awkward ternary expressions when we need to get the relevant bit for a particular side. Simplify this by using a parameterised macro for the bit values. This needs a ternary expression inside the macros, but makes the places we use it substantially clearer. That simplification in turn allows us to use a loop across each side to implement several things which are currently open coded to do equivalent things for each side in turn. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Introduce flow_foreach_sidei() macro	David Gibson	2024-07-17	2	-3/+9
\| \| \| \| \| \| \| \| \|	We have a handful of places where we use a loop to step through each side of a flow or flows, and we're probably going to have mroe in future. Introduce a macro to implement this loop for convenience. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp_splice: Prefer 'sidei' for variables referring to side index	David Gibson	2024-07-17	3	-62/+63
\| \| \| \| \| \| \| \| \| \| \| \| \|	In various places we have variables named 'side' or similar which always have the value 0 or 1 (INISIDE or TGTSIDE). Given a flow, this refers to a specific side of it. Upcoming flow table work will make it more useful for "side" to refer to a specific side of a specific flow. To make things less confusing then, prefer the name term "side index" and name 'sidei' for variables with just the 0 or 1 value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Fixed minor detail in comment to struct flow_common] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, icmp, tcp: Clean up helpers for getting flow from index	David Gibson	2024-07-17	4	-17/+80
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP (both regular and spliced) and ICMP both have macros to retrieve the relevant protcol specific flow structure from a flow index. In most cases what we actually want is to get the specific flow from a sidx. Replace those simple macros with a more precise inline, which also asserts that the flow is of the type we expect. While we're they're also add a pif_at_sidx() helper to get the interface of a specific flow & side, which is useful in some places. Finally, fix some minor style issues in the comments on some of the existing sidx related helpers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Handle errors on UDP sockets	David Gibson	2024-07-17	3	-0/+92
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we ignore all events other than EPOLLIN on UDP sockets. This means that if we ever receive an EPOLLERR event, we'll enter an infinite loop on epoll, because we'll never do anything to clear the error. Luckily that doesn't seem to have happened in practice, but it's certainly fragile. Furthermore changes in how we handle UDP sockets with the flow table mean we will start receiving error events. Add handling of EPOLLERR events. For now we just read the error from the error queue (thereby clearing the error state) and print a debug message. We can add more substantial handling of specific events in future if we want to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Add AF_UNSPEC support to sockaddr_ntop()	David Gibson	2024-07-17	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \|	Allow sockaddr_ntop() to format AF_UNSPEC socket addresses. There do exist a few cases where we might legitimately have either an AF_UNSPEC or a real address, such as the origin address from MSG_ERRQUEUE. Even in cases where we shouldn't get an AF_UNSPEC address, formatting it is likely to make things easier to debug if we ever somehow do. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp, tcp: Tweak handling of no_udp and no_tcp flags	David Gibson	2024-07-17	2	-5/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We abort the UDP socket handler if the no_udp flag is set. But if UDP was disabled we should never have had a UDP socket to trigger the handler in the first place. If we somehow did, ignoring it here isn't really going to help because aborting without doing anything is likely to lead to an epoll loop. The same is the case for the TCP socket and timer handlers and the no_tcp flag. Change these checks on the flag to ASSERT()s. Similarly add ASSERT()s to several other entry points to the protocol specific code which should never be called if the protocol is disabled. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Make udp_sock_recv static	David Gibson	2024-07-17	1	-2/+2
\| \| \| \| \| \| \| \|	Through an oversight this was previously declared as a public function although it's only used in udp.c and there is no prototype in any header. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Don't configure port forwarding for a disabled protocol	David Gibson	2024-07-17	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	UDP and/or TCP can be disabled with the --no-udp and --no-tcp options. However, when this is specified, it's still possible to configure forwarded ports for the disabled protocol. In some cases this will open sockets and perform other actions, which might not be safe since the entire protocol won't be initialised. Check for this case, and explicitly forbid it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: handle shrunk window advertisements from guest	Jon Maloy	2024-07-15	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A bug in kernel TCP may lead to a deadlock where a zero window is sent from the guest peer, while it is unable to send out window updates even after socket reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisements from the peer can only be triggered by data packets arriving from this side. However, currently such packets are never sent, because the zero-window condition prevents this side from sending out any packets whatsoever to the peer. We notice that the above bug is triggered only after the peer has dropped one or more arriving packets because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that the implementation goes against the RFC-9293 recommendation that a previously advertised window never should shrink. RFC-9293 seems to permit that we can continue sending up to the right edge of the last advertised non-zero window in such situations, so that is what we do to resolve this situation. It turns out that this solution is extremely simple to implememt in the code: We just omit to save the advertised zero-window when we see that it has shrunk, i.e., if the acknowledged sequence number in the advertisement message is lower than that of the last data byte sent from our side. When that is the case, the following happens: - The 'retr' flag in tcp_data_from_tap() will be 'false', so no retransmission will occur at this occasion. - The data stream will soon reach the right edge of the previously advertised window. In fact, in all observed cases we have seen that it is already there when the zero-advertisement arrives. - At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set, unless they already have been, meaning that only the next timer expiration will open for data retransmission or transmission. - When that happens, the memory squeeze at the guest will normally have abated, and the data flow can resume. It should be noted that although this solves the problem we have at hand, it is a work-around, and not a genuine solution to the described kernel bug. Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor fix in commit title and commit reference in comment to workaround Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: leverage support of SO_PEEK_OFF socket option when available	Jon Maloy	2024-07-15	3	-9/+73
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	>From linux-6.9.0 the kernel will contain commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option"). This new feature makes is possible to call recv_msg(MSG_PEEK) and make it start reading data from a given offset set by the SO_PEEK_OFF socket option. This way, we can avoid repeated reading of already read bytes of a received message, hence saving read cycles when forwarding TCP messages in the host->name space direction. In this commit, we add functionality to leverage this feature when available, while we fall back to the previous behavior when not. Measurements with iperf3 shows that throughput increases with 15-20 percent in the host->namespace direction when this feature is used. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	doc: Trivial fix for reuseaddr-priority	David Gibson	2024-07-15	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	This test program checks for particular behaviour regardless of order of operations. So, we step through the test with all possible orders for a number of different of parts. Or at least, we're supposed to, a copy pasta error led to using the same order for two things which should be independent. Fixes: 299c40750137 ("doc: Add program to document and test assumptions about SO_REUSEADDR") Reported-by: David Taylor <davidt@yadt.co.uk> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	doc: Test behaviour of zero length datagram recv()s	David Gibson	2024-07-05	3	-3/+78
\| \| \| \| \| \| \| \|	Add a test program verifying that we're able to discard datagrams from a socket without needing a big discard buffer, by using a zero length recv(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	doc: Add program to document and test assumptions about SO_REUSEADDR	David Gibson	2024-07-05	6	-0/+417
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	For the approach we intend to use for handling UDP flows, we have some pretty specific requirements about how SO_REUSEADDR works with UDP sockets. Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s, and therefore there can be multiple sockets which are eligible to receive the same datagram. Which one will actually receive it is important to us. Add a test program which verifies things work the way we expect, which documents what those expectations are in the process. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Consolidate datagram batching	David Gibson	2024-07-05	1	-90/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we receive datagrams on a socket, we need to split them into batches depending on how they need to be forwarded (either via a specific splice socket, or via tap). The logic to do this, is somewhat awkwardly split between udp_buf_sock_handler() itself, udp_splice_send() and udp_tap_send(). Move all the batching logic into udp_buf_sock_handler(), leaving udp_splice_send() to just send the prepared batch. udp_tap_send() reduces to just a call to tap_send_frames() so open-code that call in udp_buf_sock_handler(). This will allow separating the batching logic from the rest of the datagram forwarding logic, which we'll need for upcoming flow table support. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Move some more of sock_handler tasks into sub-functions	David Gibson	2024-07-05	1	-46/+84
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_buf_sock_handler(), udp_splice_send() and udp_tap_send loosely, do four things between them: 1. Receive some datagrams from a socket 2. Split those datagrams into batches depending on how they need to be sent (via tap or via a specific splice socket) 3. Prepare buffers for each datagram to send it onwards 4. Actually send it onwards Split (1) and (3) into specific helper functions. This isn't immediately useful (udp_splice_prepare(), in particular, is trivial), but it will make further reworks clearer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't repeatedly initialise udp[46]_eth_hdr	David Gibson	2024-07-05	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \|	Since we split our packet frame buffers into different pieces, we have a single buffer per IP version for the ethernet header, rather than one per frame. This makes sense since our ethernet header is alwaus the same. However we initialise those buffers udp[46]_eth_hdr inside a per frame loop. Pull that outside the loop so we just initialise them once. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Unify udp[46]_l2_iov	David Gibson	2024-07-05	1	-23/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The only differences between these arrays are that udp4_l2_iov is pre-initialised to point to the IPv4 ethernet header, and IPv4 per-frame header and udp6_l2_iov points to the IPv6 versions. We already have to set up a bunch of headers per-frame, including updating udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len. It makes more sense to adjust the IOV entries to point at the correct headers for the frame than to have two complete sets of iovecs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Unify udp[46]_mh_splice	David Gibson	2024-07-05	1	-27/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where the only difference is that they point to different sockaddr buffers for the destination address. Unify these by having the common array point at a sockaddr_inany as the address. This does mean slightly more work when we're about to splice, because we need to write the whole socket address, rather than just the port. However it removes 32 mmsghdr structures and we're going to need more flexibility constructing that target address for the flow table. Because future changes might mean that the address isn't always loopback, change the name of the common address from *_localname to udp_splicename. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Rename IOV and mmsghdr arrays	David Gibson	2024-07-05	1	-34/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make the salient points about these various arrays clearer with renames: * udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do with L2. They are, however, specific to receiving not sending. Rename to udp_iov_recv and udp[46]_mh_recv. * udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa. Rename to udp[46]_l2_iov * udp[46]_localname are (for now) pre-populated with the local address but the more salient point is that these are the destination address for the splice arrays. Rename to udp[46]_splice_to Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Pass full epoll reference through more of sock handler path	David Gibson	2024-07-05	1	-30/+33
\| \| \| \| \| \| \| \| \| \| \| \|	udp_buf_sock_handler() takes the epoll reference from the receiving socket, and passes the UDP relevant part on to several other functions. Future changes are going to need several different epoll types for UDP, and to pass that information through to some of those functions. To avoid extra noise in the patches making the real changes, change those functions now to take the full epoll reference, rather than just the UDP part. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Add flow_sidx_valid() helper	David Gibson	2024-07-05	3	-5/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To implement the TCP hash table, we need an invalid (NULL-like) value for flow_sidx_t. We use FLOW_SIDX_NONE for that, but for defensiveness, we treat (usually) anything with an out of bounds flow index the same way. That's not always done consistently though. In flow_at_sidx() we open code a check on the flow index. In tcp_hash_probe() we instead compare against FLOW_SIDX_NONE, and in some other places we use the fact that flow_at_sidx() will return NULL in this case, even if we don't otherwise need the flow it returns. Clean this up a bit, by adding an explicit flow_sidx_valid() test function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: sock_l4() determine protocol from epoll type rather than the reverse	David Gibson	2024-07-05	7	-67/+81
\| \| \| \| \| \| \| \| \| \| \| \|	sock_l4() creates a socket of the given IP protocol number, and adds it to the epoll state. Currently it determines the correct tag for the epoll data based on the protocol. However, we have some future cases where we might want different semantics, and therefore epoll types, for sockets of the same protocol. So, change sock_l4() to take the epoll type as an explicit parameter, and determine the protocol from that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Use the right maximum buffer size for c->sock_path	Stefano Brivio	2024-07-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	UNIX_SOCK_MAX is the maximum number we'll append to the socket path if we generate it automatically. If it's given on the command line, it can be up to UNIX_PATH_MAX (including the terminating character) long. UNIX_SOCK_MAX happened to kind of fit because it's 100 (instead of 108). Commit ceddcac74a6e ("conf, tap: False "Buffer not null terminated" positives, CWE-170") fixed the wrong problem: the right fix for the problem at hand was actually commit cc287af173ca ("conf: Fix incorrect bounds checking for sock_path parameter"). Fixes: ceddcac74a6e ("conf, tap: False "Buffer not null terminated" positives, CWE-170") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp_splice: Check return value of setsockopt() for SO_RCVLOWAT	Stefano Brivio	2024-07-02	1	-5/+10
\| \| \| \| \| \| \| \| \|	Spotted by Coverity, harmless as we would consider that successful and check on the socket later from the timer, but printing a debug message in that case is definitely wise, should it ever happen. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Copy up to MAXDNSRCH - 1 bytes, not MAXDNSRCH	Stefano Brivio	2024-07-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Spotted by Coverity just recently. Not that it really matters as MAXDNSRCH always appears to be defined as 1025, while a full domain name can have up to 253 characters: it would be a bit pointless to have a longer search domain. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Reduce scope of rport in udp_invert_portmap()2024_06_24.1ee2eca	Stefano Brivio	2024-06-24	1	-2/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	cppcheck 2.14 warns that the scope of the rport variable could be reduced: do that, as reverted commit c80fa6a6bb44 ("udp: Make rport calculation more local") did, but keep the temporary variable of in_port_t type, otherwise the sum gets promoted to int. While at it, add a comment explaining why we calculate rport like this instead of directly using the sum as array index. Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Revert "udp: Make rport calculation more local"	Stefano Brivio	2024-06-24	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \|	This reverts commit c80fa6a6bb4415ad48f9e11424310875d0d99bc7, as it reintroduces the issue fixed by commit 1e6f92b995a9 ("udp: Fix 16-bit overflow in udp_invert_portmap()"). Reported-by: Laurent Jacquot <jk@lutty.net> Link: https://bugs.passt.top/show_bug.cgi?id=80 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	log: Don't report syslog failures to stderr after initialisation	Stefano Brivio	2024-06-21	1	-1/+1
\| \| \| \| \| \| \| \| \|	If we daemonised, we can't use standard error. If we didn't, it's rather annoying to have all those messages on standard error anyway, and kind of pointless too, as the messages we wanted to print were printed to standard error anyway. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, passt: Don't call __openlog() if a log file is used	Stefano Brivio	2024-06-21	2	-8/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a log file is configured, we would otherwise open a connection to the system logger (if any), print any message that we might have before we initialise the log file, and then keep that connection around for no particular reason. Call __openlog() as an alternative to the log file setup, instead. This way, we might skip printing some messages during the initialisation phase, but they're probably not really valuable to have in a system log, and we're going to print them to standard error anyway. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Replace strerror() calls	Stefano Brivio	2024-06-21	11	-77/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we have logging functions embedding perror() functionality, we can make _some_ calls more terse by using them. In many places, the strerror() calls are still more convenient because, for example, they are used in flow debugging functions, or because the return code variable of interest is not 'errno'. While at it, convert a few error messages from a scant perror style to proper failure descriptions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Replace perror() calls with calls to logging functions	Stefano Brivio	2024-06-21	6	-58/+38
\| \| \| \| \| \| \| \| \| \| \| \|	perror() prints directly to standard error, but in many cases standard error might be already closed, or we might want to skip logging, based on configuration. Our logging functions provide all that. While at it, make errors more descriptive, replacing some of the existing basic perror-style messages. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	log: Add _perror() logging function variants	Stefano Brivio	2024-06-21	2	-4/+39
\| \| \| \| \| \| \| \| \| \| \| \|	In many places, we have direct perror() calls, which completely bypass logging functions and log files. They are definitely convenient: offer similar convenience with _perror() logging variants, so that we can drop those direct perror() calls. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	log, passt: Always print to stderr before initialisation is complete	Stefano Brivio	2024-06-21	3	-15/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After commit 15001b39ef1d ("conf: set the log level much earlier"), we had a phase during initialisation when messages wouldn't be printed to standard error anymore. Commit f67238aa864d ("passt, log: Call __openlog() earlier, log to stderr until we detach") fixed that, but only for the case where no log files are given. If a log file is configured, vlogmsg() will not call passt_vsyslog(), but during initialisation, LOG_PERROR is set, so to avoid duplicated prints (which would result from passt_vsyslog() printing to stderr), we don't call fprintf() from vlogmsg() either. This is getting a bit too complicated. Instead of abusing LOG_PERROR, define an internal logging flag that clearly represents that we're not done with the initialisation phase yet. If this flag is not set, make sure we always print to stderr, if the log mask matches. Reported-by: Yalan Zhang <yalzhang@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, log: Instead of abusing log levels, add log_conf_parsed flag	Stefano Brivio	2024-06-21	3	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We currently use a LOG_EMERG log mask to represent the fact that we don't know yet what the mask resulting from configuration should be, before the command line is parsed. However, we have the necessity of representing another phase as well, that is, configuration is parsed but we didn't daemonise yet, or we're not ready for operation yet. The next patch will add that notion explicitly. Mapping these cases to further log levels isn't really practical. Introduce boolean log flags to represent them, instead of abusing log priorities. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, passt: Make --stderr do nothing, and deprecate it	Stefano Brivio	2024-06-21	4	-29/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The original behaviour of printing messages to standard error by default when running from a non-interactive terminal was introduced because the first KubeVirt integration draft used to start passt in foreground and get messages via standard error. For development purposes, the system logger was more convenient at that point, and passt was running from interactive terminals only if not started by the KubeVirt integration. This behaviour was introduced by 84a62b79a2bc ("passt: Also log to stderr, don't fork to background if not interactive"). Later, I added command-line options in 1e49d194d017 ("passt, pasta: Introduce command-line options and port re-mapping") and accidentally reversed this condition, which wasn't a problem as --stderr could force printing to standard error anyway (and it was used by KubeVirt). Nowadays, the KubeVirt integration uses a log file (requested via libvirt configuration), and the same applies for Podman if one actually needs to look at runtime logs. There are no use cases left, as far as I know, where passt runs in foreground in non-interactive terminals. Seize the chance to reintroduce some sanity here. If we fork to background, standard error is closed, so --stderr is useless in that case. If we run in foreground, there's no harm in printing messages to standard error, and that accidentally became the default behaviour anyway, so --stderr is not needed in that case. It would be needed for non-interactive terminals, but there are no use cases, and if there were, let's log to standard error anyway: the user can always redirect standard error to /dev/null if needed. Before we're up and running, we need to print to standard error anyway if something happens, otherwise we can't report failure to start in any kind of usage, stand-alone or in integrations. So, make --stderr do nothing, and deprecate it. While at it, drop a left-over comment about --foreground being the default only for interactive terminals, because it's not the case anymore. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, passt: Don't try to log to stderr after we close it	Stefano Brivio	2024-06-21	2	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \|	If we don't run in foreground, we close standard error as we daemonise, so it makes no sense to check if the controlling terminal is an interactive terminal or if --force-stderr was given, to decide if we want to log to standard error. Make --force-stderr depend on --foreground. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Accept duplicate and conflicting options, the last one wins	Stefano Brivio	2024-06-21	2	-104/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In multiple occasions, especially when passt(1) and pasta(1) are used in integrations such as the one with Podman, the ability to override earlier options on the command line with later one would have been convenient. Recently, to debug a number of issues happening with Podman, I would have liked to ask users to share a debug log by passing --debug as additional option, but pasta refuses --quiet (always passed by Podman) and --debug at the same time. On top of this, Podman lets users specify other pasta options in its containers.conf(5) file, as well as on the command line. The options from the configuration files are appended together with the ones from the command line, which makes it impossible for users to override options from the configuration file, if duplicated options are refused, unless Podman takes care of sorting them, which is clearly not sustainable. For --debug and --trace, somebody took care of this on Podman side at: https://github.com/containers/common/pull/2052 but this doesn't fix the issue with other options, and we'll have anyway older versions of Podman around, too. I think there's some value in telling users about duplicated or conflicting options, because that might reveal issues in integrations or accidental misconfigurations, but by now I'm fairly convinced that the downsides outweigh this. Drop checks about duplicate options and mutually exclusive ones. In some cases, we need to also undo a couple of initialisations caused by earlier options, but this looks like a simplification, overall. Notable exception: --stderr still conflicts with --log-file, because users might have the expectation that they don't actually conflict. But they do conflict in the existing implementation, so it's safer to make sure that the users notice that. Suggested-by: Paul Holzinger <pholzing@redhat.com> Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: Paul Holzinger <pholzing@redhat.com>
*	netlink: Strip nexthop identifiers when duplicating routes	Stefano Brivio	2024-06-20	1	-7/+16
\| \| \| \| \| \| \| \| \| \| \|	If routing daemons set up host routes, for example FRR via OSPF as in the reported issue, they might add nexthop identifiers (not objects) that are generally not valid in the target namespace. Strip them off as well, otherwise we'll get EINVAL from the kernel. Link: https://github.com/containers/podman/issues/22960 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	passt.1, qrap.1: align license description with SPDX identifier	Danish Prakash	2024-06-19	2	-4/+4
\| \| \| \| \| \| \| \| \| \|	The SPDX identifier states GPL-2.0-or-later but the copyright section mentions GPL-3.0 or later causing a mismatch. Also, only correctly refers to GPL instead of AGPL. Signed-off-by: Danish Prakash <contact@danishpraka.sh> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Ignore EHOSTUNREACH failures when duplicating routes	Stefano Brivio	2024-06-19	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To implicitly resolve possible dependencies between routes as we duplicate them into the target namespace, we go through a set of n routes n times, and ignore EEXIST responses to netlink messages (we already inserted the route) and ENETUNREACH (we didn't insert the route yet, but we need to insert another one first). Until now, we didn't ignore EHOSTUNREACH responses. However, NetworkManager users with multiple non-subnet routes for the same interface report that pasta exits with "no route to host" while duplicating routes. This happens because NetworkManager sets the 'noprefixroute' attribute on addresses, meaning that the kernel won't create subnet routes automatically depending on the prefix length of the address. We copy this attribute as we copy the address into the target namespace, and as a result, the kernel doesn't create subnet routes in the target namespace either. This means that the gateway for routes that are inserted later can be unreachable at some points during the sequence of route duplication. That is, we don't just have dependencies between regular routes, but we can also have dependencies between regular routes and subnet routes, as subnet routes are not automatically inserted in advance. Link: https://github.com/containers/podman/issues/22824 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: With no default route, pick the first interface with a route	Stefano Brivio	2024-06-19	2	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While commit f919dc7a4b1c ("conf, netlink: Don't require a default route to start") sounded reasonable in the assumption that, if we don't find default routes for a given address family, we can still proceed by selecting an interface with any route iff it's the only one for that protocol family, Jelle reported a further issue in a similar setup. There, multiple interfaces are present, and while remote container connectivity doesn't matter for the container, local connectivity is desired. There are no default routes, but those multiple interfaces all have non-default routes, so we should just pick one and start. Pick the first interface reported by the kernel with any route, if there are no default routes. There should be no harm in doing so. Reported-by: Jelle van der Waa <jvanderwaa@redhat.com> Reported-by: Martin Pitt <mpitt@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2277954 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>
*	tcp: Don't rely on bind() to fail to decide that connection target is valid	Stefano Brivio	2024-06-19	1	-17/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit e1a2e2780c91 ("tcp: Check if connection is local or low RTT was seen before using large MSS") added a call to bind() before we issue a connect() to the target for an outbound connection. If bind() fails, but neither with EADDRNOTAVAIL, nor with EACCESS, we can conclude that the target address is a local (host) address, and we can use an unlimited MSS. While at it, according to the reasoning of that commit, if bind() succeeds, we would know right away that nobody is listening at that (local) address and port, and we don't even need to call connect(): we can just fail early and reset the connection attempt. But if non-local binds are enabled via net.ipv4.ip_nonlocal_bind or net.ipv6.ip_nonlocal_bind sysctl, binding to a non-local address will actually succeed, so we can't rely on it to fail in general. The visible issue with the existing behaviour is that we would reset any outbound connection to non-local addresses, if non-local binds are enabled. Keep the significant optimisation for local addresses along with the bind() call, but if it succeeds, don't draw any conclusion: close the socket, grab another one, and proceed normally. This will incur a small latency penalty if non-local binds are enabled (we'll likely fetch an existing socket from the pool but additionally call close()), or if the target is local but not bound: we'll need to call connect() and get a failure before relaying that failure back. Link: https://github.com/containers/podman/issues/23003 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	siphash: Remove stale prototypes	David Gibson	2024-06-19	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \|	In fc8f0f8c ("siphash: Use incremental rather than all-at-once siphash functions") we removed the older interface to the SipHash implementation, which took fixed sized blocks of data. However, we forgot to remove the prototypes for those functions, so do that now. Fixes: fc8f0f8c48ef ("siphash: Use incremental rather than all-at-once siphash functions") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>