passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	treewide: Use "our address" instead of "forwarding address"	David Gibson	2024-08-21	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	The term "forwarding address" to indicate the local-to-passt address was well-intentioned, but ends up being kinda confusing. As discussed on a recent call, let's try "our" instead. (While we're there correct an error in flow_initiate_af()s comments where we referred to parameters by the wrong name). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, icmp: Use general flow forwarding rules for ICMP	David Gibson	2024-07-19	1	-6/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Current ICMP hard codes its forwarding rules, and never applies any translations. Change it to use the flow_target() function, so that it's translated the same as TCP (excluding TCP specific port redirection). This means that gw mapping now applies to ICMP so "ping <gw address>" will now ping the host's loopback instead of the actual gw machine. This removes the surprising behaviour that the target you ping might not be the same as you connect to with TCP. This removes the last user of flow_target_af(), so that's removed as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Manage outbound socket address via flow table	David Gibson	2024-07-19	1	-13/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For now when we forward a ping to the host we leave the host side forwarding address and port blank since we don't necessarily know what source address and id will be used by the kernel. When the outbound address option is active, though, we do know the address at least, so we can record it in the flowside. Having done that, use it as the primary source of truth, binding the outgoing socket based on the information in there. This allows the possibility of more complex rules for what outbound address and/or id we use in future. To implement this we create a new helper which sets up a new socket based on information in a flowside, which will also have future uses. It behaves slightly differently from the existing ICMP code, in that it doesn't bind to a specific interface if given a loopback address. This is logically correct - the loopback address means we need to operate through the host's loopback interface, not ifname_out. We didn't need it in ICMP because ICMP will never generate a loopback address at this point, however we intend to change that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Eliminate icmp_id_map	David Gibson	2024-07-19	1	-17/+2
\| \| \| \| \| \| \| \|	With previous reworks the icmp_id_map data structure is now maintained, but never used for anything. Eliminate it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Look up ping flows using flow hash	David Gibson	2024-07-19	1	-3/+15
\| \| \| \| \| \| \| \| \| \|	When we receive a ping packet from the tap interface, we currently locate the correct flow entry (if present) using an anciliary data structure, the icmp_id_map[] tables. However, we can look this up using the flow hash table - that's what it's for. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Obtain destination addresses from the flowsides	David Gibson	2024-07-19	1	-10/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	icmp_sock_handler() obtains the guest address from it's most recently observed IP. However, this can now be obtained from the common flowside information. icmp_tap_handler() builds its socket address for sendto() directly from the destination address supplied by the incoming tap packet. This can instead be generated from the flow. Using the flowsides as the common source of truth here prepares us for allowing more flexible NAT and forwarding by properly initialising that flowside information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Remove redundant id field from flow table entry	David Gibson	2024-07-19	1	-5/+5
\| \| \| \| \| \| \| \| \|	struct icmp_ping_flow contains a field for the ICMP id of the ping, but this is now redundant, since the id is also stored as the "port" in the common flowsides. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Common address information for target side	David Gibson	2024-07-19	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Common address information for initiating side	David Gibson	2024-07-19	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides. For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, icmp, tcp: Clean up helpers for getting flow from index	David Gibson	2024-07-17	1	-3/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP (both regular and spliced) and ICMP both have macros to retrieve the relevant protcol specific flow structure from a flow index. In most cases what we actually want is to get the specific flow from a sidx. Replace those simple macros with a more precise inline, which also asserts that the flow is of the type we expect. While we're they're also add a pif_at_sidx() helper to get the interface of a specific flow & side, which is useful in some places. Finally, fix some minor style issues in the comments on some of the existing sidx related helpers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: sock_l4() determine protocol from epoll type rather than the reverse	David Gibson	2024-07-05	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	sock_l4() creates a socket of the given IP protocol number, and adds it to the epoll state. Currently it determines the correct tag for the epoll data based on the protocol. However, we have some future cases where we might want different semantics, and therefore epoll types, for sockets of the same protocol. So, change sock_l4() to take the epoll type as an explicit parameter, and determine the protocol from that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Record the pifs for each side of each flow	David Gibson	2024-05-22	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we have no generic information flows apart from the type and state, everything else is specific to the flow type. Start introducing generic flow information by recording the pifs which the flow connects. To keep track of what information is valid, introduce new flow states: INI for when the initiating side information is complete, and TGT for when both sides information is complete, but we haven't chosen the flow type yet. For now, these states don't do an awful lot, but they'll become more important as we add more generic information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make side 0 always be the initiating side	David Gibson	2024-05-22	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Each flow in the flow table has two sides, 0 and 1, representing the two interfaces between which passt/pasta will forward data for that flow. Which side is which is currently up to the protocol specific code: TCP uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except for spliced connections where it uses 0 for the initiating side and 1 for the target side. ICMP also uses 0 for the host/"sock" side and 1 for the guest/"tap" side, but in its case the latter is always also the initiating side. Make this generically consistent by always using side 0 for the initiating side and 1 for the target side. This doesn't simplify a lot for now, and arguably makes TCP slightly more complex, since we add an extra field to the connection structure to record which is the guest facing side. This is an interim change, which we'll be able to remove later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>q Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Clarify and enforce flow state transitions	David Gibson	2024-05-22	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Flows move over several different states in their lifetime. The rules for these are documented in comments, but they're pretty complex and a number of the transitions are implicit, which makes this pretty fragile and error prone. Change the code to explicitly track the states in a field. Make all transitions explicit and logged. To the extent that it's practical in C, enforce what can and can't be done in various states with ASSERT()s. While we're at it, tweak the docs to clarify the restrictions on each state a bit. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Properly type callbacks to protocol specific handlers	David Gibson	2024-05-22	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The flow dispatches deferred and timer handling for flows centrally, but needs to call into protocol specific code for the handling of individual flows. Currently this passes a general union flow *. It makes more sense to pass the specific relevant flow type structure. That brings the check on the flow type adjacent to casting to the union variant which it tags. Arguably, this is a slight abstraction violation since it involves the generic flow code using protocol specific types. It's already calling into protocol specific functions, so I don't think this really makes any difference. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise variable names for various packet lengths	David Gibson	2024-05-02	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At various points we need to track the lengths of a packet including or excluding various different sets of headers. We don't always use the same variable names for doing so. Worse in some places we use the same name for different things: e.g. tcp_fill_headers[46]() use ip_len for the length including the IP headers, but then tcp_send_flag() which calls it uses it to mean the IP payload length only. To improve clarity, standardise on these names: dlen: L4 protocol payload length ("data length") l4len: plen + length of L4 protocol header l3len: l4len + length of IPv4/IPv6 header l2len: l3len + length of L2 (ethernet) header Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Use 'flowside' epoll references for ping sockets	David Gibson	2024-03-12	1	-17/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently ping sockets use a custom epoll reference type which includes the ICMP id. However, now that we have entries in the flow table for ping flows, finding that is sufficient to get everything else we want, including the id. Therefore remove the icmp_epoll_ref type and use the general 'flowside' field for ping sockets. Having done this we no longer need separate EPOLL_TYPE_ICMP and EPOLL_TYPE_ICMPV6 reference types, because we can easily determine which case we have from the flow type. Merge both types into EPOLL_TYPE_PING. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Flow based error reporting	David Gibson	2024-03-12	1	-14/+12
\| \| \| \| \| \| \| \| \| \| \| \|	Use flow_dbg() and flow_err() helpers to generate flow-linked error messages in most places. Make a few small improvements to the messages while we're at it. This allows us to avoid the awkward 'pname' variables since whether we're dealing with ICMP or ICMPv6 is already built into the flow type which these helpers include. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Coding style fix in icmp_tap_handler()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Store ping socket information in flow table	David Gibson	2024-03-12	1	-79/+66
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently icmp_id_map[][] stores information about ping sockets in a bespoke structure. Move the same information into new types of flow in the flow table. To match that change, replace the existing ICMP timer with a flow-based timer for expiring ping sockets. This has the advantage that we only need to scan the active flows, not all possible ids. We convert icmp_id_map[][] to point to the flow table entries, rather than containing its own information. We do still use that array for locating the right ping flows, rather than using a "flow native" form of lookup for the time being. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Update id_sock description in comment to icmp_ping_new()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: move IP stuff from util.[ch] to ip.[ch]	Laurent Vivier	2024-03-06	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Introduce ip.[ch] file to encapsulate IP protocol handling functions and structures. Modify various files to include the new header ip.h when it's needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-5-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Introduce union sockaddr_inany	David Gibson	2024-02-29	1	-12/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are a number of places where we want to handle either a sockaddr_in or a sockaddr_in6. In some of those we use a void *, which works ok and matches some standard library interfaces, but doesn't give a signature level hint that we're dealing with only sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of socket address. Other places we use a sockaddr_storage, which also works, but has the same problem in addition to allocating more on the stack than we need to. Introduce union sockaddr_inany to explictly handle this case: it has variants for sockaddr_in and sockaddr_in6. Use it in a number of places where it's easy to do so. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use sa_family_t for address family variables	David Gibson	2024-02-27	1	-3/+3
\| \| \| \| \| \| \| \| \| \|	Sometimes we use sa_family_t for variables and parameters containing a socket address family, other times we use a plain int. Since sa_family_t is what's actually used in struct sockaddr and friends, standardise on that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Dedicated functions for starting and closing ping sequences	David Gibson	2024-01-22	1	-35/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	ICMP sockets are cleaned up on a timeout implemented in icmp_timer_one(), and the logic to do that cleanup is open coded in that function. Similarly new sockets are opened when we discover we don't have an existing one in icmp_tap_handler(), and again the logic is open-coded. That's not the worst thing, but it's a bit cleaner to have dedicated functions for the creation and destruction of ping sockets. This will also make things a bit easier for future changes we have in mind. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Validate packets received on ping sockets	David Gibson	2024-01-22	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \|	We access fields of packets received from ping sockets assuming they're echo replies, without actually checking that. Of course, we don't expect anything else from the kernel, but it's probably best to verify. While we're at it, also check for short packets, or a receive address of the wrong family. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Warn on receive errors from ping sockets	David Gibson	2024-01-22	1	-1/+4
\| \| \| \| \| \| \| \| \|	Currently we silently ignore an errors receiving a packet from a ping socket. We don't expect that to happen, so it's probably worth reporting if it does. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Consolidate icmp_sock_handler() with icmpv6_sock_handler()	David Gibson	2024-01-22	1	-55/+34
\| \| \| \| \| \| \| \| \| \| \|	Currently we have separate handlers for ICMP and ICMPv6 ping replies. Although there are a number of points of difference, with some creative refactoring we can combine these together sensibly. Although it doesn't save a vast amount of code, it does make it clearer that we're performing basically the same steps for each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Share more between IPv4 and IPv6 paths in icmp_tap_handler()	David Gibson	2024-01-22	1	-68/+68
\| \| \| \| \| \| \| \| \| \|	Currently icmp_tap_handler() consists of two almost disjoint paths for the IPv4 and IPv6 cases. The only thing they share is an error message. We can use some intermediate variables to refactor this to share some more code between those paths. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Simplify socket expiry scanning	David Gibson	2024-01-22	1	-32/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we use icmp_act[] to scan for ICMP ids which might have an open socket which could time out. However icmp_act[] contains no information that's not already in icmp_id_map[] - it's just an "index" which allows scanning for relevant entries with less cache footprint. We only scan for ICMP socket expiry every 1s, though, so it's not clear that cache footprint really matters. Furthermore, there's no strong reason we need to scan even that often - the timeout is fairly arbitrary and approximate. So, eliminate icmp_act[] in favour of directly scanning icmp_id_map[] and compensate for the cache impact by reducing the scan frequency to once every 10s. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Use -1 to represent "missing" sockets	David Gibson	2024-01-22	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \|	icmp_id_map[] contains, amongst other things, fds for "ping" sockets associated with various ICMP echo ids. However, we only lazily open() those sockets, so many will be missing. We currently represent that with a 0, which isn't great, since that's technically a valid fd. Use -1 instead. This does require initializing the fields in icmp_id_map[] but we already have an obvious place to do that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't attempt to match host IDs to guest IDs	David Gibson	2024-01-22	1	-12/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When forwarding pings from tap, currently we create a ping socket with a socket address whose port is set to the ID of the ping received from the guest. This causes the socket to send pings with the same ID on the host. Although this seems look a good idea for maximum transparency, it's probably unwise. First, it's fallible - the bind() could fail, and we already have fallback logic which will overwrite the packets with the expected guest id if the id we get on replies doesn't already match. We might as well do that unconditionally. But more importantly, we don't know what else on the host might be using ping sockets, so we could end up with an ID that's the same as an existing socket. You'd expect that to fail the bind() with EADDRINUSE, which would be fine: we'd fall back to rewriting the reply ids. However it appears the kernel (v6.6.3 at least), does not fail the bind() and instead it's "last socket wins" in terms of who gets the replies. So we could accidentally intercept ping replies for something else on the host. So, instead of using bind() to set the id, just let the kernel pick one and expect to translate the replies back. Although theoretically this makes the passt/pasta link a bit less "transparent", essentially nothing cares about specific ping IDs, much like TCP source ports, which we also don't preserve. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't attempt to handle "wrong direction" ping socket traffic	David Gibson	2024-01-22	1	-10/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Linux ICMP "ping" sockets are very specific in what they do. They let userspace send ping requests (ICMP_ECHO or ICMP6_ECHO_REQUEST), and receive matching replies (ICMP_ECHOREPLY or ICMP6_ECHO_REPLY). They don't let you intercept or handle incoming ping requests. In the case of passt/pasta that means we can process echo requests from tap and forward them to a ping socket, then take the replies from the ping socket and forward them to tap. We can't do the reverse: take echo requests from the host and somehow forward them to the guest. There's really no way for something outside to initiate a ping to a passt/pasta connected guest and if there was we'd need an entirely different mechanism to handle it. However, we have some logic to deal with packets going in that reverse direction. Remove it, since it can't ever be used that way. While we're there use defines for the ICMPv6 types, instead of open coded type values. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Remove redundant initialisation of sendto() address	David Gibson	2024-01-22	1	-2/+0
\| \| \| \| \| \| \| \| \|	We initialise the address portion of the sockaddr for sendto() to the unspecified address, but then always overwrite it with the actual destination address before we call the sendto(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't set "port" on destination sockaddr for ping sockets	David Gibson	2024-01-22	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We set the port to the ICMP id on the sendto() address when using ICMP ping sockets. However, this has no effect: the ICMP id the kernel uses is determined only by the "port" on the socket's bound address (which is constructed inside sock_l4(), using the id we also pass to it). For unclear reasons this change triggers cppcheck 2.13.0 to give new "variable could be const pointer" warnings, so make *ih const as well to fix that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise on 'now' for current timestamp variables	David Gibson	2024-01-22	1	-6/+6
\| \| \| \| \| \| \| \| \|	In a number of places we pass around a struct timespec representing the (more or less) current time. Sometimes we call it 'now', and sometimes we call it 'ts'. Standardise on the more informative 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Make sock_l4() treat empty string ifname like NULL	David Gibson	2023-12-27	1	-11/+4
\| \| \| \| \| \| \| \| \| \| \|	sock_l4() takes NULL for ifname if you don't want to bind the socket to a particular interface. However, for a number of the callers, it's more natural to use an empty string for that case. Change sock_l4() to accept either NULL or an empty string equivalently, and simplify some callers using that change. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Avoid unnecessary handling of unspecified bind address	David Gibson	2023-12-27	1	-12/+4
\| \| \| \| \| \| \| \| \| \|	We go to some trouble, if the configured output address is unspecified, to pass NULL to sock_l4(). But while passing NULL is one way to get sock_l4() not to specify a bind address, passing the "any" address explicitly works too. Use this to simplify icmp_tap_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Add IN4ADDR_ANY_INIT macro	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address, make a similar one for the unspecified / any address. This avoids messying things with the internal structure of struct in_addr where we don't care about it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Pass originating pif to tap handler functions	David Gibson	2023-11-07	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \|	For now, packets passed to the various *_tap_handler() functions always come from the single "tap" interface. We want to allow the possibility to broaden that in future. As preparation for that, have the code in tap.c pass the pif id of the originating interface to each of those handler functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Consolidate and improve workarounds for clang-tidy issue 58992	David Gibson	2023-09-27	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several workarounds for a clang-tidy bug where the checker doesn't recognize that a number of system calls write to - and therefore initialise - a socket address. We can't neatly use a suppression, because the bogus warning shows up some time after the actual system call, when we access a field of the socket address which clang-tidy erroneously thinks is uninitialised. Consolidate these workarounds into one place by using macros to implement wrappers around affected system calls which add a memset() of the sockaddr to silence clang-tidy. This removes the need for the individual memset() workarounds at the callers - and the somewhat longwinded explanatory comments. We can then use a #define to not include the hack in "real" builds, but only consider it for clang-tidy. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Pass source address to protocol handler functions	David Gibson	2023-08-22	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \|	The tap code passes the IPv4 or IPv6 destination address of packets it receives to the protocol specific code. Currently that protocol code doesn't use the source address, but we want it to in future. So, in preparation, pass the IPv4/IPv6 source address of tap packets to those functions as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Split handling of ICMP and ICMPv6 sockets	David Gibson	2023-08-13	1	-48/+64
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have different epoll type values for ICMP and ICMPv6 sockets, but they both call the same handler function, icmp_sock_handler(). However that function does essentially nothing in common for the two cases. So, split it into icmp_sock_handler() and icmpv6_sock_handler() and dispatch them separately from the top level. While we're there remove some parameters that the function was never using anyway. Also move the test for c->no_icmp into the functions, so that all the logic specific to ICMP is within the handler, rather than in the top level dispatch code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Generalize epoll_ref to cover things other than sockets	David Gibson	2023-08-13	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The epoll_ref type includes fields for the IP protocol of a socket, and the socket fd. However, we already have a few things in the epoll which aren't protocol sockets, and we may have more in future. Rename these fields to an abstract "fd type" and file descriptor for more generality. Similarly, rather than using existing IP protocol numbers for the type, introduce our own number space. For now these just correspond to the supported protocols, but we'll expand on that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use C11 anonymous members to make poll refs less verbose to use	David Gibson	2023-08-04	1	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \|	union epoll_ref has a deeply nested set of structs and unions to let us subdivide it into the various different fields we want. This means that referencing elements can involve an awkward long string of intermediate fields. Using C11 anonymous structs and unions lets us do this less clumsily. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Relicense to GPL 2.0, or any later version	Stefano Brivio	2023-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, icmp, tcp, udp: Add options to bind to outbound address and interface	Stefano Brivio	2023-03-09	1	-4/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I didn't notice earlier: libslirp (and slirp4netns) supports binding outbound sockets to specific IPv4 and IPv6 addresses, to force the source addresse selection. If we want to claim feature parity, we should implement that as well. Further, Podman supports specifying outbound interfaces as well, but this is simply done by resolving the primary address for an interface when the network back-end is started. However, since kernel version 5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for non-root users"), we can actually bind to a specific interface name, which doesn't need to be validated in advance. Implement -o / --outbound ADDR to bind to IPv4 and IPv6 addresses, and --outbound-if4 and --outbound-if6 to bind IPv4 and IPv6 sockets to given interfaces. Given that it probably makes little sense to select addresses and routes from interfaces different than the ones given for outbound sockets, also assign those as "template" interfaces, by default, unless explicitly overridden by '-i'. For ICMP and UDP, we call sock_l4() to open outbound sockets, as we already needed to bind to given ports or echo identifiers, and we can bind() a socket only once: there, pass address (if any) and interface (if any) for the existing bind() and setsockopt() calls. For TCP, in general, we wouldn't otherwise bind sockets. Add a specific helper to do that. For UDP outbound sockets, we need to know if the final destination of the socket is a loopback address, before we decide whether it makes sense to bind the socket at all: move the block mangling the address destination before the creation of the socket in the IPv4 path. This was already the case for the IPv6 path. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Use typing to reduce chances of IPv4 endianness errors	David Gibson	2022-11-04	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We recently corrected some errors handling the endianness of IPv4 addresses. These are very easy errors to make since although we mostly store them in network endianness, we sometimes need to manipulate them in host endianness. To reduce the chances of making such mistakes again, change to always using a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network endian addresses. This makes it harder to accidentally do arithmetic or comparisons on such addresses as if they were host endian. We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to directly work with struct in_addr values. This has the additional benefit of making the IPv4 and IPv6 paths more visually similar. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Correct some missing endian conversions of IPv4 addresses	David Gibson	2022-11-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The INADDR_LOOPBACK constant is in host endianness, and similarly the IN_MULTICAST macro expects a host endian address. However, there are some places in passt where we use those with network endian values. This means that passt will incorrectly allow you to set 127.0.0.1 or a multicast address as the guest address or DNS forwarding address. Add the necessary conversions to correct this. INADDR_ANY and INADDR_BROADCAST logically behave the same way, although because they're palindromes it doesn't have an effect in practice. Change them to be logically correct while we're there, though. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't discard first reply sequence for a given echo ID2022_10_26.f212044	Stefano Brivio	2022-10-27	1	-2/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pasta mode, ICMP and ICMPv6 echo sockets relay back to us any reply we send: we're on the same host as the target, after all. We discard them by comparing the last sequence we sent with the sequence we receive. However, on the first reply for a given identifier, the sequence might be zero, depending on the implementation of ping(8): we need another value to indicate we haven't sent any sequence number, yet. Use -1 as initialiser in the echo identifier map. This is visible with Busybox's ping, and was reported by Paul on the integration at https://github.com/containers/podman/pull/16141, with: $ podman run --net=pasta alpine ping -c 2 192.168.188.1 ...where only the second reply would be routed back. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 33482d5bf293 ("passt: Add PASTA mode, major rework") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	icmp: Add debugging messages for handled replies and requests	Stefano Brivio	2022-10-27	1	-5/+25
\| \| \| \| \| \| \|	...instead of just reporting errors. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Split tap_ip4_send() into UDP and ICMP variants	David Gibson	2022-10-19	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tap_ip4_send() has special case logic to compute the checksums for UDP and ICMP packets, which is a mild layering violation. By using a suitable helper we can split it into tap_udp4_send() and tap_icmp4_send() functions without greatly increasing the code size, this removing that layering violation. We make some small changes to the interface while there. In both cases we make the destination IPv4 address a parameter, which will be useful later. For the UDP variant we make it take just the UDP payload, and it will generate the UDP header. For the ICMP variant we pass in the ICMP header as before. The inconsistency is because that's what seems to be the more natural way to invoke the function in the callers in each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>