passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	icmp: Use -1 to represent "missing" sockets	David Gibson	2024-01-22	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \|	icmp_id_map[] contains, amongst other things, fds for "ping" sockets associated with various ICMP echo ids. However, we only lazily open() those sockets, so many will be missing. We currently represent that with a 0, which isn't great, since that's technically a valid fd. Use -1 instead. This does require initializing the fields in icmp_id_map[] but we already have an obvious place to do that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't attempt to match host IDs to guest IDs	David Gibson	2024-01-22	1	-12/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When forwarding pings from tap, currently we create a ping socket with a socket address whose port is set to the ID of the ping received from the guest. This causes the socket to send pings with the same ID on the host. Although this seems look a good idea for maximum transparency, it's probably unwise. First, it's fallible - the bind() could fail, and we already have fallback logic which will overwrite the packets with the expected guest id if the id we get on replies doesn't already match. We might as well do that unconditionally. But more importantly, we don't know what else on the host might be using ping sockets, so we could end up with an ID that's the same as an existing socket. You'd expect that to fail the bind() with EADDRINUSE, which would be fine: we'd fall back to rewriting the reply ids. However it appears the kernel (v6.6.3 at least), does not fail the bind() and instead it's "last socket wins" in terms of who gets the replies. So we could accidentally intercept ping replies for something else on the host. So, instead of using bind() to set the id, just let the kernel pick one and expect to translate the replies back. Although theoretically this makes the passt/pasta link a bit less "transparent", essentially nothing cares about specific ping IDs, much like TCP source ports, which we also don't preserve. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't attempt to handle "wrong direction" ping socket traffic	David Gibson	2024-01-22	1	-10/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Linux ICMP "ping" sockets are very specific in what they do. They let userspace send ping requests (ICMP_ECHO or ICMP6_ECHO_REQUEST), and receive matching replies (ICMP_ECHOREPLY or ICMP6_ECHO_REPLY). They don't let you intercept or handle incoming ping requests. In the case of passt/pasta that means we can process echo requests from tap and forward them to a ping socket, then take the replies from the ping socket and forward them to tap. We can't do the reverse: take echo requests from the host and somehow forward them to the guest. There's really no way for something outside to initiate a ping to a passt/pasta connected guest and if there was we'd need an entirely different mechanism to handle it. However, we have some logic to deal with packets going in that reverse direction. Remove it, since it can't ever be used that way. While we're there use defines for the ICMPv6 types, instead of open coded type values. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Remove redundant initialisation of sendto() address	David Gibson	2024-01-22	1	-2/+0
\| \| \| \| \| \| \| \| \|	We initialise the address portion of the sockaddr for sendto() to the unspecified address, but then always overwrite it with the actual destination address before we call the sendto(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't set "port" on destination sockaddr for ping sockets	David Gibson	2024-01-22	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We set the port to the ICMP id on the sendto() address when using ICMP ping sockets. However, this has no effect: the ICMP id the kernel uses is determined only by the "port" on the socket's bound address (which is constructed inside sock_l4(), using the id we also pass to it). For unclear reasons this change triggers cppcheck 2.13.0 to give new "variable could be const pointer" warnings, so make *ih const as well to fix that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Avoid moving flow entries to compact table	David Gibson	2024-01-22	7	-87/+167
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we always keep the flow table maximally compact: that is all the active entries are contiguous at the start of the table. Doing this sometimes requires moving an entry when one is freed. That's kind of fiddly, and potentially expensive: it requires updating the hash table for the new location, and depending on flow type, it may require EPOLL_CTL_MOD, system calls to update epoll tags with the new location too. Implement a new way of managing the flow table that doesn't ever move entries. It attempts to maintain some compactness by always using the first free slot for a new connection, and mitigates the effect of non compactness by cheaply skipping over contiguous blocks of free entries. See the "theory of operation" comment in flow.c for details. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b [sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid Coverity Scan false positive] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Enforce that freeing of closed flows must happen in deferred handlers	David Gibson	2024-01-22	5	-15/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, flows are only evern finally freed (and the table compacted) from the deferred handlers. Some future ways we want to optimise managing the flow table will rely on this, so enforce it: rather than having the TCP code directly call flow_table_compact(), add a boolean return value to the per-flow deferred handlers. If true, this indicates that the flow code itself should free the flow. This forces all freeing of flows to occur during the flow code's scan of the table in flow_defer_handler() which opens possibilities for future optimisations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Abstract allocation of new flows with helper function	David Gibson	2024-01-22	3	-11/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp.c open codes the process of allocating a new flow from the flow table: twice, in fact, once for guest to host and once for host to guest connections. This duplication isn't ideal and will get worse as we add more protocols to the flow table. It also makes it harder to experiment with different ways of handling flow table allocation. Instead, introduce a function to allocate a new flow: flow_alloc(). In some cases we currently check if we're able to allocate, but delay the actual allocation. We now handle that slightly differently with a flow_alloc_cancel() function to back out a recent allocation. We have that separate from a flow_free() function, because future changes we have in mind will need to handle this case a little differently. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Move flow_count from context structure to a global	David Gibson	2024-01-22	7	-18/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global. Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Move flow_log_() to near top of flow.c	David Gibson	2024-01-22	1	-18/+18
\| \| \| \| \| \| \| \| \| \| \| \|	flow_log_() is a very basic widely used function that many other functions in flow.c will end up needing. At present it's below flow_table_compact() which happens not to need it, but that's likely to change. Move it to near the top of flow.c to avoid forward declarations. Code motion only, no changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets	David Gibson	2024-01-22	5	-39/+27
\| \| \| \| \| \| \| \| \| \| \|	Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Better handling of number of epoll types	David Gibson	2024-01-22	2	-3/+5
\| \| \| \| \| \| \| \| \|	As we already did for flow types, use an "EPOLL_NUM_TYPES" isntead of EPOLL_TYPE_MAX, which is a little bit safer and clearer. Add a static assert on the size of the matching names array. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add handling for per-flow timers	David Gibson	2024-01-22	4	-12/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the flow table so that it can run tcp_splice_timer() on each spliced connection. More generally, other flow types might want to run similar timers in future. We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc. However, this would need to scan the flow table, which we would have just done in flow_defer_handler(). We'd prefer to just scan the flow table once, dispatching both per-flow deferred events and per-flow timed events if necessary. So, extend flow_defer_handler() to do this. For now we use the same timer interval for all flow types (1s). We can make that more flexible in future if we need to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add flow-centric dispatch for deferred flow handling	David Gibson	2024-01-22	5	-17/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_defer_handler(), amongst other things, scans the flow table and does some processing for each TCP connection. When we add other protocols to the flow table, they're likely to want some similar scanning. It makes more sense for cache friendliness to perform a single scan of the flow table and dispatch to the protocol specific handlers, rather than having each protocol separately scan the table. To that end, add a new flow_defer_handler() handling all flow-linked deferred operations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Move per-type cleanup logic into per-type helpers	David Gibson	2024-01-22	3	-10/+14
\| \| \| \| \| \| \| \| \| \|	tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally on the connection being closed or closing. Move that logic into the "destroy" functions themselves, renaming them tcp_flow_defer() and tcp_splice_flow_defer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Remove redundant handling from tcp_timer()	David Gibson	2024-01-22	3	-19/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the connection table, expiring "tap" connections and calling tcp_splice_timer() for "splice" connections. tcp_splice_timer() expires spliced connections and then does some other processing. However, tcp_timer() is always called shortly after tcp_defer_handler() (from post_handler()), which also scans the flow table expiring both tap and spliced connections. So remove the redundant handling, and only do the extra tcp_splice_timer() work from tcp_timer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise on 'now' for current timestamp variables	David Gibson	2024-01-22	7	-37/+37
\| \| \| \| \| \| \| \| \|	In a number of places we pass around a struct timespec representing the (more or less) current time. Sometimes we call it 'now', and sometimes we call it 'ts'. Standardise on the more informative 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make flow_table.h #include the protocol specific headers it needs	David Gibson	2024-01-22	4	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	flow_table.h, the lower level flow header relies on having the struct definitions for every protocol specific flow type - so far that means tcp_conn.h. It doesn't include it itself, so tcp_conn.h must be included before flow_table.h. That's ok for now, but as we use the flow table for more things, flow_table.h will need the structs for all of them, which means the protocol specific .c files would need to include tcp_conn.h _and_ the equivalents for every other flow type before flow_table.h every time, which is weird. So, although we mostly lean towards the include style where .c files need to handle the include dependencies, in this case it makes more sense to have flow_table.h include all the protocol specific headers it needs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Remove unused pif_name() function	David Gibson	2024-01-16	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	pif_name() has no current callers, although we expect some as we expand the flow table support. I'm not sure why this didn't get caught by one of our static checkers earlier, but it's now causing cppcheck failures for me. Add a cppcheck suppression. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Make a bunch of pointer variables pointers to const	David Gibson	2024-01-16	11	-34/+41
\| \| \| \| \| \| \| \| \| \|	Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another warning for pointer variables which could be pointer to const but aren't. Use this to make a bunch of variables const pointers where they previously weren't for no particular reason. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Fix passt.mbuto for cases where /usr/sbin doesn't exist	David Gibson	2024-01-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	f0ccca74 ("test: make passt.mbuto script more robust") is supposed to make mbuto more robust by standardizing on always putting things in /usr/sbin with /sbin a symlink to it. This matters because different distros have different conventions about how the two are used. However, the logic there requires that /usr/sbin at least exists to start with. This isn't always the case with Fedora derived mbuto images. Ironically the DIRS variable ensures that /sbin exists, although we then remove it, but doesn't require /usr/sbin to exist. Fix that up so that the new logic will work with Fedora. Fixes: f0ccca741f64 ("test: make passt.mbuto script more robust") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Fetch most specific (longest prefix) address in nl_addr_get()2023_12_30.f091893	Stefano Brivio	2023-12-30	1	-5/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This happened in most cases implicitly before commit eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions"): while going through results from netlink, we would only copy an address into the provided return buffer if no address had been picked yet. Because of the insertion logic in the kernel (ipv6_link_dev_addr()), the first returned address would also be the one added last, and, in case of a Linux guest using a DHCPv6 client as well as SLAAC, that would be the address assigned via DHCPv6, because SLAAC happens before the DHCPv6 exchange. The effect of, instead, picking the last returned address (first assigned) is visible when passt or pasta runs nested, given that, by default, they advertise a prefix for SLAAC usage, plus an address via DHCPv6. The first level (L1 guest) would get a /64 address by means of SLAAC, and a /128 address via DHCPv6, the latter matching the address on the host. The second level (L2 guest) would also get two addresses: a /64 via SLAAC (same prefix as the host), and a /128 via DHCPv6, matching the the L1 SLAAC-assigned address, not the one obtained via DHCPv6. That is, none of the L2 addresses would match the address on the host. The whole point of having a DHCPv6 server is to avoid (implicit) NAT when possible, though. Fix this in a more explicit way than the behaviour we initially had: pick the first address among the set of most specific ones, by comparing prefix lengths. Do this for IPv4 and for link-local addresses, too, to match in any case the implementation of the default source address selection. Reported-by: Yalan Zhang <yalzhang@redhat.com> Fixes: eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Default SLAAC prefix comes from address (not prefix) on host	Stefano Brivio	2023-12-30	1	-7/+7
\| \| \| \| \|	Reported-by: Yalan Zhang <yalzhang@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Fix broken link to CentOS Stream package	Stefano Brivio	2023-12-30	1	-1/+1
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: make passt.mbuto script more robust	Jon Paul Maloy	2023-12-27	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \|	Creation of a symbolic link from /sbin to /usr/sbin fails if /sbin exists and is non-empty. This is the case on Ubuntu-23.04. We fix this by removing /sbin before creating the link. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: make tcp_sock_set_bufsize() static (again)	Laurent Vivier	2023-12-27	2	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \|	e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation") has exported tcp_sock_set_bufsize() to be able to use it in tcp_splice.c, but 6ccab72d9b40 has removed its use in tcp_splice.c, so we can set it static again. Fixes: 6ccab72d9b40 ("tcp: Improve handling of fallback if socket pool is empty on new splice") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Make sock_l4() treat empty string ifname like NULL	David Gibson	2023-12-27	3	-16/+7
\| \| \| \| \| \| \| \| \| \| \|	sock_l4() takes NULL for ifname if you don't want to bind the socket to a particular interface. However, for a number of the callers, it's more natural to use an empty string for that case. Change sock_l4() to accept either NULL or an empty string equivalently, and simplify some callers using that change. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Avoid in_addr_t	David Gibson	2023-12-27	2	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	IPv4 addresses can be stored in an in_addr_t or a struct in_addr. The former is just a type alias to a 32-bit integer, so doesn't really give us any type checking. Therefore we generally prefer the structure, since we mostly want to treat IP address as opaque objects. Fix a few places where we still use in_addr_t, but can just as easily use struct in_addr. Note there are still some uses of in_addr_t in conf.c, but those are justified: since they're doing prefix calculations, they actually need to look at the internals of the address as an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Avoid unnecessary handling of unspecified bind address	David Gibson	2023-12-27	1	-12/+4
\| \| \| \| \| \| \| \| \| \|	We go to some trouble, if the configured output address is unspecified, to pass NULL to sock_l4(). But while passing NULL is one way to get sock_l4() not to specify a bind address, passing the "any" address explicitly works too. Use this to simplify icmp_tap_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Drop explicit setting to INADDR_ANY/in6addr_any in sock_l4()	David Gibson	2023-12-27	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The original commit message says: --- Currently we initialise the address field of the sockaddrs we construct to the any/unspecified address, but not in a very clear way: we use explicit 0 values, which is only interpretable if you know the order of fields in the sockaddr structures. Use explicit field names, and explicit initialiser macros for the address. Because we initialise to this default value, we don't need to explicitly set the any/unspecified address later on if the caller didn't pass an overriding bind address. --- and the original patch modified the initialisation of addr4 and addr6: - instead of { 0 }, { 0 } for sin_addr and sin_zero, .sin_addr = IN4ADDR_ANY_INIT - instead of 0, IN6ADDR_ANY_INIT, 0: .sin6_addr = IN6ADDR_ANY_INIT but I dropped those hunks: they break gcc versions 7 to 9 as reported in eed6933e6c29 ("udp: Explicitly initialise sin6_scope_id and sin_zero in sockaddr_in{,6}"). I applied the rest of the changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Dropped first two hunks] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Use htonl_constant() in more places	David Gibson	2023-12-27	1	-2/+2
\| \| \| \| \| \| \| \|	We might as well when we're passing a known constant value, giving the compiler the best chance to optimise things away. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Add IN4ADDR_ANY_INIT macro	David Gibson	2023-12-27	2	-1/+4
\| \| \| \| \| \| \| \| \| \|	We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address, make a similar one for the unspecified / any address. This avoids messying things with the internal structure of struct in_addr where we don't care about it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use IN4ADDR_LOOPBACK_INIT more widely	David Gibson	2023-12-27	3	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address without delving into its internals. However there are some places we don't use it, and explicitly look at the internal structure of struct in_addr, which we generally want to avoid. Use the define more widely to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Fix address type for tcp_sock_init_af()	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This takes a struct in_addr * (i.e. an IPv4 address), although it's explicitly supposed to handle IPv6 as well. Both its caller and sock_l4() which it calls use a void * for the address, which can be either an in_addr or an in6_addr. We get away with this, because we don't do anything with the pointer other than transfer it from the caller to sock_l4(), but it's misleading. And quite possibly technically UB, because C is like that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Don't use linux/icmp.h when netinet/ip_icmp.h will do	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \|	In most places where we need to get ICMP definitions, we get them from <netinet/ip_icmp.h>. However in checksum.c we instead include <linux/icmp.h>. Change it to use <netinet/ip_icmp.h> for consistency. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't account for hash table size in tcp_hash()	David Gibson	2023-12-27	1	-13/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp_hash() returns the hash bucket for a value, that is the hash modulo the size of the hash table. Usually it's a bit more flexible to have hash functions return a "raw" hash value and perform the modulus in the callers. That allows the same hash function to be used for multiple tables of different sizes, or to re-use the hash for other purposes. We don't do anything like that with tcp_hash() at present, but we have some plans to do so. Prepare for that by making tcp_hash() and tcp_conn_hash() return raw hash values. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Implement hash table with indices rather than pointers	David Gibson	2023-12-27	2	-11/+33
\| \| \| \| \| \| \| \| \| \| \| \| \|	We implement our hash table with pointers to the entry for each bucket (or NULL). However, the entries are always allocated within the flow table, meaning that a flow index will suffice, halving the size of the hash table. For TCP, just a flow index would be enough, but future uses will want to expand the hash table to cover indexing either side of a flow, so use a flow_sidx_t as the type for each hash bucket. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Switch hash table to linear probing instead of chaining	David Gibson	2023-12-27	3	-56/+81
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we deal with hash collisions by letting a hash bucket contain multiple entries, forming a linked list using an index in the connection structure. That's a pretty standard and simple approach, but in our case we can use an even simpler one: linear probing. Here if a hash bucket is occupied we just move onto the next one until we find a feww one. This slightly simplifies lookup and more importantly saves some precious bytes in the connection structure by removing the need for a link. It does require some additional complexity for hash removal. This approach can perform poorly with hash table load is high. However, we already size our hash table of pointers larger than the connection table, which puts an upper bound on the load. It's relatively cheap to decrease that bound if we find we need to. I adapted the linear probing operations from Knuth's The Art of Computer Programming, Volume 3, 2nd Edition. Specifically Algorithm L and Algorithm R in Section 6.4. Note that there is an error in Algorithm R as printed, see errata at [0]. [0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Fix conceptually incorrect byte-order switch in tcp_tap_handler()	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_hash_lookup() expects the port numbers in host order, but the TCP header, of course, has them in network order, so we need to switch them. However we call htons() (host to network) instead of ntohs() (network to host). This works because those do the same thing in practice (they only wouldn't on very strange theoretical platforms which are neither big nor little endian). But, having this the "wrong" way around is misleading, so switch it around. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Update "Availability" section	Stefano Brivio	2023-12-27	1	-13/+11
\| \| \| \| \| \| \| \| \|	It's been a while -- there are now official packages for Arch Linux, Gentoo, Void Linux. Suggested-by: Rahil Bhimjiani <me@rahil.website> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Cast timeval fields to unsigned long long for printing	Stefano Brivio	2023-12-27	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \|	On x32, glibc defines time_t and suseconds_t (the latter, also known as __syscall_slong_t) as unsigned long long, whereas "everywhere else", including x86_64 and i686, those are unsigned long. See also https://sourceware.org/bugzilla/show_bug.cgi?id=16437 for all the gory details. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Add missing include, stdio.h	Stefano Brivio	2023-12-27	1	-0/+1
\| \| \| \| \| \| \|	Reported-by: lemmi <lemmi@nerd2nerd.org> Link: https://github.com/void-linux/void-packages/actions/runs/7097192513/job/19316903568 Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Select first reported IPv6 address for guest/host comparison	Stefano Brivio	2023-12-27	5	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we run passt nested (a guest connected via passt to a guest connected via passt to the host), the first guest (L1) typically has two IPv6 addresses on the same interface: one formed from the prefix assigned via SLAAC, and another one assigned via DHCPv6 (to match the address on the host). When we select addresses for comparison, in this case, we have multiple global unicast addresses -- again, on the same interface. Selecting the first reported one on both host and guest is not entirely correct (in theory, the order might differ), but works reasonably well. Use the trick from 5beef085978e ("test: Only select a single interface or gateway in tests") to ask jq(1) for the first address returned by the query. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ndp: Extend lifetime of prefix, router, RDNSS and search list	Stefano Brivio	2023-12-27	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, we have no mechanism to dynamically update IPv6 addressing, routing or DNS information (which should eventually be implemented via netlink monitor), so it makes no sense to limit lifetimes of NDP information to any particular value. If we do, with common configurations of systemd-networkd in a guest, we can end up in a situation where we have a /128 address assigned via DHCPv6, the NDP-assigned prefix expires, and the default route also expires. However, as there's a valid address, the prefix is not renewed. As a result, the default route becomes invalid and we lose it altogether, which implies that the guest loses IPv6 connectivity except for link-local communication. Set the router lifetime to the maximum allowed by RFC 8319, that is, 65535 seconds (about 18 hours). RFC 4861 limited this value to 9000 seconds, but RFC 8319 later updated this limit. Set prefix and DNS information lifetime to infinity. This is allowed by RFC 4861 and RFC 8319. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Make handling of shell prompts with escapes a little more reliable	David Gibson	2023-12-07	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When using the old-style "pane" methods of executing commands during the tests, we need to scan the shell output for prompts in order to tell when commands have finished. This is inherently unreliable because commands could output things that look like prompts, and prompts might not look like we expect them to. The only way to really fix this is to use a better way of dispatching commands, like the newer "context" system. However, it's awkward to convert everything to "context" right at the moment, so we're still relying on some tests that do work most of the time. It is, however, particularly sensitive to fancy coloured prompts using escape sequences. Currently we try to handle this by stripping actual ESC characters with tr, then looking for some common variants. We can do a bit better: instead strip all escape sequences using sed before looking for our prompt. Or, at least, any one using [a-zA-Z] as the terminating character. Strictly speaking ANSI escapes can be terminated by any character in 0x40..0x7e, which isn't easily expressed in a regexp. This should capture all common ones, though. With this transformation we can simplify the list of patterns we then look for as a prompt, removing some redundant variants. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't defer hash table removal2023_12_04.b86afe3	David Gibson	2023-12-04	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a TCP connection is closed, we mark it by setting events to CLOSED, then some time later we do final cleanups: closing sockets, removing from the hash table and so forth. This does mean that when making a hash lookup we need to exclude any apparent matches that are CLOSED, since they represent a stale connection. This can happen in practice if one connection closes and a new one with the same endpoints is started shortly afterward. Checking for CLOSED is quite specific to TCP however, and won't work when we extend the hash table to more general flows. So, alter the code to immediately remove the connection from the hash table when CLOSED, although we still defer closing sockets and other cleanup. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: "TCP" hash secret doesn't need to be TCP specific	David Gibson	2023-12-04	4	-35/+44
\| \| \| \| \| \| \| \| \| \| \| \| \|	The TCP state structure includes a 128-bit hash_secret which we use for SipHash calculations to mitigate attacks on the TCP hash table and initial sequence number. We have plans to use SipHash in places that aren't TCP related, and there's no particular reason they'd need their own secret. So move the hash_secret to the general context structure. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Add helpers to get the name of a pif	David Gibson	2023-12-04	3	-1/+42
\| \| \| \| \| \| \| \| \| \| \|	Future debugging will want to identify a specific passt interface. We make a distinction in these helpers between the name of the type of pif, and name of the pif itself. For the time being these are always the same thing, since we have at most instance of each type of pif. However, that might change in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Avoid hitting guestfish command length limits	David Gibson	2023-12-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In test/prepare-distro-img.sh we use guestfish to tweak our distro guest images to be suitable. Part of this is using a 'copy-in' directive to copy in the source files for passt itself. Currently we copy in all the files with a single 'copy-in', since it allows listing multiple files. However it turns out that the number of arguments it can accept is fairly limited and our current list of files is already very close to that limit. Instead, expand our list of files to one copy-in per file, avoiding that limitation. This isn't much slower, because all the commands still run in a single invocation of guestfish itself. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow,tcp: Use epoll_ref type including flow and side	David Gibson	2023-12-04	5	-30/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently TCP uses the 'flow' epoll_ref field for both connected sockets and timers, which consists of just the index of the relevant flow (connection). This is just fine for timers, for while it obviously works, it's subtly incomplete for sockets on spliced connections. In that case we want to know which side of the connection the event is occurring on as well as which connection. At present, we deduce that information by looking at the actual fd, and comparing it to the fds of the sockets on each side. When we use the flow table for more things, we expect more cases where something will need to know a specific side of a specific flow for an event, but nothing more. Therefore add a new 'flowside' epoll_ref field, with exactly that information. We use it for TCP connected sockets. This allows us to directly know the side for spliced connections. For "tap" connections, it's pretty meaningless, since the side is always the socket side. It still makes logical sense though, and it may become important for future flow table work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>