passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	epoll: Better handling of number of epoll types	David Gibson	2024-01-22	2	-3/+5
\| \| \| \| \| \| \| \| \|	As we already did for flow types, use an "EPOLL_NUM_TYPES" isntead of EPOLL_TYPE_MAX, which is a little bit safer and clearer. Add a static assert on the size of the matching names array. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add handling for per-flow timers	David Gibson	2024-01-22	4	-12/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the flow table so that it can run tcp_splice_timer() on each spliced connection. More generally, other flow types might want to run similar timers in future. We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc. However, this would need to scan the flow table, which we would have just done in flow_defer_handler(). We'd prefer to just scan the flow table once, dispatching both per-flow deferred events and per-flow timed events if necessary. So, extend flow_defer_handler() to do this. For now we use the same timer interval for all flow types (1s). We can make that more flexible in future if we need to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add flow-centric dispatch for deferred flow handling	David Gibson	2024-01-22	5	-17/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_defer_handler(), amongst other things, scans the flow table and does some processing for each TCP connection. When we add other protocols to the flow table, they're likely to want some similar scanning. It makes more sense for cache friendliness to perform a single scan of the flow table and dispatch to the protocol specific handlers, rather than having each protocol separately scan the table. To that end, add a new flow_defer_handler() handling all flow-linked deferred operations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Move per-type cleanup logic into per-type helpers	David Gibson	2024-01-22	3	-10/+14
\| \| \| \| \| \| \| \| \| \|	tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally on the connection being closed or closing. Move that logic into the "destroy" functions themselves, renaming them tcp_flow_defer() and tcp_splice_flow_defer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Remove redundant handling from tcp_timer()	David Gibson	2024-01-22	3	-19/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the connection table, expiring "tap" connections and calling tcp_splice_timer() for "splice" connections. tcp_splice_timer() expires spliced connections and then does some other processing. However, tcp_timer() is always called shortly after tcp_defer_handler() (from post_handler()), which also scans the flow table expiring both tap and spliced connections. So remove the redundant handling, and only do the extra tcp_splice_timer() work from tcp_timer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise on 'now' for current timestamp variables	David Gibson	2024-01-22	7	-37/+37
\| \| \| \| \| \| \| \| \|	In a number of places we pass around a struct timespec representing the (more or less) current time. Sometimes we call it 'now', and sometimes we call it 'ts'. Standardise on the more informative 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make flow_table.h #include the protocol specific headers it needs	David Gibson	2024-01-22	4	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	flow_table.h, the lower level flow header relies on having the struct definitions for every protocol specific flow type - so far that means tcp_conn.h. It doesn't include it itself, so tcp_conn.h must be included before flow_table.h. That's ok for now, but as we use the flow table for more things, flow_table.h will need the structs for all of them, which means the protocol specific .c files would need to include tcp_conn.h _and_ the equivalents for every other flow type before flow_table.h every time, which is weird. So, although we mostly lean towards the include style where .c files need to handle the include dependencies, in this case it makes more sense to have flow_table.h include all the protocol specific headers it needs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Remove unused pif_name() function	David Gibson	2024-01-16	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	pif_name() has no current callers, although we expect some as we expand the flow table support. I'm not sure why this didn't get caught by one of our static checkers earlier, but it's now causing cppcheck failures for me. Add a cppcheck suppression. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Make a bunch of pointer variables pointers to const	David Gibson	2024-01-16	11	-34/+41
\| \| \| \| \| \| \| \| \| \|	Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another warning for pointer variables which could be pointer to const but aren't. Use this to make a bunch of variables const pointers where they previously weren't for no particular reason. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Fix passt.mbuto for cases where /usr/sbin doesn't exist	David Gibson	2024-01-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	f0ccca74 ("test: make passt.mbuto script more robust") is supposed to make mbuto more robust by standardizing on always putting things in /usr/sbin with /sbin a symlink to it. This matters because different distros have different conventions about how the two are used. However, the logic there requires that /usr/sbin at least exists to start with. This isn't always the case with Fedora derived mbuto images. Ironically the DIRS variable ensures that /sbin exists, although we then remove it, but doesn't require /usr/sbin to exist. Fix that up so that the new logic will work with Fedora. Fixes: f0ccca741f64 ("test: make passt.mbuto script more robust") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Fetch most specific (longest prefix) address in nl_addr_get()2023_12_30.f091893	Stefano Brivio	2023-12-30	1	-5/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This happened in most cases implicitly before commit eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions"): while going through results from netlink, we would only copy an address into the provided return buffer if no address had been picked yet. Because of the insertion logic in the kernel (ipv6_link_dev_addr()), the first returned address would also be the one added last, and, in case of a Linux guest using a DHCPv6 client as well as SLAAC, that would be the address assigned via DHCPv6, because SLAAC happens before the DHCPv6 exchange. The effect of, instead, picking the last returned address (first assigned) is visible when passt or pasta runs nested, given that, by default, they advertise a prefix for SLAAC usage, plus an address via DHCPv6. The first level (L1 guest) would get a /64 address by means of SLAAC, and a /128 address via DHCPv6, the latter matching the address on the host. The second level (L2 guest) would also get two addresses: a /64 via SLAAC (same prefix as the host), and a /128 via DHCPv6, matching the the L1 SLAAC-assigned address, not the one obtained via DHCPv6. That is, none of the L2 addresses would match the address on the host. The whole point of having a DHCPv6 server is to avoid (implicit) NAT when possible, though. Fix this in a more explicit way than the behaviour we initially had: pick the first address among the set of most specific ones, by comparing prefix lengths. Do this for IPv4 and for link-local addresses, too, to match in any case the implementation of the default source address selection. Reported-by: Yalan Zhang <yalzhang@redhat.com> Fixes: eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Default SLAAC prefix comes from address (not prefix) on host	Stefano Brivio	2023-12-30	1	-7/+7
\| \| \| \| \|	Reported-by: Yalan Zhang <yalzhang@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Fix broken link to CentOS Stream package	Stefano Brivio	2023-12-30	1	-1/+1
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: make passt.mbuto script more robust	Jon Paul Maloy	2023-12-27	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \|	Creation of a symbolic link from /sbin to /usr/sbin fails if /sbin exists and is non-empty. This is the case on Ubuntu-23.04. We fix this by removing /sbin before creating the link. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: make tcp_sock_set_bufsize() static (again)	Laurent Vivier	2023-12-27	2	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \|	e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation") has exported tcp_sock_set_bufsize() to be able to use it in tcp_splice.c, but 6ccab72d9b40 has removed its use in tcp_splice.c, so we can set it static again. Fixes: 6ccab72d9b40 ("tcp: Improve handling of fallback if socket pool is empty on new splice") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Make sock_l4() treat empty string ifname like NULL	David Gibson	2023-12-27	3	-16/+7
\| \| \| \| \| \| \| \| \| \| \|	sock_l4() takes NULL for ifname if you don't want to bind the socket to a particular interface. However, for a number of the callers, it's more natural to use an empty string for that case. Change sock_l4() to accept either NULL or an empty string equivalently, and simplify some callers using that change. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Avoid in_addr_t	David Gibson	2023-12-27	2	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	IPv4 addresses can be stored in an in_addr_t or a struct in_addr. The former is just a type alias to a 32-bit integer, so doesn't really give us any type checking. Therefore we generally prefer the structure, since we mostly want to treat IP address as opaque objects. Fix a few places where we still use in_addr_t, but can just as easily use struct in_addr. Note there are still some uses of in_addr_t in conf.c, but those are justified: since they're doing prefix calculations, they actually need to look at the internals of the address as an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Avoid unnecessary handling of unspecified bind address	David Gibson	2023-12-27	1	-12/+4
\| \| \| \| \| \| \| \| \| \|	We go to some trouble, if the configured output address is unspecified, to pass NULL to sock_l4(). But while passing NULL is one way to get sock_l4() not to specify a bind address, passing the "any" address explicitly works too. Use this to simplify icmp_tap_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Drop explicit setting to INADDR_ANY/in6addr_any in sock_l4()	David Gibson	2023-12-27	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The original commit message says: --- Currently we initialise the address field of the sockaddrs we construct to the any/unspecified address, but not in a very clear way: we use explicit 0 values, which is only interpretable if you know the order of fields in the sockaddr structures. Use explicit field names, and explicit initialiser macros for the address. Because we initialise to this default value, we don't need to explicitly set the any/unspecified address later on if the caller didn't pass an overriding bind address. --- and the original patch modified the initialisation of addr4 and addr6: - instead of { 0 }, { 0 } for sin_addr and sin_zero, .sin_addr = IN4ADDR_ANY_INIT - instead of 0, IN6ADDR_ANY_INIT, 0: .sin6_addr = IN6ADDR_ANY_INIT but I dropped those hunks: they break gcc versions 7 to 9 as reported in eed6933e6c29 ("udp: Explicitly initialise sin6_scope_id and sin_zero in sockaddr_in{,6}"). I applied the rest of the changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Dropped first two hunks] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Use htonl_constant() in more places	David Gibson	2023-12-27	1	-2/+2
\| \| \| \| \| \| \| \|	We might as well when we're passing a known constant value, giving the compiler the best chance to optimise things away. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Add IN4ADDR_ANY_INIT macro	David Gibson	2023-12-27	2	-1/+4
\| \| \| \| \| \| \| \| \| \|	We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address, make a similar one for the unspecified / any address. This avoids messying things with the internal structure of struct in_addr where we don't care about it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use IN4ADDR_LOOPBACK_INIT more widely	David Gibson	2023-12-27	3	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address without delving into its internals. However there are some places we don't use it, and explicitly look at the internal structure of struct in_addr, which we generally want to avoid. Use the define more widely to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Fix address type for tcp_sock_init_af()	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This takes a struct in_addr * (i.e. an IPv4 address), although it's explicitly supposed to handle IPv6 as well. Both its caller and sock_l4() which it calls use a void * for the address, which can be either an in_addr or an in6_addr. We get away with this, because we don't do anything with the pointer other than transfer it from the caller to sock_l4(), but it's misleading. And quite possibly technically UB, because C is like that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Don't use linux/icmp.h when netinet/ip_icmp.h will do	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \|	In most places where we need to get ICMP definitions, we get them from <netinet/ip_icmp.h>. However in checksum.c we instead include <linux/icmp.h>. Change it to use <netinet/ip_icmp.h> for consistency. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't account for hash table size in tcp_hash()	David Gibson	2023-12-27	1	-13/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp_hash() returns the hash bucket for a value, that is the hash modulo the size of the hash table. Usually it's a bit more flexible to have hash functions return a "raw" hash value and perform the modulus in the callers. That allows the same hash function to be used for multiple tables of different sizes, or to re-use the hash for other purposes. We don't do anything like that with tcp_hash() at present, but we have some plans to do so. Prepare for that by making tcp_hash() and tcp_conn_hash() return raw hash values. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Implement hash table with indices rather than pointers	David Gibson	2023-12-27	2	-11/+33
\| \| \| \| \| \| \| \| \| \| \| \| \|	We implement our hash table with pointers to the entry for each bucket (or NULL). However, the entries are always allocated within the flow table, meaning that a flow index will suffice, halving the size of the hash table. For TCP, just a flow index would be enough, but future uses will want to expand the hash table to cover indexing either side of a flow, so use a flow_sidx_t as the type for each hash bucket. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Switch hash table to linear probing instead of chaining	David Gibson	2023-12-27	3	-56/+81
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we deal with hash collisions by letting a hash bucket contain multiple entries, forming a linked list using an index in the connection structure. That's a pretty standard and simple approach, but in our case we can use an even simpler one: linear probing. Here if a hash bucket is occupied we just move onto the next one until we find a feww one. This slightly simplifies lookup and more importantly saves some precious bytes in the connection structure by removing the need for a link. It does require some additional complexity for hash removal. This approach can perform poorly with hash table load is high. However, we already size our hash table of pointers larger than the connection table, which puts an upper bound on the load. It's relatively cheap to decrease that bound if we find we need to. I adapted the linear probing operations from Knuth's The Art of Computer Programming, Volume 3, 2nd Edition. Specifically Algorithm L and Algorithm R in Section 6.4. Note that there is an error in Algorithm R as printed, see errata at [0]. [0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Fix conceptually incorrect byte-order switch in tcp_tap_handler()	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_hash_lookup() expects the port numbers in host order, but the TCP header, of course, has them in network order, so we need to switch them. However we call htons() (host to network) instead of ntohs() (network to host). This works because those do the same thing in practice (they only wouldn't on very strange theoretical platforms which are neither big nor little endian). But, having this the "wrong" way around is misleading, so switch it around. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Update "Availability" section	Stefano Brivio	2023-12-27	1	-13/+11
\| \| \| \| \| \| \| \| \|	It's been a while -- there are now official packages for Arch Linux, Gentoo, Void Linux. Suggested-by: Rahil Bhimjiani <me@rahil.website> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Cast timeval fields to unsigned long long for printing	Stefano Brivio	2023-12-27	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \|	On x32, glibc defines time_t and suseconds_t (the latter, also known as __syscall_slong_t) as unsigned long long, whereas "everywhere else", including x86_64 and i686, those are unsigned long. See also https://sourceware.org/bugzilla/show_bug.cgi?id=16437 for all the gory details. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Add missing include, stdio.h	Stefano Brivio	2023-12-27	1	-0/+1
\| \| \| \| \| \| \|	Reported-by: lemmi <lemmi@nerd2nerd.org> Link: https://github.com/void-linux/void-packages/actions/runs/7097192513/job/19316903568 Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Select first reported IPv6 address for guest/host comparison	Stefano Brivio	2023-12-27	5	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we run passt nested (a guest connected via passt to a guest connected via passt to the host), the first guest (L1) typically has two IPv6 addresses on the same interface: one formed from the prefix assigned via SLAAC, and another one assigned via DHCPv6 (to match the address on the host). When we select addresses for comparison, in this case, we have multiple global unicast addresses -- again, on the same interface. Selecting the first reported one on both host and guest is not entirely correct (in theory, the order might differ), but works reasonably well. Use the trick from 5beef085978e ("test: Only select a single interface or gateway in tests") to ask jq(1) for the first address returned by the query. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ndp: Extend lifetime of prefix, router, RDNSS and search list	Stefano Brivio	2023-12-27	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, we have no mechanism to dynamically update IPv6 addressing, routing or DNS information (which should eventually be implemented via netlink monitor), so it makes no sense to limit lifetimes of NDP information to any particular value. If we do, with common configurations of systemd-networkd in a guest, we can end up in a situation where we have a /128 address assigned via DHCPv6, the NDP-assigned prefix expires, and the default route also expires. However, as there's a valid address, the prefix is not renewed. As a result, the default route becomes invalid and we lose it altogether, which implies that the guest loses IPv6 connectivity except for link-local communication. Set the router lifetime to the maximum allowed by RFC 8319, that is, 65535 seconds (about 18 hours). RFC 4861 limited this value to 9000 seconds, but RFC 8319 later updated this limit. Set prefix and DNS information lifetime to infinity. This is allowed by RFC 4861 and RFC 8319. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test: Make handling of shell prompts with escapes a little more reliable	David Gibson	2023-12-07	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When using the old-style "pane" methods of executing commands during the tests, we need to scan the shell output for prompts in order to tell when commands have finished. This is inherently unreliable because commands could output things that look like prompts, and prompts might not look like we expect them to. The only way to really fix this is to use a better way of dispatching commands, like the newer "context" system. However, it's awkward to convert everything to "context" right at the moment, so we're still relying on some tests that do work most of the time. It is, however, particularly sensitive to fancy coloured prompts using escape sequences. Currently we try to handle this by stripping actual ESC characters with tr, then looking for some common variants. We can do a bit better: instead strip all escape sequences using sed before looking for our prompt. Or, at least, any one using [a-zA-Z] as the terminating character. Strictly speaking ANSI escapes can be terminated by any character in 0x40..0x7e, which isn't easily expressed in a regexp. This should capture all common ones, though. With this transformation we can simplify the list of patterns we then look for as a prompt, removing some redundant variants. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't defer hash table removal2023_12_04.b86afe3	David Gibson	2023-12-04	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a TCP connection is closed, we mark it by setting events to CLOSED, then some time later we do final cleanups: closing sockets, removing from the hash table and so forth. This does mean that when making a hash lookup we need to exclude any apparent matches that are CLOSED, since they represent a stale connection. This can happen in practice if one connection closes and a new one with the same endpoints is started shortly afterward. Checking for CLOSED is quite specific to TCP however, and won't work when we extend the hash table to more general flows. So, alter the code to immediately remove the connection from the hash table when CLOSED, although we still defer closing sockets and other cleanup. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: "TCP" hash secret doesn't need to be TCP specific	David Gibson	2023-12-04	4	-35/+44
\| \| \| \| \| \| \| \| \| \| \| \| \|	The TCP state structure includes a 128-bit hash_secret which we use for SipHash calculations to mitigate attacks on the TCP hash table and initial sequence number. We have plans to use SipHash in places that aren't TCP related, and there's no particular reason they'd need their own secret. So move the hash_secret to the general context structure. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Add helpers to get the name of a pif	David Gibson	2023-12-04	3	-1/+42
\| \| \| \| \| \| \| \| \| \| \|	Future debugging will want to identify a specific passt interface. We make a distinction in these helpers between the name of the type of pif, and name of the pif itself. For the time being these are always the same thing, since we have at most instance of each type of pif. However, that might change in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Avoid hitting guestfish command length limits	David Gibson	2023-12-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In test/prepare-distro-img.sh we use guestfish to tweak our distro guest images to be suitable. Part of this is using a 'copy-in' directive to copy in the source files for passt itself. Currently we copy in all the files with a single 'copy-in', since it allows listing multiple files. However it turns out that the number of arguments it can accept is fairly limited and our current list of files is already very close to that limit. Instead, expand our list of files to one copy-in per file, avoiding that limitation. This isn't much slower, because all the commands still run in a single invocation of guestfish itself. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow,tcp: Use epoll_ref type including flow and side	David Gibson	2023-12-04	5	-30/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently TCP uses the 'flow' epoll_ref field for both connected sockets and timers, which consists of just the index of the relevant flow (connection). This is just fine for timers, for while it obviously works, it's subtly incomplete for sockets on spliced connections. In that case we want to know which side of the connection the event is occurring on as well as which connection. At present, we deduce that information by looking at the actual fd, and comparing it to the fds of the sockets on each side. When we use the flow table for more things, we expect more cases where something will need to know a specific side of a specific flow for an event, but nothing more. Therefore add a new 'flowside' epoll_ref field, with exactly that information. We use it for TCP connected sockets. This allows us to directly know the side for spliced connections. For "tap" connections, it's pretty meaningless, since the side is always the socket side. It still makes logical sense though, and it may become important for future flow table work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Use unsigned to represent side	David Gibson	2023-12-04	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, we use 'int' values to represent the "side" of a connection, which must always be 0 or 1. This turns out to be dangerous. In some cases we're going to want to put the side into a 1-bit bitfield. However, if that bitfield has type 'int', when we copy it out to a regular 'int' variable, it will be sign-extended and so have values 0 and -1, instead of 0 and 1. To avoid this, always use unsigned variables for the side. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow,tcp: Generalise TCP epoll_ref to generic flows	David Gibson	2023-12-04	3	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP uses three different epoll object types: one for connected sockets, one for timers and one for listening sockets. Listening sockets really need information that's specific to TCP, so need their own epoll_ref field. Timers and connected sockets, however, only need the connection (flow) they're associated with. As we expand the use of the flow table, we expect that to be true for more epoll fds. So, rename the "TCP" epoll_ref field to be a "flow" epoll_ref field that can be used both for TCP and for other future cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove unneccessary bounds check in tcp_timer_handler()	David Gibson	2023-12-04	1	-2/+2
\| \| \| \| \| \| \| \| \|	In tcp_timer_handler() we use conn_at_idx() to interpret the flow index from the epoll reference. However, this will never be NULL - we always put a valid index into the epoll_ref. Simplify slightly based on this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Introduce 'sidx' type to represent one side of one flow	David Gibson	2023-12-04	2	-0/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a number of places, we use indices into the flow table to identify a specific flow. We also have cases where we need to identify a particular side of a particular flow, and we expect those to become more common as we generalise the flow table to cover more things. To assist with that, introduces flow_sidx_t, an index type which identifies a specific side of a specific flow in the table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Suppress false cppcheck positive in flow_sidx()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add logging helpers for connection related messages	David Gibson	2023-12-04	4	-79/+96
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Most of the messages logged by the TCP code (be they errors, debug or trace messages) are related to a specific connection / flow. We're fairly consistent about prefixing these with the type of connection and the connection / flow index. However there are a few places where we put the index later in the message or omit it entirely. The template with the prefix is also a little bulky to carry around for every message, particularly for spliced connections. To help keep this consistent, introduce some helpers to log messages linked to a specific flow. It takes the flow as a parameter and adds a uniform prefix to each message. This makes things slightly neater now, but more importantly will help keep formatting consistent as we add more things to the flow table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make unified version of flow table compaction	David Gibson	2023-12-04	5	-44/+48
\| \| \| \| \| \| \| \| \| \| \|	tcp_table_compact() will move entries in the connection/flow table to keep it compact when other entries are removed. The moved entries need not have the same type as the flow removed, so it needs to be able to handle moving any type of flow. Therefore, move it to flow.c rather than being purportedly TCP specific. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: MAX_FROM_BITS() should be unsigned	David Gibson	2023-12-04	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MAX_FROM_BITS() computes the maximum value representable in a number of bits. The expression for that is an unsigned value, but we explicitly cast it to a signed int. It looks like this is because one of the main users is for FD_REF_MAX, which is used to bound fd values, typically stored as a signed int. The value MAX_FROM_BITS() is calculating is naturally non-negative, though, so it makes more sense for it to be unsigned, and to move the case to the definition of FD_REF_MAX. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Consolidate flow pointer<->index helpers	David Gibson	2023-12-04	4	-46/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index of their connection structures in the connection table, now become the unified flow table. We can easily combine these into a common helper. While we're there, add some trickery for some additional type safety. They also define their own CONN() versions, which aren't so easily combined since they need to return different types, but we can have them use a common helper. In the process, we standardise on always using an unsigned type to store the connection / flow index, which makes more sense. tcp.c's conn_at_idx() remains for now, but we change its parameter to unsigned to match. That in turn means we can remove a check for negative values from it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Move TCP connection table to unified flow table	David Gibson	2023-12-04	9	-82/+107
\| \| \| \| \| \| \| \| \| \| \| \| \|	We want to generalise "connection" tracking to things other than true TCP connections. Continue implenenting this by renaming the TCP connection table to the "flow table" and moving it to flow.c. The definitions are split between flow.h and flow_table.h - we need this separation to avoid circular dependencies: the definitions in flow.h will be needed by many headers using the flow mechanism, but flow_table.h needs all those protocol specific headers in order to define the full flow table entry. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Generalise connection types	David Gibson	2023-12-04	6	-40/+112
\| \| \| \| \| \| \| \| \| \| \| \|	Currently TCP connections use a 1-bit selector, 'spliced', to determine the rest of the contents of the structure. We want to generalise the TCP connection table to other types of flows in other protocols. Make a start on this by replacing the tcp_conn_common structure with a new flow_common structure with an enum rather than a simple boolean indicating the type of flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Add messages to static_assert() calls	David Gibson	2023-12-04	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A while ago, we updated passt to require C11, for several reasons, but one was to be able to use static_assert() for build time checks. The C11 version of static_assert() requires a message to print in case of failure as well as the test condition it self. It was extended in C23 to make the message optional, but we don't want to require C23 at this time. Unfortunately we missed that in some of the static_assert()s we already added which use the C23 form without a message. clang-tidy has a warning for this, but for some reason it's not seeming to trigger in the current cases (but did for some I'm working on adding). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>