passt - Plug A Simple Socket Transport

	Commit message	Author	Age	Files	Lines
...
*	log: setlogmask(0) can actually result in a system call, don't use it	Stefano Brivio	2024-02-14	2	-13/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before commit 32d07f5e59f2 ("passt, pasta: Completely avoid dynamic memory allocation"), we didn't store the current log mask in a variable, and we fetched it using setlogmask(0) wherever needed. But after that commit, we can use our log_mask copy instead. And we should: with recent glibc versions, setlogmask(0) actually results in a system call, which causes a substantial overhead with high transfer rates: we use setlogmask(0) even to decide we don't want to print debug messages. Now that we rely on log_mask in early stages, before setlogmask() is called, we need to initialise that variable to the special LOG_EMERG mask value right away: define LOG_EARLY to make this clearer, and, while at it, group conditions in vlogmsg() into something more terse. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
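The idea of consulting a local copy of the mask, rather than calling setlogmask(0) for every message, can be sketched as below. This is a minimal illustration and not the code from log.c: the helper names log_set_mask() and log_wanted() are hypothetical, and the definition of LOG_EARLY as LOG_MASK(LOG_EMERG) is an assumption based on the description above.

```c
#include <stdbool.h>
#include <syslog.h>

/* Assumed value: only LOG_EMERG enabled until the real mask is configured */
#define LOG_EARLY	LOG_MASK(LOG_EMERG)

/* Local copy of the mask: reading it never needs a system call */
static int log_mask = LOG_EARLY;

/* Hypothetical helper: keep the libc mask and our copy in sync */
static void log_set_mask(int mask)
{
	log_mask = mask;
	setlogmask(mask);
}

/* Hypothetical helper: is a message of level @pri (no facility bits)
 * wanted? Decided from the cached copy only.
 */
static bool log_wanted(int pri)
{
	return log_mask & LOG_MASK(pri);
}
```

With this shape, the hot path that decides not to print a debug message is a single bitwise test on a static variable.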
*	tcp: Fix subtle bug in fast re-transmit path	David Gibson	2024-02-11	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a duplicate ack from the tap side triggers a fast re-transmit, we set both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of the duplicate ack. Setting seq_to_tap is correct: this is what triggers the retransmit from this point onwards. Setting seq_ack_from_tap is not correct, though. In most cases setting seq_ack_from_tap will be redundant but harmless: it will have already been updated to the same value by tcp_update_seqack_from_tap() a few lines above. However that call can be skipped if tcp_sock_consume() fails, which is rare but possible. In that case this update will cause problems. We use seq_ack_from_tap to track two logically distinct things: how much of the stream has been acked by the guest, and how much of the stream from the socket has been read and discarded (as opposed to MSG_PEEKed). We attempt to keep those values the same, because we discard data exactly when it is acked by the guest. However tcp_sock_consume() failing means we weren't able to discard the acked data. To handle that case, we skip the usual update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get one which supersedes it soon enough. Setting seq_ack_from_tap in the fast retransmit path, however, means we now really will have the read/discard point in the stream out of sync with seq_ack_from_tap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Add support to fetch default gateway from multipath routes	Stefano Brivio	2024-02-09	2	-5/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the default route for a given IP version is a multipath one, instead of refusing to start because there's no RTA_GATEWAY attribute in the set returned by the kernel, we can just pick one of the paths. To make this somewhat less arbitrary, pick the path with the highest weight, if weights differ. Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/20927 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	icmp: Dedicated functions for starting and closing ping sequences	David Gibson	2024-01-22	1	-35/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	ICMP sockets are cleaned up on a timeout implemented in icmp_timer_one(), and the logic to do that cleanup is open coded in that function. Similarly new sockets are opened when we discover we don't have an existing one in icmp_tap_handler(), and again the logic is open-coded. That's not the worst thing, but it's a bit cleaner to have dedicated functions for the creation and destruction of ping sockets. This will also make things a bit easier for future changes we have in mind. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Validate packets received on ping sockets	David Gibson	2024-01-22	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \|	We access fields of packets received from ping sockets assuming they're echo replies, without actually checking that. Of course, we don't expect anything else from the kernel, but it's probably best to verify. While we're at it, also check for short packets, or a receive address of the wrong family. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Warn on receive errors from ping sockets	David Gibson	2024-01-22	1	-1/+4
\| \| \| \| \| \| \| \| \|	Currently we silently ignore any errors receiving a packet from a ping socket. We don't expect that to happen, so it's probably worth reporting if it does. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Consolidate icmp_sock_handler() with icmpv6_sock_handler()	David Gibson	2024-01-22	3	-59/+37
\| \| \| \| \| \| \| \| \| \| \|	Currently we have separate handlers for ICMP and ICMPv6 ping replies. Although there are a number of points of difference, with some creative refactoring we can combine these together sensibly. Although it doesn't save a vast amount of code, it does make it clearer that we're performing basically the same steps for each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Share more between IPv4 and IPv6 paths in icmp_tap_handler()	David Gibson	2024-01-22	1	-68/+68
\| \| \| \| \| \| \| \| \| \|	Currently icmp_tap_handler() consists of two almost disjoint paths for the IPv4 and IPv6 cases. The only thing they share is an error message. We can use some intermediate variables to refactor this to share some more code between those paths. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Simplify socket expiry scanning	David Gibson	2024-01-22	2	-33/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we use icmp_act[] to scan for ICMP ids which might have an open socket which could time out. However icmp_act[] contains no information that's not already in icmp_id_map[] - it's just an "index" which allows scanning for relevant entries with less cache footprint. We only scan for ICMP socket expiry every 1s, though, so it's not clear that cache footprint really matters. Furthermore, there's no strong reason we need to scan even that often - the timeout is fairly arbitrary and approximate. So, eliminate icmp_act[] in favour of directly scanning icmp_id_map[] and compensate for the cache impact by reducing the scan frequency to once every 10s. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Use -1 to represent "missing" sockets	David Gibson	2024-01-22	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \|	icmp_id_map[] contains, amongst other things, fds for "ping" sockets associated with various ICMP echo ids. However, we only lazily open() those sockets, so many will be missing. We currently represent that with a 0, which isn't great, since that's technically a valid fd. Use -1 instead. This does require initializing the fields in icmp_id_map[] but we already have an obvious place to do that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't attempt to match host IDs to guest IDs	David Gibson	2024-01-22	1	-12/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When forwarding pings from tap, currently we create a ping socket with a socket address whose port is set to the ID of the ping received from the guest. This causes the socket to send pings with the same ID on the host. Although this seems like a good idea for maximum transparency, it's probably unwise. First, it's fallible - the bind() could fail, and we already have fallback logic which will overwrite the packets with the expected guest id if the id we get on replies doesn't already match. We might as well do that unconditionally. But more importantly, we don't know what else on the host might be using ping sockets, so we could end up with an ID that's the same as an existing socket. You'd expect that to fail the bind() with EADDRINUSE, which would be fine: we'd fall back to rewriting the reply ids. However it appears the kernel (v6.6.3 at least), does not fail the bind() and instead it's "last socket wins" in terms of who gets the replies. So we could accidentally intercept ping replies for something else on the host. So, instead of using bind() to set the id, just let the kernel pick one and expect to translate the replies back. Although theoretically this makes the passt/pasta link a bit less "transparent", essentially nothing cares about specific ping IDs, much like TCP source ports, which we also don't preserve. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't attempt to handle "wrong direction" ping socket traffic	David Gibson	2024-01-22	1	-10/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Linux ICMP "ping" sockets are very specific in what they do. They let userspace send ping requests (ICMP_ECHO or ICMP6_ECHO_REQUEST), and receive matching replies (ICMP_ECHOREPLY or ICMP6_ECHO_REPLY). They don't let you intercept or handle incoming ping requests. In the case of passt/pasta that means we can process echo requests from tap and forward them to a ping socket, then take the replies from the ping socket and forward them to tap. We can't do the reverse: take echo requests from the host and somehow forward them to the guest. There's really no way for something outside to initiate a ping to a passt/pasta connected guest and if there was we'd need an entirely different mechanism to handle it. However, we have some logic to deal with packets going in that reverse direction. Remove it, since it can't ever be used that way. While we're there use defines for the ICMPv6 types, instead of open coded type values. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Remove redundant initialisation of sendto() address	David Gibson	2024-01-22	1	-2/+0
\| \| \| \| \| \| \| \| \|	We initialise the address portion of the sockaddr for sendto() to the unspecified address, but then always overwrite it with the actual destination address before we call the sendto(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't set "port" on destination sockaddr for ping sockets	David Gibson	2024-01-22	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We set the port to the ICMP id on the sendto() address when using ICMP ping sockets. However, this has no effect: the ICMP id the kernel uses is determined only by the "port" on the socket's bound address (which is constructed inside sock_l4(), using the id we also pass to it). For unclear reasons this change triggers cppcheck 2.13.0 to give new "variable could be const pointer" warnings, so make *ih const as well to fix that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Avoid moving flow entries to compact table	David Gibson	2024-01-22	7	-87/+167
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we always keep the flow table maximally compact: that is all the active entries are contiguous at the start of the table. Doing this sometimes requires moving an entry when one is freed. That's kind of fiddly, and potentially expensive: it requires updating the hash table for the new location, and depending on flow type, it may require EPOLL_CTL_MOD system calls to update epoll tags with the new location too. Implement a new way of managing the flow table that doesn't ever move entries. It attempts to maintain some compactness by always using the first free slot for a new connection, and mitigates the effect of non-compactness by cheaply skipping over contiguous blocks of free entries. See the "theory of operation" comment in flow.c for details. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid Coverity Scan false positive] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Enforce that freeing of closed flows must happen in deferred handlers	David Gibson	2024-01-22	5	-15/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, flows are only ever finally freed (and the table compacted) from the deferred handlers. Some future ways we want to optimise managing the flow table will rely on this, so enforce it: rather than having the TCP code directly call flow_table_compact(), add a boolean return value to the per-flow deferred handlers. If true, this indicates that the flow code itself should free the flow. This forces all freeing of flows to occur during the flow code's scan of the table in flow_defer_handler() which opens possibilities for future optimisations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Abstract allocation of new flows with helper function	David Gibson	2024-01-22	3	-11/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp.c open codes the process of allocating a new flow from the flow table: twice, in fact, once for guest to host and once for host to guest connections. This duplication isn't ideal and will get worse as we add more protocols to the flow table. It also makes it harder to experiment with different ways of handling flow table allocation. Instead, introduce a function to allocate a new flow: flow_alloc(). In some cases we currently check if we're able to allocate, but delay the actual allocation. We now handle that slightly differently with a flow_alloc_cancel() function to back out a recent allocation. We have that separate from a flow_free() function, because future changes we have in mind will need to handle this case a little differently. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Move flow_count from context structure to a global	David Gibson	2024-01-22	7	-18/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global. Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Move flow_log_() to near top of flow.c	David Gibson	2024-01-22	1	-18/+18
\| \| \| \| \| \| \| \| \| \| \| \|	flow_log_() is a very basic widely used function that many other functions in flow.c will end up needing. At present it's below flow_table_compact() which happens not to need it, but that's likely to change. Move it to near the top of flow.c to avoid forward declarations. Code motion only, no changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets	David Gibson	2024-01-22	5	-39/+27
\| \| \| \| \| \| \| \| \| \| \|	Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Better handling of number of epoll types	David Gibson	2024-01-22	2	-3/+5
\| \| \| \| \| \| \| \| \|	As we already did for flow types, use an "EPOLL_NUM_TYPES" instead of EPOLL_TYPE_MAX, which is a little bit safer and clearer. Add a static assert on the size of the matching names array. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add handling for per-flow timers	David Gibson	2024-01-22	4	-12/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the flow table so that it can run tcp_splice_timer() on each spliced connection. More generally, other flow types might want to run similar timers in future. We could add a flow_timer() analogous to tcp_timer(), udp_timer() etc. However, this would need to scan the flow table, which we would have just done in flow_defer_handler(). We'd prefer to just scan the flow table once, dispatching both per-flow deferred events and per-flow timed events if necessary. So, extend flow_defer_handler() to do this. For now we use the same timer interval for all flow types (1s). We can make that more flexible in future if we need to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add flow-centric dispatch for deferred flow handling	David Gibson	2024-01-22	5	-17/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_defer_handler(), amongst other things, scans the flow table and does some processing for each TCP connection. When we add other protocols to the flow table, they're likely to want some similar scanning. It makes more sense for cache friendliness to perform a single scan of the flow table and dispatch to the protocol specific handlers, rather than having each protocol separately scan the table. To that end, add a new flow_defer_handler() handling all flow-linked deferred operations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Move per-type cleanup logic into per-type helpers	David Gibson	2024-01-22	3	-10/+14
\| \| \| \| \| \| \| \| \| \|	tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally on the connection being closed or closing. Move that logic into the "destroy" functions themselves, renaming them tcp_flow_defer() and tcp_splice_flow_defer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Remove redundant handling from tcp_timer()	David Gibson	2024-01-22	3	-19/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the connection table, expiring "tap" connections and calling tcp_splice_timer() for "splice" connections. tcp_splice_timer() expires spliced connections and then does some other processing. However, tcp_timer() is always called shortly after tcp_defer_handler() (from post_handler()), which also scans the flow table expiring both tap and spliced connections. So remove the redundant handling, and only do the extra tcp_splice_timer() work from tcp_timer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Standardise on 'now' for current timestamp variables	David Gibson	2024-01-22	7	-37/+37
\| \| \| \| \| \| \| \| \|	In a number of places we pass around a struct timespec representing the (more or less) current time. Sometimes we call it 'now', and sometimes we call it 'ts'. Standardise on the more informative 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make flow_table.h #include the protocol specific headers it needs	David Gibson	2024-01-22	4	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	flow_table.h, the lower level flow header relies on having the struct definitions for every protocol specific flow type - so far that means tcp_conn.h. It doesn't include it itself, so tcp_conn.h must be included before flow_table.h. That's ok for now, but as we use the flow table for more things, flow_table.h will need the structs for all of them, which means the protocol specific .c files would need to include tcp_conn.h _and_ the equivalents for every other flow type before flow_table.h every time, which is weird. So, although we mostly lean towards the include style where .c files need to handle the include dependencies, in this case it makes more sense to have flow_table.h include all the protocol specific headers it needs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Remove unused pif_name() function	David Gibson	2024-01-16	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	pif_name() has no current callers, although we expect some as we expand the flow table support. I'm not sure why this didn't get caught by one of our static checkers earlier, but it's now causing cppcheck failures for me. Add a cppcheck suppression. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Make a bunch of pointer variables pointers to const	David Gibson	2024-01-16	11	-34/+41
\| \| \| \| \| \| \| \| \| \|	Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another warning for pointer variables which could be pointer to const but aren't. Use this to make a bunch of variables const pointers where they previously weren't for no particular reason. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Fix passt.mbuto for cases where /usr/sbin doesn't exist	David Gibson	2024-01-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	f0ccca74 ("test: make passt.mbuto script more robust") is supposed to make mbuto more robust by standardizing on always putting things in /usr/sbin with /sbin a symlink to it. This matters because different distros have different conventions about how the two are used. However, the logic there requires that /usr/sbin at least exists to start with. This isn't always the case with Fedora derived mbuto images. Ironically the DIRS variable ensures that /sbin exists, although we then remove it, but doesn't require /usr/sbin to exist. Fix that up so that the new logic will work with Fedora. Fixes: f0ccca741f64 ("test: make passt.mbuto script more robust") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Fetch most specific (longest prefix) address in nl_addr_get() 2023_12_30.f091893	Stefano Brivio	2023-12-30	1	-5/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This happened in most cases implicitly before commit eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions"): while going through results from netlink, we would only copy an address into the provided return buffer if no address had been picked yet. Because of the insertion logic in the kernel (ipv6_link_dev_addr()), the first returned address would also be the one added last, and, in case of a Linux guest using a DHCPv6 client as well as SLAAC, that would be the address assigned via DHCPv6, because SLAAC happens before the DHCPv6 exchange. The effect of, instead, picking the last returned address (first assigned) is visible when passt or pasta runs nested, given that, by default, they advertise a prefix for SLAAC usage, plus an address via DHCPv6. The first level (L1 guest) would get a /64 address by means of SLAAC, and a /128 address via DHCPv6, the latter matching the address on the host. The second level (L2 guest) would also get two addresses: a /64 via SLAAC (same prefix as the host), and a /128 via DHCPv6, matching the L1 SLAAC-assigned address, not the one obtained via DHCPv6. That is, none of the L2 addresses would match the address on the host. The whole point of having a DHCPv6 server is to avoid (implicit) NAT when possible, though. Fix this in a more explicit way than the behaviour we initially had: pick the first address among the set of most specific ones, by comparing prefix lengths. Do this for IPv4 and for link-local addresses, too, to match in any case the implementation of the default source address selection. Reported-by: Yalan Zhang <yalzhang@redhat.com> Fixes: eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
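The selection rule described above (first address among those with the longest prefix) can be sketched as follows. This is only an illustration over an in-memory array: struct addr_entry and addr_pick() are hypothetical names, and the real nl_addr_get() works on netlink responses rather than on such an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of one address reported for an interface */
struct addr_entry {
	const char *addr;		/* textual address, for illustration */
	unsigned char prefix_len;	/* e.g. 64 for SLAAC, 128 for DHCPv6 */
};

/* Return the first entry with the longest prefix length, NULL if @n is 0.
 * The strict '>' comparison keeps the earliest entry among equals.
 */
static const struct addr_entry *addr_pick(const struct addr_entry *a, size_t n)
{
	const struct addr_entry *best = NULL;
	size_t i;

	assert(a || !n);

	for (i = 0; i < n; i++) {
		if (!best || a[i].prefix_len > best->prefix_len)
			best = &a[i];
	}

	return best;
}
```

Given a /64 entry followed by two /128 entries, this returns the first /128 one, regardless of the order in which the /64 appears.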
*	README: Default SLAAC prefix comes from address (not prefix) on host	Stefano Brivio	2023-12-30	1	-7/+7
\| \| \| \| \|	Reported-by: Yalan Zhang <yalzhang@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Fix broken link to CentOS Stream package	Stefano Brivio	2023-12-30	1	-1/+1
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: make passt.mbuto script more robust	Jon Paul Maloy	2023-12-27	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \|	Creation of a symbolic link from /sbin to /usr/sbin fails if /sbin exists and is non-empty. This is the case on Ubuntu-23.04. We fix this by removing /sbin before creating the link. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: make tcp_sock_set_bufsize() static (again)	Laurent Vivier	2023-12-27	2	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \|	e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation") has exported tcp_sock_set_bufsize() to be able to use it in tcp_splice.c, but 6ccab72d9b40 has removed its use in tcp_splice.c, so we can set it static again. Fixes: 6ccab72d9b40 ("tcp: Improve handling of fallback if socket pool is empty on new splice") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Make sock_l4() treat empty string ifname like NULL	David Gibson	2023-12-27	3	-16/+7
\| \| \| \| \| \| \| \| \| \| \|	sock_l4() takes NULL for ifname if you don't want to bind the socket to a particular interface. However, for a number of the callers, it's more natural to use an empty string for that case. Change sock_l4() to accept either NULL or an empty string equivalently, and simplify some callers using that change. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Avoid in_addr_t	David Gibson	2023-12-27	2	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	IPv4 addresses can be stored in an in_addr_t or a struct in_addr. The former is just a type alias to a 32-bit integer, so doesn't really give us any type checking. Therefore we generally prefer the structure, since we mostly want to treat IP address as opaque objects. Fix a few places where we still use in_addr_t, but can just as easily use struct in_addr. Note there are still some uses of in_addr_t in conf.c, but those are justified: since they're doing prefix calculations, they actually need to look at the internals of the address as an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
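The distinction drawn above can be illustrated with a netmask calculation, the kind of prefix arithmetic where looking at the address as a 32-bit integer is justified, with the result handed back as an opaque struct in_addr. The function name prefix_mask() is hypothetical and this is not the code from conf.c.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>

/* Hypothetical helper: build the IPv4 netmask for @prefix_len (0 to 32).
 * in_addr_t is used only for the integer arithmetic; callers get a
 * struct in_addr they can pass around without looking inside.
 */
static struct in_addr prefix_mask(int prefix_len)
{
	struct in_addr mask;
	in_addr_t m = 0;

	assert(prefix_len >= 0 && prefix_len <= 32);

	/* Shifting a 32-bit value by 32 is undefined, so handle 0 apart */
	if (prefix_len)
		m = htonl(UINT32_C(0xffffffff) << (32 - prefix_len));

	mask.s_addr = m;
	return mask;
}
```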
*	icmp: Avoid unnecessary handling of unspecified bind address	David Gibson	2023-12-27	1	-12/+4
\| \| \| \| \| \| \| \| \| \|	We go to some trouble, if the configured output address is unspecified, to pass NULL to sock_l4(). But while passing NULL is one way to get sock_l4() not to specify a bind address, passing the "any" address explicitly works too. Use this to simplify icmp_tap_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Drop explicit setting to INADDR_ANY/in6addr_any in sock_l4()	David Gibson	2023-12-27	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The original commit message says: --- Currently we initialise the address field of the sockaddrs we construct to the any/unspecified address, but not in a very clear way: we use explicit 0 values, which is only interpretable if you know the order of fields in the sockaddr structures. Use explicit field names, and explicit initialiser macros for the address. Because we initialise to this default value, we don't need to explicitly set the any/unspecified address later on if the caller didn't pass an overriding bind address. --- and the original patch modified the initialisation of addr4 and addr6: - instead of { 0 }, { 0 } for sin_addr and sin_zero, .sin_addr = IN4ADDR_ANY_INIT - instead of 0, IN6ADDR_ANY_INIT, 0: .sin6_addr = IN6ADDR_ANY_INIT but I dropped those hunks: they break gcc versions 7 to 9 as reported in eed6933e6c29 ("udp: Explicitly initialise sin6_scope_id and sin_zero in sockaddr_in{,6}"). I applied the rest of the changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Dropped first two hunks] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Use htonl_constant() in more places	David Gibson	2023-12-27	1	-2/+2
\| \| \| \| \| \| \| \|	We might as well when we're passing a known constant value, giving the compiler the best chance to optimise things away. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Add IN4ADDR_ANY_INIT macro	David Gibson	2023-12-27	2	-1/+4
\| \| \| \| \| \| \| \| \| \|	We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address, make a similar one for the unspecified / any address. This avoids messing with the internal structure of struct in_addr where we don't care about it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use IN4ADDR_LOOPBACK_INIT more widely	David Gibson	2023-12-27	3	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address without delving into its internals. However there are some places we don't use it, and explicitly look at the internal structure of struct in_addr, which we generally want to avoid. Use the define more widely to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Fix address type for tcp_sock_init_af()	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This takes a struct in_addr * (i.e. an IPv4 address), although it's explicitly supposed to handle IPv6 as well. Both its caller and sock_l4() which it calls use a void * for the address, which can be either an in_addr or an in6_addr. We get away with this, because we don't do anything with the pointer other than transfer it from the caller to sock_l4(), but it's misleading. And quite possibly technically UB, because C is like that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Don't use linux/icmp.h when netinet/ip_icmp.h will do	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \|	In most places where we need to get ICMP definitions, we get them from <netinet/ip_icmp.h>. However in checksum.c we instead include <linux/icmp.h>. Change it to use <netinet/ip_icmp.h> for consistency. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't account for hash table size in tcp_hash()	David Gibson	2023-12-27	1	-13/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp_hash() returns the hash bucket for a value, that is the hash modulo the size of the hash table. Usually it's a bit more flexible to have hash functions return a "raw" hash value and perform the modulus in the callers. That allows the same hash function to be used for multiple tables of different sizes, or to re-use the hash for other purposes. We don't do anything like that with tcp_hash() at present, but we have some plans to do so. Prepare for that by making tcp_hash() and tcp_conn_hash() return raw hash values. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
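The split between a "raw" hash and the per-table modulus can be sketched like this. The hash below is 32-bit FNV-1a, used purely as a stand-in and not as the function tcp_hash() relies on; raw_hash() and hash_bucket() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in raw hash (32-bit FNV-1a): knows nothing about any table size */
static uint32_t raw_hash(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t h = UINT32_C(2166136261);	/* FNV offset basis */
	size_t i;

	assert(data || !len);

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= UINT32_C(16777619);	/* FNV prime */
	}

	return h;
}

/* The caller reduces the raw value to a bucket for its own table */
static unsigned hash_bucket(uint32_t hash, unsigned table_size)
{
	return hash % table_size;
}
```

The same raw value can then index tables of different sizes, for example hash_bucket(h, 1024) for one table and hash_bucket(h, 4096) for another, without recomputing the hash.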
*	tcp: Implement hash table with indices rather than pointers	David Gibson	2023-12-27	2	-11/+33
\| \| \| \| \| \| \| \| \| \| \| \| \|	We implement our hash table with pointers to the entry for each bucket (or NULL). However, the entries are always allocated within the flow table, meaning that a flow index will suffice, halving the size of the hash table. For TCP, just a flow index would be enough, but future uses will want to expand the hash table to cover indexing either side of a flow, so use a flow_sidx_t as the type for each hash bucket. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Switch hash table to linear probing instead of chaining	David Gibson	2023-12-27	3	-56/+81
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we deal with hash collisions by letting a hash bucket contain multiple entries, forming a linked list using an index in the connection structure. That's a pretty standard and simple approach, but in our case we can use an even simpler one: linear probing. Here if a hash bucket is occupied we just move on to the next one until we find a free one. This slightly simplifies lookup and more importantly saves some precious bytes in the connection structure by removing the need for a link. It does require some additional complexity for hash removal. This approach can perform poorly when hash table load is high. However, we already size our hash table of pointers larger than the connection table, which puts an upper bound on the load. It's relatively cheap to decrease that bound if we find we need to. I adapted the linear probing operations from Knuth's The Art of Computer Programming, Volume 3, 2nd Edition. Specifically Algorithm L and Algorithm R in Section 6.4. Note that there is an error in Algorithm R as printed, see errata at [0]. [0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
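The extra complexity on removal can be seen in a small sketch of linear probing. This is not the passt code: the table size, the EMPTY sentinel, the toy home_bucket() hash and all function names are assumptions for illustration, and it probes upwards, which need not match the direction used in Knuth's presentation. Removal shifts later entries of the same cluster back so that no lookup is cut short by the new hole.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE	8		/* toy size, for illustration only */
#define EMPTY		UINT32_MAX	/* sentinel: bucket is unused */

static uint32_t table[TABLE_SIZE];
static size_t count;

static void table_init(void)
{
	size_t i;

	for (i = 0; i < TABLE_SIZE; i++)
		table[i] = EMPTY;
	count = 0;
}

/* Toy hash, chosen so that collisions are easy to provoke */
static size_t home_bucket(uint32_t key)
{
	return key % TABLE_SIZE;
}

/* Return the bucket holding key, or the empty bucket where it would go.
 * Terminates because insertion always leaves at least one bucket empty.
 */
static size_t probe(uint32_t key)
{
	size_t b = home_bucket(key);

	while (table[b] != EMPTY && table[b] != key)
		b = (b + 1) % TABLE_SIZE;

	return b;
}

static bool table_insert(uint32_t key)
{
	size_t b;

	if (key == EMPTY)
		return false;

	b = probe(key);
	if (table[b] == key)
		return true;			/* already present */
	if (count >= TABLE_SIZE - 1)
		return false;			/* keep one bucket free */

	table[b] = key;
	count++;
	return true;
}

static bool table_contains(uint32_t key)
{
	return key != EMPTY && table[probe(key)] == key;
}

static void table_remove(uint32_t key)
{
	size_t hole, s;

	if (key == EMPTY)
		return;

	hole = probe(key);
	if (table[hole] != key)
		return;				/* not present */

	/* Walk the rest of the cluster. An entry at s may stay only if its
	 * home bucket lies cyclically in (hole, s]; otherwise a lookup
	 * starting at its home would stop at the hole, so move it back.
	 */
	for (s = (hole + 1) % TABLE_SIZE; table[s] != EMPTY;
	     s = (s + 1) % TABLE_SIZE) {
		size_t h = home_bucket(table[s]);
		bool stays = hole < s ? (h > hole && h <= s)
				      : (h > hole || h <= s);

		if (!stays) {
			table[hole] = table[s];
			hole = s;
		}
	}

	table[hole] = EMPTY;
	count--;
}
```

Keeping at least one bucket empty is a simple way to guarantee that probing terminates; it plays the same role as sizing the hash table larger than the connection table.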
*	tcp: Fix conceptually incorrect byte-order switch in tcp_tap_handler()	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_hash_lookup() expects the port numbers in host order, but the TCP header, of course, has them in network order, so we need to switch them. However we call htons() (host to network) instead of ntohs() (network to host). This works because those do the same thing in practice (they only wouldn't on very strange theoretical platforms which are neither big nor little endian). But, having this the "wrong" way around is misleading, so switch it around. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
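A minimal sketch of the distinction, assuming a hypothetical helper port_from_wire() that is not part of passt: a port read from a TCP header is in network order, so the conceptually correct conversion is ntohs(), even though htons() yields the same bits on both big-endian and little-endian hosts.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical helper: convert a port as found in a TCP header (network
 * byte order) to host byte order, e.g. before using it as a lookup key.
 */
static uint16_t port_from_wire(uint16_t wire_port)
{
	return ntohs(wire_port);	/* network to host, not htons() */
}
```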
*	README: Update "Availability" section	Stefano Brivio	2023-12-27	1	-13/+11
\| \| \| \| \| \| \| \| \|	It's been a while -- there are now official packages for Arch Linux, Gentoo, Void Linux. Suggested-by: Rahil Bhimjiani <me@rahil.website> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Cast timeval fields to unsigned long long for printing	Stefano Brivio	2023-12-27	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \|	On x32, glibc defines time_t and suseconds_t (the latter, also known as __syscall_slong_t) as unsigned long long, whereas "everywhere else", including x86_64 and i686, those are unsigned long. See also https://sourceware.org/bugzilla/show_bug.cgi?id=16437 for all the gory details. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
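One way to print these fields portably, sketched here with a hypothetical format_timeval() helper rather than the actual passt call site, is to cast both fields to unsigned long long and use %llu, so that the format string matches the argument type regardless of how wide time_t and suseconds_t are on a given ABI. The sketch assumes non-negative values.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical helper: format a non-negative timeval as
 * seconds.microseconds. The casts make the arguments match %llu
 * whatever the underlying field types are.
 */
static int format_timeval(char *buf, size_t len, const struct timeval *tv)
{
	return snprintf(buf, len, "%llu.%06llu",
			(unsigned long long)tv->tv_sec,
			(unsigned long long)tv->tv_usec);
}
```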