passt - Plug A Simple Socket Transport

	Commit message	Author	Age	Files	Lines
*	flow: Avoid moving flow entries to compact table	David Gibson	2024-01-22	1	-11/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we always keep the flow table maximally compact: that is, all the active entries are contiguous at the start of the table. Doing this sometimes requires moving an entry when one is freed. That's kind of fiddly, and potentially expensive: it requires updating the hash table for the new location, and depending on flow type, it may require EPOLL_CTL_MOD system calls to update epoll tags with the new location too. Implement a new way of managing the flow table that doesn't ever move entries. It attempts to maintain some compactness by always using the first free slot for a new connection, and mitigates the effect of non-compactness by cheaply skipping over contiguous blocks of free entries. See the "theory of operation" comment in flow.c for details. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid Coverity Scan false positive] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
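The idea of never moving entries can be illustrated with a much simplified sketch. This is not the code from flow.c: the names (`struct entry`, `sketch_alloc()`, `sketch_scan()`), the tiny `FLOW_MAX` and the cluster layout are assumptions made for illustration. Free entries are grouped into clusters, the head of each cluster records its length and the next cluster, allocation takes the lowest free slot, and the scan skips a whole cluster in one step while rebuilding the free list.

```c
#include <assert.h>
#include <stddef.h>

#define FLOW_MAX 8	/* tiny table, for illustration only */

enum { F_FREE = 0, F_ACTIVE, F_CLOSED };

struct entry {
	int state;
	unsigned n;	/* cluster head only: free entries in this cluster */
	unsigned next;	/* cluster head only: next cluster head, or FLOW_MAX */
};

static struct entry tab[FLOW_MAX];
static unsigned first_free;

/* Start with a single free cluster spanning the whole table */
static void table_init(void)
{
	unsigned i;

	for (i = 0; i < FLOW_MAX; i++)
		tab[i].state = F_FREE;
	tab[0].n = FLOW_MAX;
	tab[0].next = FLOW_MAX;
	first_free = 0;
}

/* Take the lowest free slot: return its index, or -1 if the table is full */
static int sketch_alloc(void)
{
	unsigned i = first_free;

	if (i >= FLOW_MAX)
		return -1;

	if (tab[i].n > 1) {
		/* Shrink the cluster: the following entry becomes its head */
		assert(i + 1 < FLOW_MAX);
		tab[i + 1].n = tab[i].n - 1;
		tab[i + 1].next = tab[i].next;
		first_free = i + 1;
	} else {
		first_free = tab[i].next;
	}
	tab[i].state = F_ACTIVE;
	return (int)i;
}

/* Visit in-use entries, free the closed ones and rebuild the cluster list.
 * Entries are never moved. Returns the number of entries actually visited.
 */
static unsigned sketch_scan(void)
{
	unsigned *last_next = &first_free, visited = 0, i = 0;
	struct entry *cluster = NULL;	/* free cluster being grown, if any */

	while (i < FLOW_MAX) {
		struct entry *e = &tab[i];

		if (e->state == F_FREE) {
			unsigned skip = e->n;	/* skip the whole cluster */

			if (cluster) {
				cluster->n += skip;	/* merge with previous */
			} else {
				*last_next = i;
				last_next = &e->next;
				cluster = e;
			}
			i += skip;
			continue;
		}

		visited++;
		if (e->state == F_CLOSED) {
			e->state = F_FREE;
			if (cluster) {
				cluster->n++;
			} else {
				e->n = 1;
				*last_next = i;
				last_next = &e->next;
				cluster = e;
			}
		} else {
			cluster = NULL;
		}
		i++;
	}
	*last_next = FLOW_MAX;
	return visited;
}
```

In this sketch, entries are only freed from the scan, which is what lets the free list be rebuilt in index order in a single pass.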
*	flow: Enforce that freeing of closed flows must happen in deferred handlers	David Gibson	2024-01-22	1	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, flows are only ever finally freed (and the table compacted) from the deferred handlers. Some future ways we want to optimise managing the flow table will rely on this, so enforce it: rather than having the TCP code directly call flow_table_compact(), add a boolean return value to the per-flow deferred handlers. If true, this indicates that the flow code itself should free the flow. This forces all freeing of flows to occur during the flow code's scan of the table in flow_defer_handler() which opens possibilities for future optimisations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
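A minimal sketch of the pattern described above, with hypothetical names (`struct flow_sketch`, `flow_sketch_defer()`, `table_defer_scan()`) that are not taken from passt: the per-flow handler only reports whether the flow should go away, and the table scan is the single place that frees it. Here "freeing" just marks the slot free, whereas the commit above still compacted the table at that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_state { S_FREE = 0, S_OPEN, S_CLOSED };

struct flow_sketch {
	enum sketch_state state;
};

/* Per-flow deferred handler: returns true if the caller should free the flow */
static bool flow_sketch_defer(const struct flow_sketch *f)
{
	return f->state == S_CLOSED;
}

/* Single scan of the table, the only place where flows are freed.
 * Returns the number of flows freed.
 */
static unsigned table_defer_scan(struct flow_sketch *tab, size_t n)
{
	unsigned freed = 0;
	size_t i;

	assert(tab);
	for (i = 0; i < n; i++) {
		if (tab[i].state == S_FREE)
			continue;

		if (flow_sketch_defer(&tab[i])) {
			tab[i].state = S_FREE;
			freed++;
		}
	}
	return freed;
}
```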
*	flow: Move flow_count from context structure to a global	David Gibson	2024-01-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global. Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets	David Gibson	2024-01-22	1	-7/+9
\| \| \| \| \| \| \| \| \| \| \|	Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Move per-type cleanup logic into per-type helpers	David Gibson	2024-01-22	1	-3/+6
\| \| \| \| \| \| \| \| \| \|	tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally on the connection being closed or closing. Move that logic into the "destroy" functions themselves, renaming them tcp_flow_defer() and tcp_splice_flow_defer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Remove redundant handling from tcp_timer()	David Gibson	2024-01-22	1	-5/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_timer() scans the connection table, expiring "tap" connections and calling tcp_splice_timer() for "splice" connections. tcp_splice_timer() expires spliced connections and then does some other processing. However, tcp_timer() is always called shortly after tcp_defer_handler() (from post_handler()), which also scans the flow table expiring both tap and spliced connections. So remove the redundant handling, and only do the extra tcp_splice_timer() work from tcp_timer(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make flow_table.h #include the protocol specific headers it needs	David Gibson	2024-01-22	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	flow_table.h, the lower level flow header relies on having the struct definitions for every protocol specific flow type - so far that means tcp_conn.h. It doesn't include it itself, so tcp_conn.h must be included before flow_table.h. That's ok for now, but as we use the flow table for more things, flow_table.h will need the structs for all of them, which means the protocol specific .c files would need to include tcp_conn.h _and_ the equivalents for every other flow type before flow_table.h every time, which is weird. So, although we mostly lean towards the include style where .c files need to handle the include dependencies, in this case it makes more sense to have flow_table.h include all the protocol specific headers it needs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Make a bunch of pointer variables pointers to const	David Gibson	2024-01-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another warning for pointer variables which could be pointer to const but aren't. Use this to make a bunch of variables const pointers where they previously weren't for no particular reason. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use IN4ADDR_LOOPBACK_INIT more widely	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to the loopback address without delving into its internals. However there are some places we don't use it, and explicitly look at the internal structure of struct in_addr, which we generally want to avoid. Use the define more widely to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow,tcp: Use epoll_ref type including flow and side	David Gibson	2023-12-04	1	-25/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently TCP uses the 'flow' epoll_ref field for both connected sockets and timers, which consists of just the index of the relevant flow (connection). This is just fine for timers, but while it obviously works, it's subtly incomplete for sockets on spliced connections. In that case we want to know which side of the connection the event is occurring on as well as which connection. At present, we deduce that information by looking at the actual fd, and comparing it to the fds of the sockets on each side. When we use the flow table for more things, we expect more cases where something will need to know a specific side of a specific flow for an event, but nothing more. Therefore add a new 'flowside' epoll_ref field, with exactly that information. We use it for TCP connected sockets. This allows us to directly know the side for spliced connections. For "tap" connections, it's pretty meaningless, since the side is always the socket side. It still makes logical sense though, and it may become important for future flow table work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Use unsigned to represent side	David Gibson	2023-12-04	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, we use 'int' values to represent the "side" of a connection, which must always be 0 or 1. This turns out to be dangerous. In some cases we're going to want to put the side into a 1-bit bitfield. However, if that bitfield has type 'int', when we copy it out to a regular 'int' variable, it will be sign-extended and so have values 0 and -1, instead of 0 and 1. To avoid this, always use unsigned variables for the side. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
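The sign-extension hazard described above can be reproduced with a small sketch. The structures and function names are hypothetical, not passt's. The signed variant uses `signed int` explicitly because the signedness of a plain `int` bit-field is implementation-defined. Storing 1 into a signed 1-bit field is itself an implementation-defined conversion: with GCC and Clang on common platforms the value read back is -1.

```c
#include <assert.h>

/* Hypothetical structures: a 1-bit "side" stored in two different ways */
struct ref_signed {
	signed int side : 1;	/* can only hold 0 and -1 */
};

struct ref_unsigned {
	unsigned int side : 1;	/* can only hold 0 and 1 */
};

/* Store a side in the signed bit-field, then copy it out to a plain int */
static int side_via_signed(unsigned side)
{
	struct ref_signed r;

	assert(side <= 1);
	r.side = side;		/* 1 does not fit: typically becomes -1 */
	return r.side;
}

/* Same round trip through the unsigned bit-field */
static unsigned side_via_unsigned(unsigned side)
{
	struct ref_unsigned r;

	assert(side <= 1);
	r.side = side;
	return r.side;
}
```

With the signed field, code such as `array[side]` or `side == 1` would silently misbehave for the second side, which is the reason for standardising on unsigned variables.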
*	flow,tcp: Generalise TCP epoll_ref to generic flows	David Gibson	2023-12-04	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	TCP uses three different epoll object types: one for connected sockets, one for timers and one for listening sockets. Listening sockets really need information that's specific to TCP, so need their own epoll_ref field. Timers and connected sockets, however, only need the connection (flow) they're associated with. As we expand the use of the flow table, we expect that to be true for more epoll fds. So, rename the "TCP" epoll_ref field to be a "flow" epoll_ref field that can be used both for TCP and for other future cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Add logging helpers for connection related messages	David Gibson	2023-12-04	1	-35/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Most of the messages logged by the TCP code (be they errors, debug or trace messages) are related to a specific connection / flow. We're fairly consistent about prefixing these with the type of connection and the connection / flow index. However there are a few places where we put the index later in the message or omit it entirely. The template with the prefix is also a little bulky to carry around for every message, particularly for spliced connections. To help keep this consistent, introduce some helpers to log messages linked to a specific flow. It takes the flow as a parameter and adds a uniform prefix to each message. This makes things slightly neater now, but more importantly will help keep formatting consistent as we add more things to the flow table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make unified version of flow table compaction	David Gibson	2023-12-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	tcp_table_compact() will move entries in the connection/flow table to keep it compact when other entries are removed. The moved entries need not have the same type as the flow removed, so it needs to be able to handle moving any type of flow. Therefore, move it to flow.c rather than being purportedly TCP specific. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Consolidate flow pointer<->index helpers	David Gibson	2023-12-04	1	-11/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index of their connection structures in the connection table, now become the unified flow table. We can easily combine these into a common helper. While we're there, add some trickery for some additional type safety. They also define their own CONN() versions, which aren't so easily combined since they need to return different types, but we can have them use a common helper. In the process, we standardise on always using an unsigned type to store the connection / flow index, which makes more sense. tcp.c's conn_at_idx() remains for now, but we change its parameter to unsigned to match. That in turn means we can remove a check for negative values from it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Move TCP connection table to unified flow table	David Gibson	2023-12-04	1	-9/+10
\| \| \| \| \| \| \| \| \| \| \| \| \|	We want to generalise "connection" tracking to things other than true TCP connections. Continue implementing this by renaming the TCP connection table to the "flow table" and moving it to flow.c. The definitions are split between flow.h and flow_table.h - we need this separation to avoid circular dependencies: the definitions in flow.h will be needed by many headers using the flow mechanism, but flow_table.h needs all those protocol specific headers in order to define the full flow table entry. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow, tcp: Generalise connection types	David Gibson	2023-12-04	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	Currently TCP connections use a 1-bit selector, 'spliced', to determine the rest of the contents of the structure. We want to generalise the TCP connection table to other types of flows in other protocols. Make a start on this by replacing the tcp_conn_common structure with a new flow_common structure with an enum rather than a simple boolean indicating the type of flow. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use 'z' length modifier for size_t/ssize_t conversions	Stefano Brivio	2023-12-02	1	-4/+4
\| \| \| \| \| \| \| \|	Types size_t and ssize_t are not necessarily long, it depends on the architecture. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
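As a short illustration of the 'z' length modifier: `%zu` matches size_t and `%zd` matches ssize_t whatever their width on the target, whereas `%lu` and `%ld` are only correct where those types happen to be long. The helper name `format_sizes()` below is made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>	/* ssize_t */

/* Format a length and a read()-style return value portably.
 * Returns what snprintf() returns: the length of the full message.
 */
static int format_sizes(char *buf, size_t buflen, size_t len, ssize_t rc)
{
	assert(buf && buflen > 0);

	/* %zu for size_t, %zd for ssize_t */
	return snprintf(buf, buflen, "len=%zu rc=%zd", len, rc);
}
```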
*	tcp_splice: Simplify selection of socket and pipe sides in socket handler	David Gibson	2023-11-07	1	-59/+22
\| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_splice_sock_handler() uses the tcp_splice_dir() helper to select which of the socket, pipe and counter fields to use depending on which side of the connection the socket event is coming from. Now that we are using arrays for the two sides, rather than separate named fields, we can instead just use a variable indicating the side and use that to index the arrays wherever we need a particular side's field. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Exploit side symmetry in tcp_splice_destroy()	David Gibson	2023-11-07	1	-18/+14
\| \| \| \| \| \| \| \| \|	tcp_splice_destroy() has some close-to-duplicated logic handling closing of the socket and pipes for each side of the connection. We can use a loop across the sides to reduce the duplication. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Exploit side symmetry in tcp_splice_connect_finish()	David Gibson	2023-11-07	1	-40/+25
\| \| \| \| \| \| \| \| \|	tcp_splice_connect_finish() has two very similar blocks opening the two pipes for each direction of the connection. We can deduplicate this with a loop across the two sides. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Exploit side symmetry in tcp_splice_timer()	David Gibson	2023-11-07	1	-16/+11
\| \| \| \| \| \| \| \| \|	tcp_splice_timer() has two very similar blocks one after another that handle the SO_RCVLOWAT flags for the two sides of the connection. We can deduplicate this with a loop across the two sides. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Rename sides of connection from a/b to 0/1	David Gibson	2023-11-07	1	-113/+111
\| \| \| \| \| \| \| \| \| \| \| \| \|	Each spliced connection has two mostly, although not entirely, symmetric sides. We currently call those "a" and "b" and have different fields in the connection structure for each one. We can better exploit that symmetry if we use two element arrays rather than separately named fields. Do that in the places we can, and for the others change the "a"/"b" terminology to 0/1 to match. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
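The benefit of indexing by side, rather than naming fields "a" and "b", can be sketched as follows. The structure and function names are hypothetical and much simpler than the real spliced connection structure. Once the per-side fields are two-element arrays, near-duplicate blocks collapse into a loop over the sides, and the opposite side of `side` is simply `!side`.

```c
#include <assert.h>
#include <unistd.h>

#define SIDES 2

struct splice_sketch {
	int s[SIDES];		/* socket for each side, -1 if none */
	int pipe[SIDES][2];	/* [side][0] is the read end, [side][1] the write end */
};

/* Mark every descriptor on both sides as unused */
static void splice_sketch_init(struct splice_sketch *c)
{
	unsigned side;

	assert(c);
	for (side = 0; side < SIDES; side++) {
		c->s[side] = -1;
		c->pipe[side][0] = c->pipe[side][1] = -1;
	}
}

/* Open one pipe per side with a single loop instead of two copied blocks */
static int splice_sketch_open_pipes(struct splice_sketch *c)
{
	unsigned side;

	for (side = 0; side < SIDES; side++) {
		if (pipe(c->pipe[side]))
			return -1;
	}
	return 0;
}

/* Close whatever is open on both sides, return how many fds were closed */
static unsigned splice_sketch_close(struct splice_sketch *c)
{
	unsigned side, closed = 0;
	int i;

	for (side = 0; side < SIDES; side++) {
		for (i = 0; i < 2; i++) {
			if (c->pipe[side][i] >= 0) {
				close(c->pipe[side][i]);
				c->pipe[side][i] = -1;
				closed++;
			}
		}
		if (c->s[side] >= 0) {
			close(c->s[side]);
			c->s[side] = -1;
			closed++;
		}
	}
	return closed;
}
```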
*	tcp_splice: Don't pool pipes in pairs	David Gibson	2023-11-07	1	-29/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To reduce latencies, the tcp splice code maintains a pool of pre-opened pipes to use for new connections. This is structured as an array of pairs of pipes, with each pipe, of course, being a pair of fds. Thus when we use the pool, a single pool "slot" provides both the a->b and b->a pipes. There's no strong reason to store the pool in pairs, though - we can with not much difficulty instead take the a->b and b->a pipes for a new connection independently from separate slots in the pool, or even take one from the pool and create the other as we need it, if there's only one pipe left in the pool. This marginally increases the length of code, but simplifies the structure of the pipe pool. We should be able to re-shrink the code with later changes, too. In the process we also fix some minor bugs: - If we both failed to find a pipe in the pool and to create a new one, we didn't log an error and would silently drop the connection. That could make debugging such a situation difficult. Add in an error message for that case. - When refilling the pool, if we were only able to open a single pipe in the pair, we attempted to rollback, but instead of closing the opened pipe, we instead closed the pipe we failed to open (probably leading to some ignored EBADFD errors). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Avoid awkward temporaries in tcp_splice_epoll_ctl()	David Gibson	2023-11-07	1	-13/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We initialise the events_a and events_b variables with tcp_splice_conn_epoll_events() function, then immediately copy the values into ev_a.events and ev_b.events. We can't simply pass &ev_[ab].events to tcp_splice_conn_epoll_events(), because struct epoll_event is packed, leading to 'pointer may be unaligned' warnings if we attempt that. We can, however, make tcp_splice_conn_epoll_events() take struct epoll_event pointers rather than raw u32 pointers, avoiding the awkward temporaries. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Remove unnecessary forward declaration	David Gibson	2023-11-07	1	-37/+34
\| \| \| \| \| \| \| \| \|	In tcp_splice.c we forward declare tcp_splice_epoll_ctl() then define it later on. However, there are no circular dependencies which prevent us from simply having the full definition in place of the forward declaration. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Don't handle EPOLL_CTL_DEL as part of tcp_splice_epoll_ctl()	David Gibson	2023-11-07	1	-8/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_splice_epoll_ctl() removes both sockets from the epoll set if called when conn->flags & CLOSING. This will always happen immediately after setting that flag, since conn_flag_do() makes the call itself. That's also the _only_ time it can happen: we perform the EPOLL_CTL_DEL without clearing the conn->in_epoll flag, meaning that any further calls to tcp_splice_epoll_ctl() would attempt EPOLL_CTL_MOD, which would necessarily fail since the fds are no longer in the epoll. The EPOLL_CTL_DEL path in tcp_splice_epoll_ctl() has essentially zero overlap with anything else the function does, so just move them to be open coded in conn_flag_do(). This does require kernel 2.6.9 or later, in order to pass NULL as the event structure for epoll_ctl(). However, we already require at least 3.13 to allow unprivileged user namespaces. Given that, simply directly perform the EPOLL_CTL_DEL operations from conn_flag_do() rather than unnecessarily multiplexini Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Correct error handling in tcp_splice_epoll_ctl()	David Gibson	2023-11-07	1	-9/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we get an error from epoll_ctl() in tcp_splice_epoll_ctl() we goto the 'delete' path where we remove both sockets from the epoll set and return an error. There are several problems with this: - We 'return -errno' after the EPOLL_CTL_DEL operations, which means the deleting epoll_ctl() calls may have overwritten the errno values which actually triggered the failures. - The call from conn_flag_do() occurs when the CLOSING flag is set, in which case we go to the delete path regardless of error. In that case the 'return errno' is meaningless since we don't expect the EPOLL_CTL_DEL operations to fail and we ignore the return code anyway. - All other calls to tcp_splice_epoll_ctl() check the return code and if non-zero immediately call conn_flag(..., CLOSING) which will call tcp_splice_epoll_ctl() again explicitly to remove the sockets from epoll. That means removing them when the error first occurs is redundant. - We never specifically report an error on the epoll_ctl() operations. We just set the connection to CLOSING, more or less silently killing it. This could make debugging difficult in the unlikely event that we get a failure here. Re-organise tcp_splice_epoll_ctl() to just log a message then return in the error case, and only EPOLL_CTL_DEL when explicitly asked to with the CLOSING flag. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
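The first problem in the list above, errno being overwritten by cleanup calls before it is returned, can be shown in isolation. This is a sketch with a hypothetical helper, not the tcp_splice_epoll_ctl() code: a failing call is simulated by setting errno directly, and `close(-1)` stands in for a cleanup call that fails and sets errno to EBADF.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Simulate "operation fails with 'err', then cleanup runs".
 * If 'save_first' is set, errno is captured before the cleanup call;
 * otherwise it is read afterwards, as in the problematic pattern.
 */
static int fail_then_cleanup(int err, int save_first)
{
	int saved;

	assert(err > 0);

	errno = err;		/* stands in for the call that failed */
	saved = errno;		/* capture the error that matters */

	close(-1);		/* cleanup: fails and sets errno to EBADF */

	return save_first ? -saved : -errno;
}
```

Calling `fail_then_cleanup(ENOENT, 0)` reports -EBADF, the cleanup's error, rather than the original -ENOENT.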
*	tcp_splice: Remove redundant tcp_splice_epoll_ctl()	David Gibson	2023-11-07	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_splice_conn_update() calls tcp_splice_epoll_ctl() twice: first ignoring the return value, then checking it. This serves no purpose. If the first call succeeds, the second call will do exactly the same thing again, since nothing has changed in conn. If the first call fails, then tcp_splice_epoll_ctl() itself will EPOLL_CTL_DEL both fds, meaning when the second call tries to EPOLL_CTL_MOD them it will necessarily fail. It appears that this duplication was introduced by accident in an otherwise unrelated patch. Fixes: bb708111 ("treewide: Packet abstraction with mandatory boundary checks") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Record originating pif in listening socket refs	David Gibson	2023-11-07	1	-4/+6
\| \| \| \| \| \| \| \| \| \|	For certain socket types, we record in the epoll ref whether they're sockets in the namespace, or on the host. We now have the notion of "pif" to indicate what "place" a socket is associated with, so generalise the simple one-bit 'ns' to a pif id. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Make many pointers const	David Gibson	2023-10-04	1	-2/+3
\| \| \| \| \| \| \| \| \|	Newer versions of cppcheck (as of 2.12.0, at least) added a warning for pointers which could be declared to point at const data, but aren't. Based on that, make many pointers throughout the codebase const. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	siphash: Use incremental rather than all-at-once siphash functions	David Gibson	2023-09-30	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have a bunch of variants of the siphash functions for different data sizes. The callers, in tcp.c, need to pack the various values they want to hash into a temporary structure, then call the appropriate version. We can avoid the copy into the temporary by directly using the incremental siphash functions. The length specific hash functions also have an undocumented constraint that the data pointer they take must, in fact, be aligned to avoid unaligned accesses, which may cause crashes on some architectures. So, prefer the incremental approach and remove the length-specific functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Avoid shadowing index(3)	David Gibson	2023-09-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A classic gotcha of the standard C library is that it's unwise to call any variable 'index' because it will shadow the standard string library function index(3). This can cause warnings from cppcheck amongst others, and it also means that if the variable is removed you tend to get confusing type errors (or sometimes nothing at all) instead of a nice simple "name is not defined" error. Strictly speaking this only occurs if <string.h> is included, but that is so common that as a rule it's best to just avoid it always. We have a number of places which hit this trap, so rename variables and parameters to avoid it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove broken pressure calculations for tcp_defer_handler()	David Gibson	2023-08-22	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_defer_handler() performs a potentially expensive linear scan of the connection table. So, to mitigate the cost of that we skip it if we're not under at least moderate pressure: either 30% of available connections or 30% (estimated) of available fds used. But, the calculation for this has been broken since it was introduced: we calculate "max_conns" based on c->tcp.conn_count, not TCP_MAX_CONNS, meaning we only exit early if conn_count is less than 30% of itself, i.e. never. If that calculation is "corrected" to be based on TCP_MAX_CONNS, it completely tanks the TCP CRR times for passt - from ~60ms to >1000ms on my laptop. My guess is that this is because in the case of many short lived connections, we're letting the table become much fuller before compacting it. That means that other places which perform a table scan now have to do much, much more. For the time being, simply remove the tests, since they're not doing anything useful. We can reintroduce them more carefully if we see a need for them. This also removes the only user of c->tcp.splice_conn_count, so that can be removed as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
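The arithmetic behind "less than 30% of itself, i.e. never" can be checked with a small sketch. The function names, the exact form of the expression and the value chosen for the table size are assumptions for illustration, not the removed code. The second function shows the "corrected" threshold, which the commit deliberately does not adopt because of its effect on CRR times.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CONNS_SKETCH	(128 * 1024)	/* assumed table size */
#define LOW_PRESSURE_PCT	30

/* As described: the threshold is derived from the current count itself,
 * so "count < 30% of count" can never be true.
 */
static bool skip_scan_as_written(unsigned conn_count)
{
	unsigned max_conns = conn_count / 100 * LOW_PRESSURE_PCT;

	assert(conn_count <= MAX_CONNS_SKETCH);
	return conn_count < max_conns;
}

/* "Corrected" variant: threshold derived from the table size instead */
static bool skip_scan_corrected(unsigned conn_count)
{
	unsigned max_conns = MAX_CONNS_SKETCH / 100 * LOW_PRESSURE_PCT;

	assert(conn_count <= MAX_CONNS_SKETCH);
	return conn_count < max_conns;
}
```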
*	tcp: Move in_epoll flag out of common connection structure	David Gibson	2023-08-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The in_epoll boolean is one of only two fields (currently) in the common structure shared between tap and spliced connections. It seems like it belongs there, because both tap and spliced connections use it, and it has roughly the same meaning. Roughly, however, isn't exactly: which fds this flag says are in the epoll varies between the two connection types, and are in type specific fields. So, it's only possible to meaningfully use this value locally in type specific code anyway. This common field is going to get in the way of more widespread generalisation of connection / flow tracking, so move it to separate fields in the tap and splice specific structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Split handling of listening TCP sockets into their own handler	David Gibson	2023-08-13	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_sock_handler() handles both listening TCP sockets, and connected TCP sockets, but what it needs to do in those cases has essentially nothing in common. Therefore, give listening sockets their own epoll_type value and dispatch directly to their own handler from the top level. Furthermore, the two handlers need essentially entirely different information from the reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate the port for the listening socket, but that's not the same meaning. So, switch listening sockets to their own reference type which we can lay out as we please. That lets us remove the listen and outbound fields from the normal (connected) tcp_epoll_ref, reducing it to just the connection table index. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Generalize epoll_ref to cover things other than sockets	David Gibson	2023-08-13	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The epoll_ref type includes fields for the IP protocol of a socket, and the socket fd. However, we already have a few things in the epoll which aren't protocol sockets, and we may have more in future. Rename these fields to an abstract "fd type" and file descriptor for more generality. Similarly, rather than using existing IP protocol numbers for the type, introduce our own number space. For now these just correspond to the supported protocols, but we'll expand on that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use C11 anonymous members to make poll refs less verbose to use	David Gibson	2023-08-04	1	-6/+5
\| \| \| \| \| \| \| \| \| \| \| \|	union epoll_ref has a deeply nested set of structs and unions to let us subdivide it into the various different fields we want. This means that referencing elements can involve an awkward long string of intermediate fields. Using C11 anonymous structs and unions lets us do this less clumsily. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
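A reduced illustration of what C11 anonymous members buy here. The two unions below are hypothetical and far smaller than the real epoll_ref. They have the same layout, but in the second one the intermediate structs and unions are unnamed, so their members can be referenced directly.

```c
#include <assert.h>
#include <stdint.h>

/* Every level named: access reads as ref.r.u.tcp.index */
union ref_nested {
	struct {
		int32_t fd;
		union {
			struct { uint32_t index; } tcp;
			uint32_t data;
		} u;
	} r;
	uint64_t u64;
};

/* Intermediate levels anonymous (C11): access reads as ref.tcp.index */
union ref_anon {
	struct {
		int32_t fd;
		union {
			struct { uint32_t index; } tcp;
			uint32_t data;
		};
	};
	uint64_t u64;
};

static_assert(sizeof(union ref_anon) == sizeof(union ref_nested),
	      "anonymous members don't change the layout");

static union ref_nested make_nested(int32_t fd, uint32_t index)
{
	union ref_nested ref = { .u64 = 0 };

	ref.r.fd = fd;
	ref.r.u.tcp.index = index;
	return ref;
}

static union ref_anon make_anon(int32_t fd, uint32_t index)
{
	union ref_anon ref = { .u64 = 0 };

	ref.fd = fd;
	ref.tcp.index = index;
	return ref;
}
```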
*	passt: Relicense to GPL 2.0, or any later version	Stefano Brivio	2023-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Work around weird false positives with cppcheck-2.9.1	David Gibson	2023-03-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 89e38f55 "treewide: Fix header includes to build with musl" added extra #includes to work with musl. Unfortunately with the cppcheck version I'm using (cppcheck-2.9-1.fc37.x86_64 in Fedora 37) this causes weird false positives: specifically cppcheck seems to hit a #error in <bits/unistd.h> complaining about including it directly instead of via <unistd.h> (which is not something we're doing). I have no idea why that would be happening; but I'm guessing it has to be a bug in the cpp implementation in that cppcheck version. In any case, it's possible to work around this by moving the include of <unistd.h> before the include of <signal.h>. So, do that. Fixes: 89e38f55405d ("treewide: Fix header includes to build with musl") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Fix header includes to build with musl	Chris Kuhn	2023-03-09	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Roughly inspired from a patch by Chris Kuhn: fix up includes so that we can build against musl: glibc is more lenient as headers generally include a larger amount of other headers. Compared to the original patch, I only included what was needed directly in C files, instead of adding blanket includes in local header files. It's a bit more involved, but more consistent with the current (not ideal) situation. Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls()	Stefano Brivio	2023-02-27	1	-8/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We use the return value of fls() as array index for debug strings. While fls() can return -1 (if no bit is set), Coverity Scan doesn't see that we're first checking the return value of another fls() call with the same bitmask, before using it. Call fls() once, store its return value, check it, and use the stored value as array index. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
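The "call once, store, check, use" pattern described above can be sketched as follows. This is a minimal illustration, not the passt code: `fls_sketch()`, `flag_str` and `flag_name()` are hypothetical names, and the only property taken from the commit message is that the bit-search helper returns -1 when no bit is set.

```c
#include <stddef.h>

/* Hypothetical debug strings, one per flag bit (illustration only) */
static const char *flag_str[] = { "FLAG_A", "FLAG_B", "FLAG_C", "FLAG_D" };
#define FLAG_COUNT (sizeof(flag_str) / sizeof(flag_str[0]))

/* Index of the most significant set bit, or -1 if no bit is set.
 * Stand-in for fls() with the contract the commit message describes. */
static int fls_sketch(unsigned long x)
{
	int n = -1;

	while (x) {
		x >>= 1;
		n++;
	}

	return n;
}

/* Call the helper once, store the result, check it, and only then use
 * the stored value as an array index, so a static analyser can see the
 * index is always in range. */
static const char *flag_name(unsigned long mask)
{
	int bit = fls_sketch(mask);

	if (bit < 0 || (size_t)bit >= FLAG_COUNT)
		return "(none)";

	return flag_str[bit];
}
```

Calling the helper twice with the same mask would be functionally equivalent, but the analyser cannot connect the checked result with the one used as an index, which is what triggers the false positive.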
*	tcp: Improve handling of fallback if socket pool is empty on new splice	David Gibson	2023-02-14	1	-59/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When creating a new spliced connection, we need to get a socket in the other ns from the originating one. To avoid excessive ns switches we usually get these from a pool refilled on a timer. However, if the pool runs out we need a fallback. Currently that's done by passing -1 as the socket to tcp_splice_connect() and running it in the target ns. This means that tcp_splice_connect() itself needs to have different cases depending on whether it's given an existing socket or not, which is a separate concern from what it's mostly doing. We change it to require a suitable open socket to be passed in, and ensure in the caller that we have one. This requires adding the fallback paths to the caller, tcp_splice_new(). We use slightly different approaches for a socket in the init ns versus the guest ns. This also means that we no longer need to run tcp_splice_connect() itself in the guest ns, which allows us to remove a bunch of boilerplate code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Split pool lookup from creating new sockets in tcp_conn_new_sock()	David Gibson	2023-02-14	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_conn_new_sock() first looks for a socket in a pre-opened pool, then if that's empty creates a new socket in the init namespace. Both parts of this are duplicated in other places: the pool lookup logic is duplicated in tcp_splice_new(), and the socket opening logic is duplicated in tcp_sock_refill_pool(). Split the function into separate parts so we can remove both these duplications. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
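The split described above, with pool lookup and socket creation as separate steps, might look roughly like this. It is a sketch under assumptions: `pool_take()`, `sock_open()` and `conn_sock()` are hypothetical names, and the pool is modelled as a plain array of file descriptors with -1 marking empty slots, which may not match the real data structure.

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Take the first available socket from a pre-opened pool, marking its
 * slot as empty. Returns -1 if the pool has no sockets left. */
static int pool_take(int *pool, int size)
{
	int i;

	for (i = 0; i < size; i++) {
		if (pool[i] >= 0) {
			int s = pool[i];

			pool[i] = -1;
			return s;
		}
	}

	return -1;
}

/* Open a fresh TCP socket of the given address family */
static int sock_open(int af)
{
	return socket(af, SOCK_STREAM, IPPROTO_TCP);
}

/* Combined behaviour: try the pool first, fall back to a new socket.
 * Callers that only need one half can call that half directly. */
static int conn_sock(int *pool, int size, int af)
{
	int s = pool_take(pool, size);

	return s >= 0 ? s : sock_open(af);
}
```

With the two halves separated, a pool refill routine can reuse `sock_open()` and a splice path can reuse `pool_take()` without duplicating either.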
*	tcp: Move socket pool declarations around	David Gibson	2023-02-14	1	-7/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_splice.c has some explicit extern declarations to access the socket pools. This is pretty dangerous - if we changed the type of these variables in tcp.c, we'd have tcp.c and tcp_splice.c using the same memory in different ways with no compiler error. So, move the extern declarations to tcp_conn.h so they're visible to both tcp.c and tcp_splice.c, but not the rest of pasta. In fact the pools for the guest namespace are necessarily only used by tcp_splice.c - we have no sockets on the guest side if we're not splicing. So move those declarations and the functions that deal exclusively with them to tcp_splice.c. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Make assertions actually useful	David Gibson	2023-02-12	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are some places in passt/pasta which #include <assert.h> and make various assertions. If we hit these something has already gone wrong, but they're there so that we get a useful message instead of cryptic misbehaviour if assumptions we thought were correct turn out not to be. Except... the glibc implementation of assert() uses syscalls that aren't in our seccomp filter, so we'll get a SIGSYS before it actually prints the message. Work around this by adding our own ASSERT() implementation using our existing err() function to log the message, and an abort(). The abort() probably also won't work exactly right with seccomp, but once we've printed the message, dying with a SIGSYS works just as well as dying with a SIGABRT. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
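An ASSERT() built on a logging function plus abort(), as described above, could be sketched like this. The macro body and message format are assumptions, and `err()` here is a stand-in writing to stderr, assumed to be printf-like; the real passt logger and macro may differ. `checked_index()` is a hypothetical caller.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the project's err() logger (assumption: printf-like) */
#define err(...)							\
	do {								\
		fprintf(stderr, __VA_ARGS__);				\
		fputc('\n', stderr);					\
	} while (0)

/* Log the failed expression with its location first, then abort().
 * The message is printed before any signal can terminate the process. */
#define ASSERT(expr)							\
	do {								\
		if (!(expr)) {						\
			err("ASSERTION FAILED in %s (%s:%d): %s",	\
			    __func__, __FILE__, __LINE__, #expr);	\
			abort();					\
		}							\
	} while (0)

/* Hypothetical caller: validate an index before using it */
static int checked_index(int i, int max)
{
	ASSERT(i >= 0 && i < max);

	return i;
}
```

The ordering is the point of the design: once the message has been logged, it no longer matters which signal ends the process.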
*	tcp: Pass union tcp_conn pointer to destroy and splice timer functions	Stefano Brivio	2022-11-25	1	-6/+10
\| \| \| \| \| \| \| \| \| \| \| \|	The pointers are actually the same, but we later pass the container union to tcp_table_compact(), which might zero the size of the whole union, and this confuses Coverity Scan. Given that we have pointers to the container union to start with, just pass those instead, all the way down to tcp_table_compact(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp_splice: Allow splicing of connections from IPv4-mapped loopback	David Gibson	2022-11-25	1	-8/+11
\| \| \| \| \| \| \| \| \| \| \|	For non-spliced connections we now treat IPv4-mapped IPv6 addresses the same as the corresponding IPv4 addresses. However currently we won't splice a connection from ::ffff:127.0.0.1 the way we would one from 127.0.0.1. Correct this so that we can splice connections from IPv4 localhost that have been received on an IPv6 dual stack socket. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
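A loopback check that treats ::ffff:127.0.0.1 like 127.0.0.1, as the commit above describes, can be sketched as follows. `is_loopback_any()` is a hypothetical helper, not the function used in tcp_splice.c, and matching all of 127.0.0.0/8 is an assumption of this sketch.

```c
#include <stdbool.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* True for ::1 and for IPv4-mapped addresses in 127.0.0.0/8 */
static bool is_loopback_any(const struct in6_addr *a)
{
	if (IN6_IS_ADDR_LOOPBACK(a))
		return true;

	/* In ::ffff:a.b.c.d, the IPv4 address occupies bytes 12..15 */
	if (IN6_IS_ADDR_V4MAPPED(a))
		return a->s6_addr[12] == 127;

	return false;
}
```

An address returned by accept() on a dual-stack IPv6 socket can be passed to this check directly, whether the peer connected over IPv4 or IPv6.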
*	tcp: Remove v6 flag from tcp_epoll_ref	David Gibson	2022-11-25	1	-5/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This bit in the TCP specific epoll reference indicates whether the connection is IPv6 or IPv4. However the sites which refer to it are already calling accept() which (optionally) returns an address for the remote end of the connection. We can use the sa_family field in that address to determine the connection type independent of the epoll reference. This does have a cost: for the spliced case, it means we now need to get that address from accept() which introduces an extra copy_to_user(). However, in future we want to allow handling IPv4 connections through IPv6 sockets, which means we won't be able to determine the IP version at the time we create the listening socket and epoll reference. So, at some point we'll have to pay this cost anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
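Deriving the IP version from the address that accept() returns, instead of from a flag stored at listen time, can be sketched like this. `sa_ip_version()` and `accept_with_version()` are hypothetical helpers written for illustration and are not taken from the passt sources.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Map the sa_family of a socket address to an IP version:
 * 6 for AF_INET6, 4 for AF_INET, 0 for anything else. */
static int sa_ip_version(const struct sockaddr *sa)
{
	switch (sa->sa_family) {
	case AF_INET6:
		return 6;
	case AF_INET:
		return 4;
	default:
		return 0;
	}
}

/* Accept a connection and report its IP version from the peer address.
 * Asking accept() for the address is the extra copy to user space that
 * the commit message mentions. Returns the new socket, or -1 on error. */
static int accept_with_version(int listen_fd, int *version)
{
	struct sockaddr_storage ss;
	socklen_t sl = sizeof(ss);
	int fd = accept(listen_fd, (struct sockaddr *)&ss, &sl);

	if (fd < 0)
		return -1;

	*version = sa_ip_version((const struct sockaddr *)&ss);

	return fd;
}
```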
*	inany: Helper functions for handling addresses which could be IPv4 or IPv6	David Gibson	2022-11-25	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	struct tcp_conn stores an address which could be IPv6 or IPv4 using a union. We can do this without an additional tag by encoding IPv4 addresses as IPv4-mapped IPv6 addresses. This approach is useful more widely than the specific place in tcp_conn, so expose a new 'union inany_addr' like this from a new inany.h. Along with that create a number of helper functions to make working with these "inany" addresses easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
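The tag-free encoding described above, where an IPv4 address is stored as an IPv4-mapped IPv6 address, can be sketched as follows. `union inany_sketch` and both helpers are illustrative stand-ins; the actual layout of 'union inany_addr' and the helper names in inany.h are not shown in this log and may differ.

```c
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* 16 bytes viewed either as an IPv6 address or as ::ffff:a.b.c.d */
union inany_sketch {
	struct in6_addr a6;
	struct {
		unsigned char zero[10];	/* all zeroes */
		unsigned char one[2];	/* 0xff 0xff */
		struct in_addr a4;	/* IPv4 address, network order */
	} v4mapped;
};

/* Store an address of family af (AF_INET or AF_INET6) */
static void inany_from_af(union inany_sketch *aa, int af, const void *addr)
{
	if (af == AF_INET6) {
		memcpy(&aa->a6, addr, sizeof(aa->a6));
	} else {
		memset(aa->v4mapped.zero, 0, sizeof(aa->v4mapped.zero));
		memset(aa->v4mapped.one, 0xff, sizeof(aa->v4mapped.one));
		memcpy(&aa->v4mapped.a4, addr, sizeof(aa->v4mapped.a4));
	}
}

/* IPv4 part if the stored address is IPv4-mapped, NULL otherwise.
 * The mapped prefix itself acts as the tag, so no extra field is needed. */
static const struct in_addr *inany_v4(const union inany_sketch *aa)
{
	if (!IN6_IS_ADDR_V4MAPPED(&aa->a6))
		return NULL;

	return &aa->v4mapped.a4;
}
```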